What's the workflow for curating the entries in the collection?

Daniel-Mietchen commented 4 months ago

While exploring @jhpoelen's test deposits at https://sandbox.zenodo.org/records/56163 ,


I noticed that the URL https://linker.bio/cut:hash://md5/6266556325eb929125a3376ffcc982dd!/b21871-24327 that is listed there under "Related works/ is derived from" has the entry


            "filename": "Aellen and Brosset - 1968 - ChiroptÃ¨res du sud du Congo (Brazzaville).pdf",

i.e. the non-ASCII character è in Chiroptères is scrambled in a way that may cause trouble downstream. This made me wonder what the expected curation workflows are for information like this that could be handled in the Zotero group or at any later step.


jhpoelen commented 4 months ago

For this specific example, it may be that the automated workflow garbled the character.

The original, as shown in Zotero:

Chiroptères du sud du Congo (Brazzaville)

jhpoelen commented 4 months ago

On closer inspection, the observed funny characters appear to be a browser rendering issue.

This claim is supported by results of

curl 'https://linker.bio/cut:hash://md5/6266556325eb929125a3376ffcc982dd!/b21871-24327'\
 > zotero-record.json

with attached zotero-record.json

being:

{
        "key": "YPKCPU9J",
        "version": 11477,
        "library": {
            "type": "group",
            "id": 5435545,
            "name": "Bat Literature Project",
            "links": {
                "alternate": {
                    "href": "https://www.zotero.org/groups/bat_literature_project",
                    "type": "text/html"
                }
            }
        },
        "links": {
            "self": {
                "href": "https://api.zotero.org/groups/5435545/items/YPKCPU9J",
                "type": "application/json"
            },
            "alternate": {
                "href": "https://www.zotero.org/groups/bat_literature_project/items/YPKCPU9J",
                "type": "text/html"
            },
            "up": {
                "href": "https://api.zotero.org/groups/5435545/items/RIPP6IX2",
                "type": "application/json"
            },
            "enclosure": {
                "type": "application/pdf",
                "href": "https://api.zotero.org/groups/5435545/items/YPKCPU9J/file/view",
                "title": "Aellen and Brosset - 1968 - Chiroptères du sud du Congo (Brazzaville).pdf",
                "length": 15895641
            }
        },
        "meta": {
            "createdByUser": {
                "id": 13229919,
                "username": "acsherman",
                "name": "",
                "links": {
                    "alternate": {
                        "href": "https://www.zotero.org/acsherman",
                        "type": "text/html"
                    }
                }
            },
            "numChildren": 0
        },
        "data": {
            "key": "YPKCPU9J",
            "version": 11477,
            "parentItem": "RIPP6IX2",
            "itemType": "attachment",
            "linkMode": "imported_file",
            "title": "Aellen and Brosset - 1968 - Chiroptères du sud du Congo (Brazzaville).pdf",
            "accessDate": "",
            "url": "",
            "note": "",
            "contentType": "application/pdf",
            "charset": "",
            "filename": "Aellen and Brosset - 1968 - Chiroptères du sud du Congo (Brazzaville).pdf",
            "md5": "067e9a9c946e10df2a474ab32f4a506d",
            "mtime": 1714759350458,
            "tags": [],
            "relations": {},
            "dateAdded": "2024-05-03T18:02:28Z",
            "dateModified": "2024-05-03T18:02:30Z"
        }
    }

Note that the special character in Aellen and Brosset - 1968 - Chiroptères du sud du Congo (Brazzaville).pdf is being rendered as expected.
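
A minimal sketch of how this kind of scrambling can arise, assuming (this is not confirmed in the thread) that the viewer decoded the UTF-8 bytes of the response as Windows-1252:

```python
# Assumption: the garbling comes from reading UTF-8 bytes as Windows-1252 (cp1252).
correct = "Chiroptères"

# "è" is stored as two bytes (0xC3 0xA8) in UTF-8.
utf8_bytes = correct.encode("utf-8")

# Decoding those bytes with the wrong codec turns "è" into "Ã¨".
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # ChiroptÃ¨res

# The stored bytes are unchanged, so reversing the wrong step recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == correct)  # True
```

If this is what happened, the data itself is intact and only the display step is at fault, which is consistent with the curl output above.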

But this leaves the original question: who is in charge of curating the Bat Lit Corpus and its associated curatorial workflows?

I'd elect @ajacsherman and Cullen for the position. They are familiar with Bat Literature and have extensive background in bat research.

ajacsherman commented 4 months ago

Hey Jorrit, if the PDFs that I add to Zotero populate the parent metadata automatically, I skim it and then move on. If I need to enter the parent group metadata manually, I will add the title, date, authors, and periodical information only. I have not been monitoring browser rendering issues, transcription errors, OCR errors, DOI links, etc. If this cannot be automated and needs to be added to my workflow, I suggest we come up with a way to batch report the issues at predictable time intervals for review. We will also need to outline how we will resolve these issues independently. Did our friends from TaxoDros have methods for this?

jhpoelen commented 4 months ago

Did our friends from TaxoDros have methods for this?

I shared my review notes with the TaxoDros curator, much as I am sharing them with you, and he addressed them in the next version. He has curated TaxoDros for decades and this is part of his process.

jhpoelen commented 4 months ago

I suggest we come up with a way to batch report the issues at predictable time intervals for review.

I much like your idea to outline a review process. What do you propose?

ajacsherman commented 4 months ago

How did he handle rendering issues, transcription errors, etc.? Manually? Was he merging shared libraries like we are?

jhpoelen commented 4 months ago

He managed his literature records in his own system and published the files as described in https://taxodros.github.io .

You'd have to ask him how he handles his curatorial process. I provided the review notes, and he curated the corpus and made a new version available when he was ready to do so.

ajacsherman commented 4 months ago

I suggest we come up with a way to batch report the issues at predictable time intervals for review. I much like your idea to outline a review process. What do you propose?

How do you identify errors on your end? I can produce a bibliography periodically and manually check for errors, but it seemed like you identified that one browser rendering issue that was not evident on the front end of Zotero. Are you able to address any of these errors during your indexing process if they are not clearly evident on my end (especially stable identifiers and other easily coded pathways)? Pardon my lack of knowledge, but is there a way to compare the metadata from the citation to the pdf corpus (similar to how Zotero extracts metadata to populate the parent group)? If so, we can automatically produce an error report with each batch of uploads.
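
One way such an error report could be produced automatically, as a rough sketch only. The function names `suspected_mojibake` and `batch_report` are hypothetical, the record layout is assumed to match the zotero-record.json shown earlier in this thread, and the heuristic only catches UTF-8 text that was misread as Windows-1252:

```python
def suspected_mojibake(text):
    """Return the likely intended text if `text` looks like UTF-8 misread
    as Windows-1252, otherwise None. Heuristic only, not a full validator."""
    try:
        candidate = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip fails for clean accented text and for other scripts.
        return None
    # Plain ASCII survives the round trip unchanged, so it is not flagged.
    return candidate if candidate != text else None


def batch_report(records, fields=("title", "filename")):
    """Collect suspicious field values without modifying the records."""
    issues = []
    for record in records:
        data = record.get("data", {})
        for field in fields:
            value = data.get(field, "")
            suggestion = suspected_mojibake(value)
            if suggestion is not None:
                issues.append({
                    "key": data.get("key"),
                    "field": field,
                    "found": value,
                    "suggested": suggestion,
                })
    return issues
```

Run on the record above, this would report nothing, since the title and filename are stored correctly. It only lists candidates for a curator to review and leaves the source data untouched.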

ajacsherman commented 4 months ago

https://taxodros.github.io/ does not document how they treated errors, insufficient metadata, or literature links.

jhpoelen commented 4 months ago

https://taxodros.github.io/ does not document how they treated errors, insufficient metadata, or literature links.

I agree that Gerhard Bächli did not spell out his methods for curating his data. He provided the data as documented on https://taxodros.github.io . I exchanged numerous emails with @myrmoteras and Gerhard Bächli as well as opening issues https://github.com/TaxoDros/TaxoDros.github.io/issues to document and address review notes. Many of these notes are planned to be addressed in the upcoming July release.

jhpoelen commented 4 months ago

I'd be happy to provide review notes for you to curate your data. In the end, it is up to you how you'd like to deal with errors and/or suspicious records.

ajacsherman commented 4 months ago

Jorrit, are you extracting metadata from Zotero directly or from an exported bibliography? If the latter, can we address some of the transcription errors before indexing? When you export your library, Zotero produces a CSV of the publications included in that library. Can we find and replace known transcription errors at this step?

jhpoelen commented 4 months ago

Jorrit, are you extracting metadata from Zotero directly or from an exported bibliography?

I extract data directly from Zotero using their API.

If the latter, can we address some of the transcription errors before indexing? When you export your library, Zotero produces a csv of the publications included in that library. Can we find and replace known transcription errors at this step?

In my mind, I version and translate the Zotero data such that Zenodo understands it. If some data corrections happen in the translation process, the provenance of the information becomes murky - How would someone be able to re-use the corpus if the original data contains errors, but they cannot be seen in some derived form (e.g., indexing in Zenodo)?

Thanks for the questions, and curious to hear your thoughts. Perhaps something to discuss tomorrow.
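
To illustrate the provenance concern, a minimal sketch of recording a suggested correction as a separate review note that points at the exact bytes of the original record, rather than changing the data during translation. The names `content_id` and `review_note` and the note layout are hypothetical; the `hash://md5/` prefix follows the identifier style seen in the linker.bio URL at the top of this thread:

```python
import hashlib


def content_id(raw_bytes):
    """Identify a record by the MD5 hash of its exact bytes."""
    return "hash://md5/" + hashlib.md5(raw_bytes).hexdigest()


def review_note(raw_bytes, field, found, suggested):
    """Describe a suspected error without altering the original record."""
    return {
        "about": content_id(raw_bytes),  # which version the note refers to
        "field": field,
        "found": found,
        "suggested": suggested,
    }


# Hypothetical record bytes, for illustration only.
raw = '{"title": "Exampel title"}'.encode("utf-8")
note = review_note(raw, "title", "Exampel title", "Example title")
print(note["about"])
```

With this split, the original stays reusable as published, and the curator decides whether a note leads to a fix in Zotero and a new version.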

arw36 commented 2 months ago

There is a lot of detail that was manually (and incompletely) cleaned from the original Google Sheets datasheet "reference_extract" before using Zotero. That sheet also retained the origin of each document in the corpus (e.g., pulled from citations or from a specific person's library).

jhpoelen commented 2 months ago

@arw36 thanks for sharing your notes on the available metadata for some papers. To help move this along, would it be an idea to create a separate issue for this that includes some specific examples of publications that are currently in batlit and can be enriched with available metadata?

Curious to hear your thoughts.

arw36 commented 2 months ago

Seemed relevant to the curation workflow. Feel free to move.

jhpoelen commented 2 months ago

@arw36 for sure! I followed the steps outlined in "feedback workflow" https://batlit.org/#feedback-workflow .

bat-literature / bat-literature.github.io

What's the workflow for curating the entries in the collection? #9