bat-literature / bat-literature.github.io

The Bat Literature Project aims to facilitate discovery of scientific literature on bats (Chiroptera)
Creative Commons Zero v1.0 Universal

request for review of 10 Sandbox Zenodo records derived from BatLit v0.4 #22

Closed jhpoelen closed 1 month ago

jhpoelen commented 1 month ago

Hi @ajacsherman @myrmoteras et al.

I've just published the following Zenodo records for your review. The records are derived from the recently published BatLit v0.4 hash://md5/b394bdb081f55916b1226b5bc8ba972a .

https://sandbox.zenodo.org/records/98713 https://sandbox.zenodo.org/records/98715 https://sandbox.zenodo.org/records/98717 https://sandbox.zenodo.org/records/98719 https://sandbox.zenodo.org/records/98721 https://sandbox.zenodo.org/records/98723 https://sandbox.zenodo.org/records/98725 https://sandbox.zenodo.org/records/98727 https://sandbox.zenodo.org/records/98729 https://sandbox.zenodo.org/records/98731

Please provide your review comments by Wednesday next week, so we have the opportunity to discuss them at the next GBatNet meeting https://globalbioticinteractions.org/gbatnet .

See also https://batlit.org .

myrmoteras commented 1 month ago

thanks for the upload

jhpoelen commented 1 month ago

@myrmoteras thanks for taking the time to have a look at the 10 BatLit Sandbox Zenodo records.

A general remark to be discussed tomorrow: do we want to create the related works link to Zotero? I personally think we should not, because this is most likely a temporary link, it does not even work now when I click on it, and finally I am not sure whether it is in the interest of the contributor; rather the opposite.

I agree that URLs are temporary - the same applies to DOIs, DOI URLs, Zenodo URLs etc.

And, URLs as text contain valuable metadata. For instance, DOI URLs contain an encoded version of the DOI they are supposed to resolve. And, for Zotero URLs, the group and work ids are included, as well as the general notion that the work was derived from a record in Zotero. I think this is valuable information regardless of whether an HTTP GET request for a URL still happens to generate something useful.
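As a minimal sketch of that point, the identifiers can be read straight from the URL text without any HTTP request. The helper names `doi_from_url` and `zotero_ids_from_url` are hypothetical, and the `groups/<group id>/items/<item key>` layout is an assumption based only on the Zotero links that appear in this thread.

```python
from urllib.parse import urlparse, unquote


def doi_from_url(url):
    """Return the DOI encoded in a doi.org URL, or None if there is none."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("doi.org", "dx.doi.org"):
        return None
    # The DOI is the (possibly percent-encoded) path after the host name.
    doi = unquote(parsed.path.lstrip("/"))
    return doi or None


def zotero_ids_from_url(url):
    """Return (group_id, item_key) from a Zotero group item URL, or None.

    Assumes the layout groups/<group id>/items/<item key> (hypothetical,
    inferred from the links in this thread).
    """
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("zotero.org", "www.zotero.org"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 4 and parts[0] == "groups" and parts[2] == "items":
        return parts[1], parts[3]
    return None


if __name__ == "__main__":
    print(doi_from_url("https://doi.org/10.1016/j.ecoinf.2020.101132"))
    print(zotero_ids_from_url("https://zotero.org/groups/5435545/items/IJI9WGI5"))
```

Both functions only inspect the string, so they keep working even if the link itself stops resolving.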

In addition, I think adding the Zotero URLs facilitates curation of the records - if someone finds a suspicious record, the curator (e.g., @ajacsherman ) can click on the link and directly look at the related record, saving valuable time.

jhpoelen commented 1 month ago

In the description field, I would either remove the sentence (Uploaded by Plazi for the Bat Literature Project) entirely if there is an abstract or, if not, make it its own paragraph. This community is "owned" by the Bat Literature Project, so we do not need to mention who uploaded it, unless you have something in mind.

Thanks for your thoughts on the description field. I think it is useful to include the role of Plazi as a company that helped to deposit the record both financially and legally.

jhpoelen commented 1 month ago

In the JSON export from https://sandbox.zenodo.org/records/98713 , the bibliographic metadata is missing; see, e.g., the example https://zenodo.org/records/13237213 :

   "custom_fields": {
      "journal:journal": {
         "pages": "1-8",
         "title": "Euscorpius",
         "volume": "62"
      }

For that, it would be helpful to have editorial access to the sandbox file to understand what metadata has been uploaded.

@myrmoteras Thanks for having a look at the bibliographic metadata

As far as I can tell, the bibliographic data related to the record is available. See the attached screenshots, the related HTML landing page https://sandbox.zenodo.org/records/98713 and the JSON page https://sandbox.zenodo.org/api/records/98713 .

Screenshot from 2024-08-06 10-42-37 Screenshot from 2024-08-06 10-41-14

jhpoelen commented 1 month ago

Is the resource type included?

   "resource_type": {
      "id": "publication-article",
      "title": {
         "de": "Zeitschriftenartikel",
         "en": "Journal article"
      }
   },

Yes, the resource type is included.

see e.g.,

   "resource_type": {
      "title": "Journal article",
      "type": "publication",
      "subtype": "article"
    },

extracted from https://sandbox.zenodo.org/api/records/98713 .
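The two snippets in this exchange carry the resource type in different shapes: one has a single `id`, the other has separate `type` and `subtype` fields. A small sketch of how a reviewer might compare them is below. The function name `resource_type_id` is hypothetical, and joining `type` and `subtype` with a hyphen is an assumption made for illustration, not a documented Zenodo mapping.

```python
def resource_type_id(resource_type):
    """Return an id such as 'publication-article' for either JSON shape.

    Assumption (not a documented Zenodo rule): when no "id" is present,
    "<type>-<subtype>" is treated as the equivalent identifier.
    """
    if "id" in resource_type:
        return resource_type["id"]
    parts = [resource_type.get("type"), resource_type.get("subtype")]
    return "-".join(p for p in parts if p)


# Shape quoted in the review question.
with_id = {
    "id": "publication-article",
    "title": {"de": "Zeitschriftenartikel", "en": "Journal article"},
}

# Shape extracted from the sandbox API response quoted in the reply.
with_type_and_subtype = {
    "title": "Journal article",
    "type": "publication",
    "subtype": "article",
}

if __name__ == "__main__":
    print(resource_type_id(with_id))
    print(resource_type_id(with_type_and_subtype))
```

Under that assumption both shapes describe the same journal-article type, which would be consistent with the answer that the resource type is included.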

jhpoelen commented 1 month ago

I would suggest making the Bat Literature Project a part of the Biodiversity Literature Repository. This makes it easier for Plazi to process articles, similar to what we do in BLR (extracting figures, treatments, etc).

Thanks for your suggestion to add BatLit records to the BLR as well. I think this is a curatorial decision, as the BLR community curators would have the ability to alter BatLit records on Zenodo. Perhaps something to discuss tomorrow?

myrmoteras commented 1 month ago

@jhpoelen I think we should not get into the discussion whether DOIs are ephemeral like any url.

These are different beasts with different support. We don't argue that libraries are not reliable because Alexandria burned down.

The fact is that a DOI resolves in almost all cases, and for Zenodo has done so for at least the last 12 years, whilst the Zotero link does not open and depends on whether we/Aja/Cullen want to maintain something on a personal level.

jhpoelen commented 1 month ago

@myrmoteras thanks for your suggestion to

remove italics etc from title https://sandbox.zenodo.org/records/98715


I agree that having HTML tags in the title is not desirable. However, the tags are also included in the Zotero record, so I'd suggest using the curatorial process described on https://batlit.org to notify the BatLit curator about this issue.

jhpoelen commented 1 month ago

@myrmoteras I reviewed your notes and I feel that I've addressed your concerns. If you disagree, please open separate issues for each item you'd like to discuss further.

Thanks again for taking the time to review these records.

jhpoelen commented 1 month ago

@jhpoelen I think we should not get into the discussion whether DOIs are ephemeral like any url.

These are different beasts with different support. We don't argue that libraries are not reliable because Alexandria burned down.

The fact is that a DOI resolves in almost all cases, and for Zenodo has done so for at least the last 12 years, whilst the Zotero link does not open and depends on whether we/Aja/Cullen want to maintain something on a personal level.

if you'd like to discuss this further, please open a separate issue.

ajacsherman commented 1 month ago

Hello hello,

I vote that we keep the BLR separate from the Biodiversity Literature Repository since we will probably construct a linked search platform on BatBase for the BLR specifically. Plus, if we are providing exclusive access to the restricted papers, we want to make sure our community stays small.

I can easily take care of the html tags this week.

Can we include the general DOI with, or instead of, the Zenodo DOI in the citation? If friends are looking at the Zenodo page for this access, they already have that location and can search easily in the collection there. They will probably want the universal (?) DOI to cite in their bibliographies.

Jorrit, you will need to share with us mere mortals about the hashes, etc. We need a walkthrough for the machine learning components of these records.

Here are some concerns I have:

- Transcription errors - It will not be efficient to review all the records manually.

- Searchability - We need to discuss keywords. How can we extract keywords, taxonomic names, and abstracts automatically from the published source?

- Missing metadata - The metadata I entered manually is incomplete, since I only added the bare minimum for a citation. I also noticed that some records that had metadata extracted through Zotero are missing authors, etc.

- Consistency of authors, journals, etc. (C. Geiselman vs. Cullen Geiselman)

- Quality control: full metadata, scan quality, abstract only, etc. How will we do this in the future as friends contribute papers to the collection? Optimally, we want colleagues to share their collections with metadata already curated.

- Screening for duplicates without metadata - If we have pdf dumps like we have had, we cannot screen for duplicates until metadata has been extracted through Zotero or manually.

- Gap analysis? Bad scans to be rescanned? Digitize missing records through associated libraries?

- Better OCR

- Will colleagues add papers through Zotero or Zenodo? Screening process?

- How will we share restricted papers?

Looking forward to the next steps, Aja


jhpoelen commented 1 month ago

@ajacsherman thanks for taking the time to share your thoughts.

I vote that we keep the BLR separate from the Biodiversity Literature Repository since we will probably construct a linked search platform on BatBase for the BLR specifically. Plus, if we are providing exclusive access to the restricted papers, we want to make sure our community stays small.

Just so I understand what you are saying: would you like the BatLit records not to be submitted to Zenodo's BLR community (https://zenodo.org/communities/biosyslit ), but only to the BatLit Zenodo community (https://zenodo.org/communities/batlit )? This would mean that BLR community members would have neither edit rights nor access to the restricted portion of the BatLit corpus via Zenodo. Only Zenodo BatLit members would have these.

jhpoelen commented 1 month ago

@ajacsherman please let me know when you are ready for a v0.5 release after you have fixed the html tags in the titles, as you mentioned in

I can easily take care of the html tags this week.

I'll be scanning the skies for the bat signal ; )

jhpoelen commented 1 month ago

@ajacsherman thanks for your comment on the DOI (original/zenodo)

Can we include the general DOI with, or instead of, the Zenodo DOI in the citation? If friends are looking at the Zenodo page for this access, they already have that location and can search easily in the collection there. They will probably want the universal (?) DOI to cite in their bibliographies.

I can see your point that folks would want to include the original DOI in their citation. Note that the "original" DOI is included in the Zenodo record (if available in the Zotero metadata). I imagine someone can create a little citation web widget in the batlit.org pages to generate any citation you'd like folks to use.

Due to a limitation in the Zenodo infrastructure, artifacts associated with records that have an "original" DOI cannot be updated. See https://github.com/zenodo/zenodo/issues/2536 . This was also discussed in TaxoDros and other contexts, but is probably worth discussing again.

jhpoelen commented 1 month ago

Jorrit, you will need to share with us mere mortals about the hashes, etc. We need a walk through for the machine learning components of these records.

The methods and background to these hashes are described in the following papers:

Elliott M.J., Poelen J.H., Fortes J.A.B. (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132 hash://sha256/136c3c1808bcf463bb04b11622bb2e7b5fba28f5be1fc258c5ea55b3b84f482c

and/or

Elliott M.J., Poelen, J.H. & Fortes, J.A.B. (2023) Signing data citations enables data verification and citation persistence. Sci Data. https://doi.org/10.1038/s41597-023-02230-y hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d

also, a more informal description / background can be found at https://linker.bio .

Happy to answer any question you may have about this.

In short - the hashes provide a unique version label for the BatLit corpus that can be used to retrieve all associated data as well as verify the authenticity of an available copy.
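To make the idea concrete, here is a minimal sketch using only the Python standard library. It is not the Preston implementation; the function names `hash_uri` and `verify` are hypothetical, and the only thing taken from this thread is the `hash://<algorithm>/<hex digest>` notation.

```python
import hashlib


def hash_uri(data: bytes, algorithm: str = "sha256") -> str:
    """Label some bytes with a hash://<algorithm>/<hex digest> identifier."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"hash://{algorithm}/{digest}"


def verify(data: bytes, expected_uri: str) -> bool:
    """Check that a copy of the content matches the identifier it claims."""
    scheme, _, rest = expected_uri.partition("://")
    algorithm, _, _digest = rest.partition("/")
    if scheme != "hash" or algorithm not in hashlib.algorithms_available:
        return False
    return hash_uri(data, algorithm) == expected_uri


if __name__ == "__main__":
    original = b"BatLit example content"  # placeholder bytes, not real corpus data
    label = hash_uri(original)
    print(label)
    print(verify(original, label))                  # unchanged copy
    print(verify(original + b" (edited)", label))   # altered copy
```

Because the label is computed from the content itself, any change to the bytes yields a different label, which is what lets a label act both as a version name and as an authenticity check.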

jhpoelen commented 1 month ago

@ajacsherman thanks again for sharing your concerns. I reviewed the items below the section

Here are some concerns I have;

and found that these concerns do not specifically point out suspicious records in BatLit v0.4. Rather, they point to concerns about the curatorial process and ways to make sure that the quality of the corpus can be managed efficiently.

Please let me know if any of these concerns are specific to BatLit v0.4 as I am sure I may have missed something in your description.

JelleZijlstra commented 1 month ago

Here's some feedback on the sample records. Most of those are very minor but fixing them could make the references more pleasant to use; your call whether it's worth addressing them.

jhpoelen commented 1 month ago

@JelleZijlstra thanks for sharing your notes.

For the next test round, I can make sure to include some examples of non-journal articles as you suggested in

All of the samples are journal articles. I'd be interested in seeing samples of other kinds of citations, such as books or chapters.

I'd be curious what the BatLit curator Aja Sherman @ajacsherman and digitization guru and data liberator @myrmoteras have to say about your other curatorial notes.

jhpoelen commented 1 month ago

I can confirm that the Zotero records related to:

  1. https://sandbox.zenodo.org/records/98713 (i.e., https://zotero.org/groups/5435545/items/IJI9WGI5 ) and,
  2. https://sandbox.zenodo.org/records/98715 (i.e., https://zotero.org/groups/5435545/items/EQRRREEU)

had the italic html tags <i> removed from the title.

For example -

preston cat hash://md5/26f7ce5dd404e33c6570edd4ba250d20\
 | grep EQRRREEU\
 | grep hasVersion\
 | preston cat\
 | jq .data.title

produced a title string without <i> as seen previously:

"The dawn bat, Eonycteris spelaea Dobson (Chiroptera: Pteropodidae) feeds mainly on pollen of economically important food plants in Thailand"

Similarly,

preston cat hash://md5/26f7ce5dd404e33c6570edd4ba250d20\
 | grep IJI9WGI5\
 | grep hasVersion\
 | preston cat\
 | jq .data.title

produced a title without tags:

"Divergent microclimates in artificial and natural roosts of the large-footed myotis (Myotis macropus)"

Big thanks to @ajacsherman for restoring the article titles!
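For anyone who wants to spot leftover markup in titles before a release, a rough sketch follows. It is not part of the BatLit or Preston tooling; the names `has_tags` and `strip_tags` are hypothetical, and the tagged input title is a reconstruction used only for illustration.

```python
import re
from html import unescape

# Matches simple opening and closing tags such as <i>, </i>, or <sup>.
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def has_tags(title: str) -> bool:
    """Report whether a title still contains HTML-style tags."""
    return TAG.search(title) is not None


def strip_tags(title: str) -> str:
    """Remove tags, decode HTML entities, and collapse repeated whitespace."""
    return " ".join(unescape(TAG.sub("", title)).split())


if __name__ == "__main__":
    # Hypothetical tagged title, reconstructed for illustration only.
    tagged = "The dawn bat, <i>Eonycteris spelaea</i> Dobson (Chiroptera: Pteropodidae)"
    print(has_tags(tagged))
    print(strip_tags(tagged))
```

A check like `has_tags` could be used to list suspicious records for the curator, leaving the actual correction to the curatorial process in Zotero.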

jhpoelen commented 1 month ago

Thanks again for your feedback. Closing this thread in prep for v0.5.