bio-guoda / preston

a biodiversity dataset tracker
MIT License

add support for streaming TaxoDros files #275

Closed jhpoelen closed 7 months ago

jhpoelen commented 8 months ago

Add support for streaming TaxoDros source data, as described in:

Bächli, G. (2024). TaxoDros - The Database on Taxonomy of Drosophilidae hash://md5/d68c923002c43271cee07ba172c67b0b hash://sha256/3e41eec4c91598b8a2de96e1d1ed47d271a7560eb6ef350a17bc67cc61255302 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10565403

DROS5.TEXT

This text file contains the bibliographic reference details for each source. Each segment begins with the source identifier (the PDF file name), followed by the tags .TEXT;, .A, .J, .S, .Z, .K and .P in exactly this order. The tags .A and .S may have continuation lines without new tags. Note the following:

.A author
.J publication year
.S title
.Z journal name [unfortunately not parsed into journal, volume, issue, last page, first page]
.Z. book
.K private comments, such as digital copy available, library numbers, and, more recently, DOIs
.P identifier for the record

Example:

.TEXT;
acurio et al., 2013
.A Acurio, A., Rafael, V., Cespedes, D., and Ruiz, A.,
.J 2013
.S Description of a New Spotted Wing Drosophila
(Diptera: Drosophilidae) Species and Its Evolutionary
Relationships Inferred by a Cladistic Analysis of
Morphological Traits.
.Z Ann. ent. Soc. Am., 106:1-11.
.K pdf
.P Acurio et al., 2013
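
The segment layout described above can be sketched as a small line-oriented parser. This is only an illustration in Python, not Preston's actual implementation: the function name `parse_segments`, the `id` key, and the choice to join continuation lines with a single space are assumptions made for this example.

```python
import re

# A tag line starts with a dot and one of the documented tag letters,
# optionally followed by a second dot (as in ".Z." for books).
TAG_LINE = re.compile(r"^\.([AJSZKP])\.?(?:\s+(.*))?$")


def parse_segments(lines):
    """Yield one dict per .TEXT; segment (hypothetical helper)."""
    record = None
    current = None  # key that receives untagged (continuation) lines
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == ".TEXT;":
            if record:
                yield record
            # The untagged line after .TEXT; is taken to be the source identifier.
            record, current = {}, "id"
            continue
        if record is None:
            continue  # ignore anything before the first segment
        match = TAG_LINE.match(line)
        if match:
            current = match.group(1)
            record[current] = (match.group(2) or "").strip()
        else:
            # Identifier line, or continuation of the previous tag (.A and .S).
            previous = record.get(current, "")
            record[current] = f"{previous} {line}".strip()
    if record:
        yield record


EXAMPLE = """\
.TEXT;
acurio et al., 2013
.A Acurio, A., Rafael, V., Cespedes, D., and Ruiz, A.,
.J 2013
.S Description of a New Spotted Wing Drosophila
(Diptera: Drosophilidae) Species and Its Evolutionary
Relationships Inferred by a Cladistic Analysis of
Morphological Traits.
.Z Ann. ent. Soc. Am., 106:1-11.
.K pdf
.P Acurio et al., 2013
"""

records = list(parse_segments(EXAMPLE.splitlines()))
print(records[0]["S"])
```

Because the function consumes any iterable of lines and yields records one at a time, it can be pointed at an open file handle without loading the whole file into memory.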

jhpoelen commented 7 months ago

hey @myrmoteras -

I've prepared some test uploads for you, please review -

https://sandbox.zenodo.org/records/28727 https://sandbox.zenodo.org/records/28729 https://sandbox.zenodo.org/records/28731 https://sandbox.zenodo.org/records/28733 https://sandbox.zenodo.org/records/28735 https://sandbox.zenodo.org/records/28737 https://sandbox.zenodo.org/records/28739 https://sandbox.zenodo.org/records/28741 https://sandbox.zenodo.org/records/28743 https://sandbox.zenodo.org/records/28745 https://sandbox.zenodo.org/records/28747 https://sandbox.zenodo.org/records/28749 https://sandbox.zenodo.org/records/28751 https://sandbox.zenodo.org/records/28753 https://sandbox.zenodo.org/records/28755 https://sandbox.zenodo.org/records/28757 https://sandbox.zenodo.org/records/28759 https://sandbox.zenodo.org/records/28761 https://sandbox.zenodo.org/records/28763 https://sandbox.zenodo.org/records/28765 https://sandbox.zenodo.org/records/28767 https://sandbox.zenodo.org/records/28769 https://sandbox.zenodo.org/records/28771 https://sandbox.zenodo.org/records/28773 https://sandbox.zenodo.org/records/28775 https://sandbox.zenodo.org/records/28777 https://sandbox.zenodo.org/records/28779 https://sandbox.zenodo.org/records/28781 https://sandbox.zenodo.org/records/28783 https://sandbox.zenodo.org/records/28785 https://sandbox.zenodo.org/records/28787 https://sandbox.zenodo.org/records/28789 https://sandbox.zenodo.org/records/28791 https://sandbox.zenodo.org/records/28793 https://sandbox.zenodo.org/records/28795 https://sandbox.zenodo.org/records/28797 https://sandbox.zenodo.org/records/28799 https://sandbox.zenodo.org/records/28801 https://sandbox.zenodo.org/records/28803 https://sandbox.zenodo.org/records/28805 https://sandbox.zenodo.org/records/28807 https://sandbox.zenodo.org/records/28809 https://sandbox.zenodo.org/records/28811 https://sandbox.zenodo.org/records/28813 https://sandbox.zenodo.org/records/28815 https://sandbox.zenodo.org/records/28817 https://sandbox.zenodo.org/records/28819 https://sandbox.zenodo.org/records/28821 
https://sandbox.zenodo.org/records/28823 https://sandbox.zenodo.org/records/28825 https://sandbox.zenodo.org/records/28827 https://sandbox.zenodo.org/records/28829 https://sandbox.zenodo.org/records/28831 https://sandbox.zenodo.org/records/28833 https://sandbox.zenodo.org/records/28835 https://sandbox.zenodo.org/records/28837 https://sandbox.zenodo.org/records/28839 https://sandbox.zenodo.org/records/28841 https://sandbox.zenodo.org/records/28843 https://sandbox.zenodo.org/records/28845 https://sandbox.zenodo.org/records/28847 https://sandbox.zenodo.org/records/28849 https://sandbox.zenodo.org/records/28851 https://sandbox.zenodo.org/records/28853 https://sandbox.zenodo.org/records/28855 https://sandbox.zenodo.org/records/28857 https://sandbox.zenodo.org/records/28859

myrmoteras commented 7 months ago

add comment here

jhpoelen commented 7 months ago

@myrmoteras I've prepared another iteration of TaxoDros derived sandbox records. Please review.

See also attached csv file for download. taxodros-sandbox-review-2024-02-28.csv

zenodo record relation identifier activity uuid
https://sandbox.zenodo.org/records/31739 http://www.w3.org/ns/prov#wasDerivedFrom https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L190498-L190506 urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .
https://sandbox.zenodo.org/records/31739 http://www.w3.org/ns/prov#alternateOf urn:lsid:taxodros.uzh.ch:id:yu%20et%20al.%2C%201999b urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .
https://sandbox.zenodo.org/records/31741 http://www.w3.org/ns/prov#wasDerivedFrom https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L115474-L115481 urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .
https://sandbox.zenodo.org/records/31741 http://www.w3.org/ns/prov#alternateOf urn:lsid:taxodros.uzh.ch:id:mik%2C%201889 urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .
https://sandbox.zenodo.org/records/31743 http://www.w3.org/ns/prov#wasDerivedFrom https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L41845-L41854 urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .
https://sandbox.zenodo.org/records/31743 http://www.w3.org/ns/prov#alternateOf urn:lsid:taxodros.uzh.ch:id:dickinson%20et%20al.%2C%201993 urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .
https://sandbox.zenodo.org/records/31745 http://www.w3.org/ns/prov#wasDerivedFrom https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L163617-L163626 urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .
https://sandbox.zenodo.org/records/31745 http://www.w3.org/ns/prov#alternateOf urn:lsid:taxodros.uzh.ch:id:spencer%2C%201938b urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .
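
For readers who want to work with these rows programmatically, here is a rough Python sketch. It assumes each row holds four whitespace-separated fields (Zenodo record, relation, identifier, activity uuid) followed by a trailing ` .`, as in the rows above. The helper name `parse_row` and the output keys are made up for this example.

```python
import re
from urllib.parse import unquote

LSID_PREFIX = "urn:lsid:taxodros.uzh.ch:id:"
# Matches the "!/L<first>-L<last>" (or "!/L<n>") suffix of a line: identifier.
LINE_RANGE = re.compile(r"!/L(\d+)(?:-L(\d+))?$")


def parse_row(row):
    """Split one relation row into named fields (hypothetical helper)."""
    fields = row.split()
    if fields and fields[-1] == ".":
        fields = fields[:-1]  # drop the statement terminator
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    record, relation, identifier, activity = fields
    parsed = {
        "record": record,
        "relation": relation.rsplit("#", 1)[-1],  # e.g. "wasDerivedFrom"
        "identifier": identifier,
        "activity": activity,
    }
    if identifier.startswith(LSID_PREFIX):
        # The LSID carries the percent-encoded TaxoDros source identifier.
        parsed["taxodros_id"] = unquote(identifier[len(LSID_PREFIX):])
    else:
        match = LINE_RANGE.search(identifier)
        if match:
            first = int(match.group(1))
            last = int(match.group(2) or first)
            parsed["lines"] = (first, last)
    return parsed


rows = [
    "https://sandbox.zenodo.org/records/31739 http://www.w3.org/ns/prov#wasDerivedFrom https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L190498-L190506 urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .",
    "https://sandbox.zenodo.org/records/31739 http://www.w3.org/ns/prov#alternateOf urn:lsid:taxodros.uzh.ch:id:yu%20et%20al.%2C%201999b urn:uuid:22b303f1-6d36-462b-b3e8-ec330a111c4c .",
]
parsed_rows = [parse_row(row) for row in rows]
for parsed in parsed_rows:
    print(parsed["relation"], parsed.get("lines"), parsed.get("taxodros_id"))
```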

myrmoteras commented 7 months ago

thanks, I will look into this asap.

One thing that I am missing is a basic set of keywords that we should add routinely: Biodiversity, Taxonomy, Animalia, Arthropoda, Insecta, Diptera

maybe also Drosophilidae

this would not need input from another file. Possible?

jhpoelen commented 7 months ago

Yes, the keywords can be added: Biodiversity, Taxonomy, Animalia, Arthropoda, Insecta, Diptera

Adding publication-specific taxonomic info is possible, but beyond the scope of this first phase.

Hope you understand.

myrmoteras commented 7 months ago

yes, I understand, and that's what I just wrote: no input from files other than the source file
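
A fixed keyword set like the one agreed on above needs no input beyond the record itself. A minimal Python sketch, assuming the record metadata is a plain dict with an optional `keywords` list (modeled loosely on Zenodo deposit metadata); the name `with_base_keywords` is made up for this example and is not part of Preston:

```python
# Keywords listed in the discussion above; "Drosophilidae" was only suggested
# as a possible addition, so it is passed separately here.
BASE_KEYWORDS = [
    "Biodiversity",
    "Taxonomy",
    "Animalia",
    "Arthropoda",
    "Insecta",
    "Diptera",
]


def with_base_keywords(metadata, extra=()):
    """Return a copy of metadata with the base keywords appended once each."""
    merged = list(metadata.get("keywords", []))
    for keyword in [*BASE_KEYWORDS, *extra]:
        if keyword not in merged:
            merged.append(keyword)
    return {**metadata, "keywords": merged}


# Hypothetical record metadata, for illustration only.
metadata = {"title": "Example title", "keywords": ["Diptera"]}
updated = with_base_keywords(metadata, extra=["Drosophilidae"])
print(updated["keywords"])
```

Existing keywords keep their position and duplicates are skipped, so applying the function more than once does not change the result.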

myrmoteras commented 7 months ago

checked and ok

https://sandbox.zenodo.org/records/31836
https://sandbox.zenodo.org/records/31832
https://sandbox.zenodo.org/records/31834
https://sandbox.zenodo.org/records/31830
https://sandbox.zenodo.org/records/31828
https://sandbox.zenodo.org/records/31826
https://sandbox.zenodo.org/records/31824
https://sandbox.zenodo.org/records/31741
https://sandbox.zenodo.org/records/31745
https://sandbox.zenodo.org/records/31747
https://sandbox.zenodo.org/records/31751
https://sandbox.zenodo.org/records/31755
https://sandbox.zenodo.org/records/31757
https://sandbox.zenodo.org/records/31759
https://sandbox.zenodo.org/records/31765
https://sandbox.zenodo.org/records/31767 book chapter
https://sandbox.zenodo.org/records/31769
https://sandbox.zenodo.org/records/31771 funny one from the tomato canners

checked with questions

https://sandbox.zenodo.org/records/31824 Comptes rendus des seances de la Societe de Biologie, 92, 778-780, 1925. should read ".. des séances de la Société de..."
https://sandbox.zenodo.org/records/31753 is a book chapter, but in Bächli's list it is a journal article? Leave as is?

with a new Zenodo DOI but with an existing DOI

https://sandbox.zenodo.org/records/31739 has https://doi.org/10.1038/sj.hdy.6885470
https://sandbox.zenodo.org/records/31743 has https://doi.org/10.1242/jeb.182.1.173
https://sandbox.zenodo.org/records/31749 has https://doi.org/10.1163/156853987x00468
https://sandbox.zenodo.org/records/31761 has https://doi.org/10.1080/00305316.1975.10434843
https://sandbox.zenodo.org/records/31763 has https://doi.org/10.1038/hdy.1950.15
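
Since the .K field may now carry DOIs, records like these could in principle be flagged before a new Zenodo DOI is minted. The Python sketch below assumes a DOI appears somewhere in free text; the exact way DOIs are written in .K is not shown in this thread, so the sample strings are hypothetical, as is the helper name `extract_doi`.

```python
import re

# Loose DOI pattern: "10.", a registrant code, a slash, then non-space characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(text):
    """Return the first DOI-like string found in text, or None."""
    match = DOI_PATTERN.search(text)
    if not match:
        return None
    # Strip punctuation that commonly trails a DOI in running text.
    return match.group(0).rstrip(".,;)")


# Hypothetical .K values, for illustration only.
print(extract_doi("pdf, https://doi.org/10.1038/hdy.1950.15"))
print(extract_doi("pdf"))
```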

jhpoelen commented 7 months ago

would you want me to add "fruit flies" "flies" "terrestrial" also?

myrmoteras commented 7 months ago

Is there a reason that you always have three versions of the same deposit?

https://sandbox.zenodo.org/records/31836 http://www.w3.org/ns/prov#wasDerivedFrom line:hash://sha256/cb94e7c16a617a56a55fbbd76c458333111053bc501d52ae34548b35967933b2!/L25
https://sandbox.zenodo.org/records/31836 http://www.w3.org/ns/prov#wasDerivedFrom https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L175241-L175251
https://sandbox.zenodo.org/records/31836 http://www.w3.org/ns/prov#alternateOf urn:lsid:taxodros.uzh.ch:id:toda%2C%201985a

jhpoelen commented 7 months ago

Yes, these three statements are about the same record and help to find associated records and data.

jhpoelen commented 7 months ago

So, there's only one version of the record. And that record has many associations. These associations are listed row by row so that they fit into a table.
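
To make the "one record, many associations" reading concrete, the rows can be grouped by record. A small Python sketch using the three statements quoted above; the function name `group_by_record` is made up for this example:

```python
from collections import defaultdict


def group_by_record(statements):
    """Collect (relation, identifier) pairs under their Zenodo record."""
    grouped = defaultdict(list)
    for record, relation, identifier in statements:
        grouped[record].append((relation, identifier))
    return dict(grouped)


RECORD = "https://sandbox.zenodo.org/records/31836"
statements = [
    (RECORD, "http://www.w3.org/ns/prov#wasDerivedFrom",
     "line:hash://sha256/cb94e7c16a617a56a55fbbd76c458333111053bc501d52ae34548b35967933b2!/L25"),
    (RECORD, "http://www.w3.org/ns/prov#wasDerivedFrom",
     "https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L175241-L175251"),
    (RECORD, "http://www.w3.org/ns/prov#alternateOf",
     "urn:lsid:taxodros.uzh.ch:id:toda%2C%201985a"),
]

grouped = group_by_record(statements)
for record, associations in grouped.items():
    print(record, "has", len(associations), "associations")
```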

myrmoteras commented 7 months ago

I do not understand - do you want me to check these relationships that are machine generated?

jhpoelen commented 7 months ago

The relations are provided to set the context. The Zenodo record is for your review, the relations are provided to aid review, but are not subject to the review. Apologies for the confusion.

jhpoelen commented 7 months ago

For what it is worth, here's the list of just the zenodo records -

https://sandbox.zenodo.org/records/31739 https://sandbox.zenodo.org/records/31741 https://sandbox.zenodo.org/records/31743 https://sandbox.zenodo.org/records/31745 https://sandbox.zenodo.org/records/31747 https://sandbox.zenodo.org/records/31749 https://sandbox.zenodo.org/records/31751 https://sandbox.zenodo.org/records/31753 https://sandbox.zenodo.org/records/31755 https://sandbox.zenodo.org/records/31757 https://sandbox.zenodo.org/records/31759 https://sandbox.zenodo.org/records/31761 https://sandbox.zenodo.org/records/31763 https://sandbox.zenodo.org/records/31765 https://sandbox.zenodo.org/records/31767 https://sandbox.zenodo.org/records/31769 https://sandbox.zenodo.org/records/31771 https://sandbox.zenodo.org/records/31773 https://sandbox.zenodo.org/records/31775 https://sandbox.zenodo.org/records/31777 https://sandbox.zenodo.org/records/31779 https://sandbox.zenodo.org/records/31781 https://sandbox.zenodo.org/records/31783 https://sandbox.zenodo.org/records/31785 https://sandbox.zenodo.org/records/31787 https://sandbox.zenodo.org/records/31789 https://sandbox.zenodo.org/records/31791 https://sandbox.zenodo.org/records/31793 https://sandbox.zenodo.org/records/31795 https://sandbox.zenodo.org/records/31797 https://sandbox.zenodo.org/records/31799 https://sandbox.zenodo.org/records/31801 https://sandbox.zenodo.org/records/31803 https://sandbox.zenodo.org/records/31804 https://sandbox.zenodo.org/records/31806 https://sandbox.zenodo.org/records/31808 https://sandbox.zenodo.org/records/31810 https://sandbox.zenodo.org/records/31812 https://sandbox.zenodo.org/records/31814 https://sandbox.zenodo.org/records/31816 https://sandbox.zenodo.org/records/31818 https://sandbox.zenodo.org/records/31820 https://sandbox.zenodo.org/records/31822 https://sandbox.zenodo.org/records/31824 https://sandbox.zenodo.org/records/31826 https://sandbox.zenodo.org/records/31828 https://sandbox.zenodo.org/records/31830 https://sandbox.zenodo.org/records/31832 
https://sandbox.zenodo.org/records/31834 https://sandbox.zenodo.org/records/31836

myrmoteras commented 7 months ago

checked and ok

https://sandbox.zenodo.org/records/31836 https://sandbox.zenodo.org/records/31832 https://sandbox.zenodo.org/records/31834 https://sandbox.zenodo.org/records/31830 https://sandbox.zenodo.org/records/31828 https://sandbox.zenodo.org/records/31826 https://sandbox.zenodo.org/records/31824

checked with questions

https://sandbox.zenodo.org/records/31824 Comptes rendus des seances de la Societe de Biologie, 92, 778-780, 1925. should read ".. des séances de la Société de..."

here are the first few - I will check tomorrow. tx

myrmoteras commented 7 months ago

@jhpoelen I did some testing https://github.com/bio-guoda/preston/issues/275#issuecomment-1969980923 and need to look at a proposal for now.

there are some issues regarding existing DOIs.

the Biodiversity Literature Repository community is not linked

adding the keywords

jhpoelen commented 7 months ago

With publication of 19,452 items in the Zenodo TaxoDros community https://zenodo.org/communities/taxodros/records , evidence suggests that Preston is able to stream TaxoDros files and publish derived metadata records.
