bio-guoda / preston

a biodiversity dataset tracker
MIT License

build corpus for Mammal Handbook of the World v9 #171

Closed jhpoelen closed 2 years ago

jhpoelen commented 2 years ago

related to https://github.com/plazi/treatmentBank/issues/49 and https://github.com/jhpoelen/hbmw-9

jhpoelen commented 2 years ago

some notes from slack discussion:

Aja Sherman 2:46 PM @Donat Agosti @Jorrit Poelen (GloBI) Hi! Can we access your file with the Handbook Vol 9 Mammals of the World treatments? Is it a file you can email (aja@batbase.org) or is it a file we can share? @Cullen Geiselman and I would like to look at gaps in species distribution and ecological data to compare with batbase.org. Once we get both of our abstracts submitted, we would like to meet with the two of you to learn how to pull out specific queries.

jhpoelen commented 2 years ago

@Aja Sherman here is the link to all the chapters of the HBMW-9 https://tb.plazi.org/GgServer/summary/FFFA1756FFFA347FFF8E4379FF81011F and the list of all the species https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.uuid+doc.name+doc.art[…]rderingFields=-bib.title&FP-doc.name=%22hbmw%25%22&format=HTML The UUIDs (Document UUID for treatments, Article UUID for articles) are hyperlinked, allowing you to access the treatments.

3:34 All the data is accessible through https://tb.plazi.org/GgServer/srsStats for treatments and https://tb.plazi.org/GgServer/dioStats for articles. 3:34 If you want a dump, I can ask Guido. In what format would you like to have the dump?

jhpoelen commented 2 years ago

Donat Agosti 3:40 PM @Aja Sherman this includes also the two groups processed a year ago (Rhinolophidae, Emballonuridae) https://tb.plazi.org/GgServer/dioStats/stats?outputFields=doc.articleUuid+doc.name+[…]b.source+cont.treatCount&FP-doc.name=%22hbmw%25%22&format=HTML

jhpoelen commented 2 years ago

or in JSON https://tb.plazi.org/GgServer/dioStats/stats?outputFields=doc.articleUuid+doc.name+[…]b.source+cont.treatCount&FP-doc.name=%22hbmw%25%22&format=JSON

jhpoelen commented 2 years ago

Donat Agosti 6:20 PM if you want to use http://tb.plazi.org/GgServer/dwca/.zip where you replace the with the value in the above table, then you can download all the family-level articles. The ZIP contains two files, whereby media.txt includes all the treatments including some HTML. description is text only, with no paragraph breaks, and the nomenclature part is not included. If you want more detail, then you should tell us what you need. For example, if you also want the taxonomic names tagged in the treatment and, if possible, material citations, then you could use the TaxPub we export to SIBiLS: TaxPub Level 1: https://tb.plazi.org/GgServer/taxPubL1/039487A4FFD3FFB8FD89F26CFD84FCA3 just replace the UUID with the treatment UUID (please note that we renamed the document UUID to treatment UUID to get rid of some confusion). If you want to have the entire Plazi internal XML then use for a treatment: TaxPub Level 1: https://tb.plazi.org/GgServer/taxPubL1/039487A4FFD3FFB8FD89F26CFD84FCA3
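A rough sketch of the "just replace the UUID" step, for anyone scripting this: the snippet builds a TaxPub Level 1 URL from a treatment UUID. `taxpub_url` is a hypothetical helper name, and the 32-hex-character check is an assumption based only on the UUIDs that appear in this thread. The dwca URL is left out because its placeholder is missing from the message above.

```shell
# Hypothetical helper (not part of any Plazi tooling): build a TaxPub Level 1
# URL by appending a treatment UUID to the taxPubL1 prefix quoted above.
taxpub_url() {
  # Assumption: treatment UUIDs are 32 hexadecimal characters, as in this thread.
  printf '%s\n' "$1" | grep -Eq '^[0-9A-Fa-f]{32}$' || return 1
  printf 'https://tb.plazi.org/GgServer/taxPubL1/%s\n' "$1"
}

# Example with the UUID from the message above.
taxpub_url 039487A4FFD3FFB8FD89F26CFD84FCA3
```

The resulting URL can then be passed to `curl` in the same way as the stats queries elsewhere in this issue.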

jhpoelen commented 2 years ago

https://github.com/plazi/treatments-xml and derived https://github.com/plazi/treatments-rdf are said to contain the treatments processed by Plazi.

However, when searching for Mammal Handbook of the World V9 via the Plazi web service

https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.uuid+doc.doi+doc.name+doc.articleUuid&groupingFields=doc.uuid+doc.doi+doc.name+doc.articleUuid&FP-doc.name=%22hbmw-9%25%22%20%22hbmw_9%25%22&format=TSV .

I found

curl "https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.uuid+doc.doi+doc.name+doc.articleUuid&groupingFields=doc.uuid+doc.doi+doc.name+doc.articleUuid&FP-doc.name=%22hbmw-9%25%22%20%22hbmw_9%25%22&format=TSV" | head -n3
DocCount    DocUuid DocDoi  DocName DocArticleUuid
1   E84887F9FFD6D6580B7BFE2016E532DE        hbmw_9_Miniopteridae_674.pdf.imf    1471FF81FFD6D6580A4AFFEC112F3619
1   E84887F9FFD4D65A0AC8FE4618BD3100    http://doi.org/10.5281/zenodo.5735204   hbmw_9_Miniopteridae_674.pdf.imf    1471FF81FFD6D6580A4AFFEC112F3619

(see screenshot)
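To spot rows like the first one, which has no DOI, something along these lines might help. It is only a sketch: `missing_doi` and `sample` are hypothetical names, and it assumes the service emits real tab separators with an empty `DocDoi` field when no DOI is assigned. The two sample rows are copied from the output above with tabs restored.

```shell
# Hypothetical filter: print the DocUuid (column 2) of every data row whose
# DocDoi (column 3) is empty. Assumes tab-separated input with a header line.
missing_doi() {
  awk -F '\t' 'NR > 1 && $3 == "" { print $2 }'
}

# Sample input: the header and two rows from the curl output in this comment.
sample() {
  printf 'DocCount\tDocUuid\tDocDoi\tDocName\tDocArticleUuid\n'
  printf '1\tE84887F9FFD6D6580B7BFE2016E532DE\t\thbmw_9_Miniopteridae_674.pdf.imf\t1471FF81FFD6D6580A4AFFEC112F3619\n'
  printf '1\tE84887F9FFD4D65A0AC8FE4618BD3100\thttp://doi.org/10.5281/zenodo.5735204\thbmw_9_Miniopteridae_674.pdf.imf\t1471FF81FFD6D6580A4AFFEC112F3619\n'
}

sample | missing_doi
```

In practice the `sample` call would be replaced by the `curl` command above, piped straight into `missing_doi`.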

However, when looking in https://github.com/plazi/treatments-xml, I was unable to locate the treatment with UUID E84887F9FFD6D6580B7BFE2016E532DE:

$ unzip -l data/28/b1/28b14cb51da9669062e56b46baf94b928a4db800dac9dc598ef12e285b272ee5 | pv -l | grep --ignore-case E84887F9FFD6D6580B7BFE2016E532DE
 829k 0:00:09 [83.3k/s] [                <=>                                                                   ]

However, I was able to find E84887F9FFD4D65A0AC8FE4618BD3100:

$ unzip -l data/28/b1/28b14cb51da9669062e56b46baf94b928a4db800dac9dc598ef12e285b272ee5 | pv -l | grep --ignore-case E84887F9FFD4D65A0AC8FE4618BD3100
    28737  2022-05-04 07:40   treatments-xml-main/data/E8/48/87/E84887F9FFD4D65A0AC8FE4618BD3100.xml
 829k 0:00:09 [85.1k/s] [                <=>                                                                   ]

where

<https://github.com/plazi/treatments-xml/archive/master.zip> <http://purl.org/pav/hasVersion> <hash://sha256/28b14cb51da9669062e56b46baf94b928a4db800dac9dc598ef12e285b272ee5> <urn:uuid:a7da841d-214d-40ba-8158-a1b619bcb099> .
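For reference, the two paths used above can be derived mechanically. This is a minimal sketch based only on the paths visible in this comment: `hash_to_path` and `uuid_to_entry` are hypothetical helper names, and the two-level `data/28/b1/...` layout and the `treatments-xml-main/data/E8/48/87/...` layout are inferred from the examples here, not taken from any documentation.

```shell
# Hypothetical helper: map a hash://sha256/... URI to the local file path
# used in the unzip commands above (data/<chars 1-2>/<chars 3-4>/<hex>).
hash_to_path() {
  hex=${1#hash://sha256/}
  first=$(printf '%s' "$hex" | cut -c1-2)
  second=$(printf '%s' "$hex" | cut -c3-4)
  printf 'data/%s/%s/%s\n' "$first" "$second" "$hex"
}

# Hypothetical helper: map a treatment UUID to its expected entry name inside
# the treatments-xml archive (layout inferred from the single listing above).
uuid_to_entry() {
  a=$(printf '%s' "$1" | cut -c1-2)
  b=$(printf '%s' "$1" | cut -c3-4)
  c=$(printf '%s' "$1" | cut -c5-6)
  printf 'treatments-xml-main/data/%s/%s/%s/%s.xml\n' "$a" "$b" "$c" "$1"
}

hash_to_path hash://sha256/28b14cb51da9669062e56b46baf94b928a4db800dac9dc598ef12e285b272ee5
uuid_to_entry E84887F9FFD4D65A0AC8FE4618BD3100
```

With the entry name known up front, a single member can be checked or extracted directly instead of grepping the full archive listing.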

jhpoelen commented 2 years ago

$ unzip -p data/28/b1/28b14cb51da9669062e56b46baf94b928a4db800dac9dc598ef12e285b272ee5  treatments-xml-main/data/E8/48/87/E84887F9FFD4D65A0AC8FE4618BD3100.xml\
 | xmllint --format -

produced:

<?xml version="1.0"?>
<document ID-DOI="http://doi.org/10.5281/zenodo.5735204" ID-GBIF-Dataset="a1d7ccc4-c76c-4ac3-bdd4-c53c2f378b35" ID-ISBN="978-84-16728-19-0" ID-Zenodo-Dep="5735204" approvalRequired="98" approvalRequired_for_document="2" approvalRequired_for_matCits="7" approvalRequired_for_originalDoi="1" approvalRequired_for_taxonomicNames="36" approvalRequired_for_treatments="52" checkinTime="1600872994229" checkinUser="plazi" docAuthor="Don E. Wilson &amp; Russell A. Mittermeier" docDate="2019" docId="E84887F9FFD4D65A0AC8FE4618BD3100" docLanguage="en" docName="hbmw_9_Miniopteridae_674.pdf.imf" docOrigin="Handbook of the Mammals of the World &#x2013; Volume 9 Bats, Barcelona: Lynx Edicions" docTitle="Miniopterus fuliginosus" docType="treatment" docVersion="5" lastPageNumber="693" masterDocId="1471FF81FFD6D6580A4AFFEC112F3619" masterDocTitle="Miniopteridae" masterLastPageNumber="709" masterPageNumber="674" pageId="2" pageNumber="693" updateTime="1649267493270" updateUser="ExternalLinkService">
  <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
    <mods:titleInfo>
      <mods:title>Miniopteridae</mods:title>
    </mods:titleInfo>
    <mods:name type="personal">
      <mods:role>
        <mods:roleTerm>Author</mods:roleTerm>
      </mods:role>
      <mods:namePart>Don E. Wilson</mods:namePart>
    </mods:name>
    <mods:name type="personal">
      <mods:role>
        <mods:roleTerm>Author</mods:roleTerm>
      </mods:role>
      <mods:namePart>Russell A. Mittermeier</mods:namePart>
    </mods:name>
    <mods:typeOfResource>text</mods:typeOfResource>
    <mods:relatedItem type="host">
      <mods:originInfo>
        <mods:dateIssued>2019</mods:dateIssued>
        <mods:dateOther type="pubDate">2019-10-31</mods:dateOther>
        <mods:publisher>Lynx Edicions</mods:publisher>
        <mods:place>
          <mods:placeTerm>Barcelona</mods:placeTerm>
        </mods:place>
      </mods:originInfo>
      <mods:titleInfo>
        <mods:title>Handbook of the Mammals of the World &#x2013; Volume 9 Bats</mods:title>
      </mods:titleInfo>
      <mods:part>
        <mods:extent unit="page">
          <mods:start>674</mods:start>
          <mods:end>709</mods:end>
        </mods:extent>
      </mods:part>
    </mods:relatedItem>
    <mods:classification>book chapter</mods:classification>
    <mods:identifier type="DOI">http://doi.org/10.5281/zenodo.5735202</mods:identifier>
    <mods:identifier type="GBIF-Dataset">a1d7ccc4-c76c-4ac3-bdd4-c53c2f378b35</mods:identifier>
    <mods:identifier type="ISBN">978-84-16728-19-0</mods:identifier>
    <mods:identifier type="Zenodo-Dep">5735202</mods:identifier>
  </mods:mods>
  <treatment ID-DOI="http://doi.org/10.5281/zenodo.5735204" ID-Zenodo-Dep="5735204" LSID="urn:lsid:plazi:treatment:E84887F9FFD4D65A0AC8FE4618BD3100" httpUri="http://treatment.plazi.org/id/E84887F9FFD4D65A0AC8FE4618BD3100" lastPageNumber="693" pageId="2" pageNumber="693">
    <subSubSection box="[130,159,426,472]" pageId="2" pageNumber="693" type="multiple">
      <paragraph blockId="2.[124,1185,426,555]" box="[130,159,426,472]" pageId="2" pageNumber="693">
        <heading box="[130,159,426,472]" pageId="2" pageNumber="693">
          <figureCitation box="[130,159,426,472]" captionStart="Plate 52: Miniopteridae" captionStartId="2.[117,147,3330,3355]" captionTargetBox="[11,2764,18,3659]" captionTargetPageId="1" captionText="1. Asian Long-fingered Bat (Miniopterus fuliginosus), 2. Large Long-fingered Bat (Miniopterus magnater), 3. Small Long-fingered Bat (Miniopterus pusillus), 4. Intermediate Long-fingered Bat (Miniopterus meduius), 5. Ryukyu Long-fingered Bat (Miniopterus fuscus), 6. Eschscholtz&#x2019;s Long-fingered Bat (Miniopterus eschscholtzii), 7. Philippine Long-fingered Bat (Miniopterus paululus), 8. Great Long-fingered Bat (Miniopterus tristis), 9. Shortridge&#x2019;s Long-fingered Bat (Miniopterus shortridgei), 10. Javanese Long-fingered Bat (Miniopterus blepotis), 11. Little Long-fingered Bat (Miniopterus australis), 12. Small Melanesian Long-fingered Bat (Miniopterus macrocneme), 13. Loyalty Long-fingered Bat (Miniopterus robustior), 14. Australian Long-fingered Bat (Miniopterus orianae), 15. Pallid Long-fingered Bat (Miniopterus pallidus), 16. Schreibers&#x2019;s Long-fingered Bat (Miniopterus schreibersu), 17. Neghrohion Long-fingered Bat (Miniopterus maghrebensis)" figureDoi="http://doi.org/10.5281/zenodo.6419162" httpUri="https://zenodo.org/record/6419162/files/figure.png" pageId="2" pageNumber="693">1.</figureCitation>
        </heading>
      </paragraph>
    </subSubSection>
    <subSubSection box="[173,686,426,472]" pageId="2" pageNumber="693" type="vernacular_names">
      <paragraph blockId="2.[124,1185,426,555]" box="[173,686,426,472]" pageId="2" pageNumber="693">
        <heading box="[173,686,426,472]" pageId="2" pageNumber="693">
          <vernacularName box="[173,686,426,472]" pageId="2" pageNumber="693">Asian Long-fingered Bat</vernacularName>
        </heading>
      </paragraph>
    </subSubSection>
    <subSubSection box="[735,1185,426,472]" pageId="2" pageNumber="693" type="nomenclature">
      <paragraph blockId="2.[124,1185,426,555]" box="[735,1185,426,472]" pageId="2" pageNumber="693">
        <heading box="[735,1185,426,472]" pageId="2" pageNumber="693">
          <taxonomicName box="[735,1185,426,472]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="genus" species="fuliginosus">
            <emphasis box="[735,1185,426,472]" italics="true" pageId="2" pageNumber="693">Miniopterus fuliginosus</emphasis>
          </taxonomicName>
        </heading>
      </paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="vernacular_names">
      <paragraph blockId="2.[124,1185,426,555]" box="[126,1176,490,511]" pageId="2" pageNumber="693">
        <heading box="[126,1176,490,511]" pageId="2" pageNumber="693"><emphasis bold="true" box="[126,202,490,511]" pageId="2" pageNumber="693">French:</emphasis><vernacularName box="[211,423,490,511]" pageId="2" pageNumber="693">Miniopt&#xE9;re fuligineux</vernacularName>
/ 
<emphasis bold="true" box="[443,534,490,511]" pageId="2" pageNumber="693">German:</emphasis>
<vernacularName box="[542,867,490,511]" pageId="2" pageNumber="693">Asiatische Langfligelfledermaus</vernacularName>
/ 
<taxonomicName authority=": Minidptero" authorityName="Minidptero" box="[888,1097,490,511]" class="Aves" family="Phasianidae" genus="Spanish" kingdom="Animalia" order="Galliformes" pageId="2" pageNumber="693" phylum="Chordata" rank="genus"><emphasis bold="true" box="[888,979,490,511]" pageId="2" pageNumber="693">Spanish:</emphasis><vernacularName box="[990,1097,490,511]" pageId="2" pageNumber="693">Minidptero</vernacularName></taxonomicName>
de Asia
</heading>
      </paragraph>
      <paragraph blockId="2.[124,1185,426,555]" box="[125,1152,530,551]" pageId="2" pageNumber="693">
        <heading box="[125,1152,530,551]" pageId="2" pageNumber="693"><emphasis bold="true" box="[125,372,530,551]" pageId="2" pageNumber="693">Other common names:</emphasis><vernacularName box="[380,611,530,551]" pageId="2" pageNumber="693">Asian Bent-winged Bat</vernacularName>
, 
<vernacularName box="[626,876,530,551]" pageId="2" pageNumber="693">Eastern Bent-winged Bat</vernacularName>
, 
<vernacularName box="[891,1152,530,551]" pageId="2" pageNumber="693">Eastern Long-fingered Bat</vernacularName>
</heading>
      </paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="reference_group">
      <paragraph blockId="2.[735,1329,598,1025]" pageId="2" pageNumber="693"><emphasis bold="true" box="[737,893,598,631]" pageId="2" pageNumber="693">Taxonomy.</emphasis>
Vespertiliofuliginosa [sic] Hodgson, 1835, 
<materialsCitation box="[889,1004,645,670]" pageId="2" pageNumber="693">
&#x201C; 
<collectingCountry box="[900,988,645,670]" name="Nepal" pageId="2" pageNumber="693">Nepal</collectingCountry>
.&#x201D;
</materialsCitation>
</paragraph>
      <paragraph blockId="2.[735,1329,598,1025]" pageId="2" pageNumber="693"><taxonomicName authorityName="Bonaparte" authorityYear="1837" box="[738,892,677,710]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="genus">Miniopterus</taxonomicName>
fuliginosus was traditionally included in 
<taxonomicName baseAuthorityName="Kuhl" baseAuthorityYear="1817" box="[915,1098,715,748]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="species" species="schreibersii">M. schreibersii</taxonomicName>
until recent genetic and morphometric evidence confirmed it as a valid species and totally independent of West Palearctic 
<taxonomicName authorityName="Bonaparte" authorityYear="1837" box="[1174,1329,834,867]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="genus">Miniopterus</taxonomicName>
and different from the rest of Eastern/ Australian species once included in the schreibersui species complex (
<taxonomicName authorityName="Sanborn" authorityYear="1931" box="[1147,1322,956,985]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="696" phylum="Chordata" rank="species" species="magnater">M. magnater</taxonomicName>
, 
<taxonomicName baseAuthorityName="Waterhouse" baseAuthorityYear="1845" box="[738,933,1000,1025]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="species" species="eschscholtzii">M. eschscholtzii</taxonomicName>
, 
<taxonomicName authorityName="Bonaparte" authorityYear="1837" baseAuthorityName="Temminck" baseAuthorityYear="1840" box="[949,1087,1000,1025]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="698" phylum="Chordata" rank="species" species="blepotis">M. blepotis</taxonomicName>
, and 
<taxonomicName authorityName="Thomas" authorityYear="1922" box="[1165,1312,1000,1025]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="species" species="orianae">M. orianae</taxonomicName>
).
</paragraph>
      <paragraph blockId="2.[116,1329,1031,3234]" pageId="2" pageNumber="693">
Taxonomy of M. fuliginosus is not completely resolved because no genetic study has included samples from populations in south-central 
<collectingCountry box="[869,943,1070,1103]" name="India" pageId="2" pageNumber="693">India</collectingCountry>
and 
<collectingCountry box="[1017,1152,1070,1103]" name="Sri Lanka" pageId="2" pageNumber="693">Sri Lanka</collectingCountry>
that are isolated from other populations and live in very different environments. Bats formerly assigned to the 
<taxonomicName baseAuthorityName="Kuhl" baseAuthorityYear="1817" box="[347,480,1149,1182]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="species" species="schreibersii">schreibersii</taxonomicName>
species complex from mainland South-east Asia need to be genetically identified to know if they belong to M. fuliginosus, 
<taxonomicName authorityName="Bonaparte" authorityYear="1837" baseAuthorityName="Temminck" baseAuthorityYear="1840" box="[1017,1157,1188,1221]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="698" phylum="Chordata" rank="species" species="blepotis">M. blepotis</taxonomicName>
, or 
<taxonomicName authorityName="Sanborn" authorityYear="1931" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="696" phylum="Chordata" rank="species" species="magnater">M. magnater</taxonomicName>
. Monotypic.
</paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="distribution">
      <caption ID-DOI="http://doi.org/10.5281/zenodo.5735206" ID-Zenodo-Dep="5735206" httpUri="https://zenodo.org/record/5735206/files/figure.png" inLine="true" pageId="2" pageNumber="693" targetBox="[122,713,605,1018]" targetPageId="2">
        <paragraph blockId="2.[116,1329,1031,3234]" pageId="2" pageNumber="693"><emphasis bold="true" box="[123,299,1271,1300]" pageId="2" pageNumber="693">Distribution.</emphasis>
NE 
<collectingCountry box="[365,537,1271,1300]" name="Afghanistan" pageId="2" pageNumber="693">Afghanistan</collectingCountry>
, N 
<collectingCountry box="[589,707,1271,1300]" name="Pakistan" pageId="2" pageNumber="693">Pakistan</collectingCountry>
, NW, N, NE &amp; S 
<collectingCountry box="[953,1028,1271,1300]" name="India" pageId="2" pageNumber="693">India</collectingCountry>
, 
<collectingCountry box="[1046,1131,1271,1300]" name="Nepal" pageId="2" pageNumber="693">Nepal</collectingCountry>
, 
<collectingCountry box="[1149,1286,1271,1300]" name="Sri Lanka" pageId="2" pageNumber="693">Sri Lanka</collectingCountry>
, N 
<collectingCountry box="[123,255,1311,1340]" name="Myanmar" pageId="2" pageNumber="693">Myanmar</collectingCountry>
, N 
<collectingCountry box="[303,425,1311,1340]" name="Vietnam" pageId="2" pageNumber="693">Vietnam</collectingCountry>
, S, SE &amp; E 
<collectingCountry box="[583,668,1311,1340]" name="China" pageId="2" pageNumber="693">China</collectingCountry>
, 
<collectingCountry box="[684,784,1311,1340]" name="Taiwan" pageId="2" pageNumber="693">Taiwan</collectingCountry>
, Korean Peninsula, extreme S Russian Far East, and 
<collectingCountry box="[308,393,1346,1379]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
(except 
<collectingRegion box="[520,665,1346,1379]" country="Japan" name="Hokkaido" pageId="2" pageNumber="693">Hokkaido</collectingRegion>
); it may occur in 
<collectingCountry box="[910,1013,1346,1379]" name="Bhutan" pageId="2" pageNumber="693">Bhutan</collectingCountry>
and 
<collectingCountry box="[1085,1250,1346,1379]" name="Bangladesh" pageId="2" pageNumber="693">Bangladesh</collectingCountry>
.
</paragraph>
      </caption>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="description">
      <paragraph blockId="2.[116,1329,1031,3234]" pageId="2" pageNumber="693"><emphasis bold="true" box="[122,375,1386,1419]" pageId="2" pageNumber="693">Descriptive notes.</emphasis>
Head-body 47-65 mm, tail 44-61 mm, ear 8:7-12 mm, hindfoot 7-12 mm, forearm 44-7-49-6 mm; weight 13-6 g (+ 1-1 g SD). Pelage of the Asian Long-fingered Bat is soft, velvety, and silky. Bases and tips of hairs are unicolored. Dorsal surface is blackish brown to gray-brown. Venteris slightly paler, usually dark gray-brown, and occasionally has a more or less reddish morph. Ears are short. Tragus is slightly curved forward. Membranes are dark, almost black. Dental formula for all species of Miniopterusis12/3,C1/1,P 2/3, M 3/3 (x2) = 36. Chromosomal complement has 2n = 46 and FN = 52 (
<collectingCountry box="[565,645,1666,1695]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
and 
<collectingCountry box="[719,804,1666,1695]" name="China" pageId="2" pageNumber="693">China</collectingCountry>
) or 2n = 46 and FN = 54 (
<collectingCountry box="[1173,1253,1666,1695]" name="India" pageId="2" pageNumber="693">India</collectingCountry>
).
</paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="biology_ecology">
      <paragraph blockId="2.[116,1329,1031,3234]" pageId="2" pageNumber="693"><emphasis bold="true" box="[122,233,1702,1735]" pageId="2" pageNumber="693">Habitat.</emphasis>
Mostly temperate habitats from arid steppes in 
<collectingCountry box="[915,1084,1702,1735]" name="Afghanistan" pageId="2" pageNumber="693">Afghanistan</collectingCountry>
to wooded areas in 
<collectingCountry box="[160,244,1741,1774]" name="China" pageId="2" pageNumber="693">China</collectingCountry>
and 
<collectingCountry box="[313,400,1741,1774]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
, more tropical habitats in southern 
<collectingCountry box="[915,988,1741,1774]" name="India" pageId="2" pageNumber="693">India</collectingCountry>
(wet evergreen forests) and 
<collectingCountry box="[184,319,1781,1814]" name="Sri Lanka" pageId="2" pageNumber="693">Sri Lanka</collectingCountry>
, and mainly lower hilly country in 
<collectingCountry box="[807,940,1781,1814]" name="Sri Lanka" pageId="2" pageNumber="693">Sri Lanka</collectingCountry>
from sea level to elevations above 
<quantity box="[213,314,1823,1852]" metricMagnitude="3" metricUnit="m" metricValue="2.0" pageId="2" pageNumber="693" unit="m" value="2000.0">2000 m</quantity>
(Himalayas).
</paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="food_feeding">
      <paragraph blockId="2.[116,1329,1031,3234]" pageId="2" pageNumber="693"><emphasis bold="true" box="[121,396,1860,1893]" pageId="2" pageNumber="693">Food and Feeding.</emphasis>
The Asian Long-fingered Bat typically forages in open spaces 9-12 m above grasslands, woodlands, and open water. Diet mainly contains species of 
<taxonomicName box="[121,297,1939,1972]" class="Insecta" kingdom="Animalia" order="Lepidoptera" pageId="2" pageNumber="693" phylum="Arthropoda" rank="order">Lepidoptera</taxonomicName>
, generally more than 50% by volume of prey. 
<taxonomicName authority=", Coleoptera, and Trichoptera" authorityName="Coleoptera, and Trichoptera" class="Insecta" kingdom="Animalia" order="Diptera" pageId="2" pageNumber="693" phylum="Arthropoda" rank="order">Diptera, Coleoptera, and Trichoptera</taxonomicName>
are also frequent prey but have a 
<taxonomicName authorityName="Peters" authorityYear="1867" box="[762,849,1978,2011]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="species" species="minor">minor</taxonomicName>
and variable importance depending on time of year and locality. 
<taxonomicName authority=", Ephemeroptera" authorityName="Ephemeroptera" box="[593,1034,2017,2050]" class="Insecta" kingdom="Animalia" order="Hymenoptera" pageId="2" pageNumber="693" phylum="Arthropoda" rank="order">Hymenoptera, Ephemeroptera</taxonomicName>
, and 
<taxonomicName box="[1118,1268,2017,2050]" class="Insecta" family="Noctuidae" genus="Plecoptera" kingdom="Animalia" order="Lepidoptera" pageId="2" pageNumber="693" phylum="Arthropoda" rank="genus">Plecoptera</taxonomicName>
are occasionally eaten. Body lengths of prey seem to be less than 
<quantity box="[983,1081,2056,2089]" metricMagnitude="-2" metricUnit="m" metricValue="2.5" pageId="2" pageNumber="693" unit="mm" value="25.0">25 mm</quantity>
.
</paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="breeding">
      <paragraph blockId="2.[116,1329,1031,3234]" pageId="2" pageNumber="693"><emphasis bold="true" box="[120,254,2096,2129]" pageId="2" pageNumber="693">Breeding.</emphasis>
The Asian Long-fingered Bat is seasonally monoestrous, with only one young per pregnancy. This cycle has local variations to adapt to different climatic conditions throughout its wide distribution. Tropical populations of southern 
<collectingCountry box="[1186,1260,2175,2208]" name="India" pageId="2" pageNumber="693">India</collectingCountry>
and 
<collectingCountry box="[120,254,2215,2248]" name="Sri Lanka" pageId="2" pageNumber="693">Sri Lanka</collectingCountry>
do not have any delay throughout the cycle, but females in northern populations in cold climates in 
<collectingCountry box="[469,554,2254,2287]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
have delayed implantation of blastocysts and post-implantation delays during gestation. This second delay is a facultative response to prolonged torporlinked to cool conditions and associated decreases in food availabilities. Gestation lasts &#xA2;.4 months in tropical populations and &#xA2;.8-5 months in northernmost populations; these northern populations have c.2 months of delayed implantation, c.3 months of delayed development, and 3-5 months of fetal growth. Populations in temperate mild climates, intermediate between these two extremes, only have a few months of delayed implantation. In any case, it seems that births in all populations are synchronized during short periods of time. In 
<collectingCountry box="[853,940,2570,2603]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
, copulation takes place in autumn, and births occur synchronously from late June to earlyJuly. Most females give birth for the first time at the end of their second year. In tropical 
<collectingCountry box="[1077,1152,2649,2682]" name="India" pageId="2" pageNumber="693">India</collectingCountry>
, copulation takes place in the second and third weeks of February, and all births in the colony occur between 15 June and 25 June. Neonates are completely naked, with closed eyes, and weigh c. 
<quantity box="[301,350,2767,2800]" metricMagnitude="-3" metricUnit="kg" metricValue="3.0" pageId="2" pageNumber="693" unit="g" value="3.0">3 g</quantity>
. Sex ratio is even during their first 2-3 months oflife. It has been suggested that young are nursed communally, probably related to enormous size of the breeding colony (100,000-200,000 individuals). Lactating females are found until mid-August. By mid-October, young are the size and weight of adults. Sexual maturity of females is not reached until they are at least 20 months old and males at &#xA2;.19 months old.
</paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="activity">
      <paragraph blockId="2.[116,1329,1031,3234]" lastBlockId="2.[1392,2608,282,1821]" pageId="2" pageNumber="693"><emphasis bold="true" box="[116,359,3004,3037]" pageId="2" pageNumber="693">Activity patterns.</emphasis>
In Ohse-do Cave (Kyushu district, 
<geoCoordinate box="[890,974,3004,3037]" degrees="32" direction="north" orientation="latitude" pageId="2" pageNumber="693" precision="55555" value="32.0">32&#xB0; N</geoCoordinate>
) in 
<collectingCountry box="[1039,1126,3004,3037]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
, Asian Longfingered Bats started vocalizing c.1-2 hours before emerging from the cave. Such an early awakening is probably due to endogenous activity rhythm, and light sampling behavior could be seen a few minutes before individuals emerged. Emergence time was synchronized with sunset and correlated with appearance of prey. Asian Longfingered Bats are most active soon after sunset in early spring and late autumn and secondarily active before sunrise. This activity pattern seems to correspond to their feeding pattern. Feeding in summer lasted until c.02:00 h. There is always some activity during winter. When temperature at dusk is less than 7&#xB0;C, activity is greatly reduced, but when it is 7-13&#xB0;C,at least one-half of the colony becomes active. During periods of winter activity, Asian Long-fingered Bats must be foraging because fresh feces appear under colonies. Hibernation begins in December and ends at the end of February. In late autumn, body fat begins to rapidly increase and reaches maximum values at the end of November (weight 15-16 g for adults and 13-9-14-5 g for young). At the end of February when hibernation ends, body weights are 11-5-12-5 g for adults and 10-8-11-2 g for young. During this period, individuals select in the coldest areas of the cave with temperatures of 68&#xB0;C and maintain their body temperatures to less than one degree above ambient temperatures. Tropical populations do not hibernate. The Asian Long-fingered Bat typically roosts in caves but also uses abandoned mines, tunnels, and similar structures such as underground channels. Echolocation calls have downward FM signals. Regional characteristics include: start frequencies of 54-3-113 kHz, end frequencies of 42-9-53 kHz, peak frequencies of 44-5-62-4 kHz, and durations of 1-5-9 milliseconds in southern 
<collectingCountry box="[2315,2391,918,947]" name="India" pageId="2" pageNumber="693">India</collectingCountry>
; peak frequencies 53-5-57-5 kHz in 
<collectingCountry box="[1723,1810,953,986]" name="China" pageId="2" pageNumber="693">China</collectingCountry>
; peak frequencies of 50-3 kHz in 
<collectingCountry box="[2308,2394,953,986]" name="South Korea" pageId="2" pageNumber="693">Korea</collectingCountry>
; and peak frequencies of 52-1 kHz in 
<collectingCountry box="[1739,1827,992,1025]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
.
</paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="biology_ecology">
      <paragraph blockId="2.[1392,2608,282,1821]" pageId="2" pageNumber="693"><emphasis bold="true" box="[1399,2099,1032,1065]" pageId="2" pageNumber="693">Movements, Home range and Social organization.</emphasis>
The Asian Long-fingered Bat probably has a metapopulation structure, like other temperate species of 
<taxonomicName authorityName="Bonaparte" authorityYear="1837" box="[2442,2598,1071,1104]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="genus">Miniopterus</taxonomicName>
. Displacement of 
<quantity box="[1642,1746,1111,1144]" metricMagnitude="5" metricUnit="m" metricValue="2.0" pageId="2" pageNumber="693" unit="km" value="200.0">200 km</quantity>
was recorded in 
<collectingCountry box="[1988,2072,1111,1144]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
that is of similar magnitude to those known in Europe for Schreibers&#x2019;s Long-fingered Bat between different refuges used by the same population. Breeding colonies of up to 12,000 individuals are known in 
<collectingCountry box="[1393,1480,1230,1263]" name="Japan" pageId="2" pageNumber="693">Japan</collectingCountry>
, consisting almost entirely of adult females. Hibernation colonies can have up to 83,000 individuals, although they are usually much smaller. The colony in Robbers&#x2019; Cave in Western Ghats, 
<collectingCountry box="[1743,1818,1317,1342]" name="India" pageId="2" pageNumber="693">India</collectingCountry>
, contains 100,000-200,000 individuals in the breeding period, and it includes females and males with no apparent sexual segregation. This colony is considered the &#x201C;mother colony,&#x201D; which contains individuals from other &#x201C;secondary colonies&#x201D; usually within 
<quantity box="[1847,1931,1426,1459]" metricMagnitude="4" metricUnit="m" metricValue="7.0" pageId="2" pageNumber="693" unit="km" value="70.0">70 km</quantity>
of the mother colony.
</paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="conservation">
      <paragraph blockId="2.[1392,2608,282,1821]" pageId="2" pageNumber="693"><emphasis bold="true" box="[1398,1743,1475,1500]" pageId="2" pageNumber="693">Status and Conservation.</emphasis><collectingRegion box="[1753,1806,1475,1500]" country="Cameroon" name="North" pageId="2" pageNumber="693">Not</collectingRegion>
assessed as a separate species on The [UCNRed List, where itis included under Schreiber&#x2019;s Long-fingered Bat (
<taxonomicName baseAuthorityName="Kuhl" baseAuthorityYear="1817" box="[2117,2297,1506,1539]" class="Mammalia" family="Miniopteridae" genus="Miniopterus" kingdom="Animalia" order="Chiroptera" pageId="2" pageNumber="693" phylum="Chordata" rank="species" species="schreibersii">M. schreibersii</taxonomicName>
) as Near Threatened.
</paragraph>
    </subSubSection>
    <subSubSection pageId="2" pageNumber="693" type="bibRefCitation_list">
      <paragraph blockId="2.[1392,2608,282,1821]" pageId="2" pageNumber="693"><emphasis bold="true" box="[1398,1551,1555,1580]" pageId="2" pageNumber="693">Bibliography.</emphasis>
Akmali et al. (2015), Ao Lei et al. (2006), Appleton et al. (2004), Bates &amp; Harrison (1997), Benda &amp; Gaisler (2015), Brosset (1962c), Corbet &amp; Hill (1992), Francis (2008a), Francis et al. (2010), 
<collectingRegion box="[2370,2419,1595,1620]" country="Japan" name="Fukui" pageId="2" pageNumber="693">Fukui</collectingRegion>
et al. (2015), Funakoshi &amp; Takeda (1998), Funakoshi &amp; Uchida (1975, 1978a), Furman, Oztunc&#xA2; &amp; Coraman (2010), Gopalakrishna et al. (1986), Hendrichsen, Bates, Hayes &amp; Walston (2001), Hodgson (1835), Hu Kailiang et al. (2011), Kimura &amp; Uchida (1983), Kruskop et al. (2012), Li Shi et al. (2015), Maeda (1982), Mahmood-ul-Hassan &amp; Salim (2015), Ohdachi et al. (2009), Saikia (2018), Sramek et al. (2013), Srinivasulu, C. et al. (2010), Tian Lanxiang et al. (2004), Uchida et al. (1984), Vanitharani et al. (2013), Wordley et al. (2014), Zhang Chunmian et al. (2018).
</paragraph>
    </subSubSection>
  </treatment>
</document>
jhpoelen commented 2 years ago

desired extracted columns:

taxonID, Family, Genus, Species, taxonRank, scientificName, scientificNameAuthorship, canonicalName, verbatimScientificName, references (example:http://treatment.plazi.org/id/03D587F2FFC94C03F8F13AECFBD8F765), common name, french, german, spanish, taxonomy text, Distribution, Descriptive notes, Habitat, Food and Feeding, Breeding, Activity patterns, Movement/Home range/Social organization, Status and Conservation, Bibliography, page number, doi, media

jhpoelen commented 2 years ago

Here's an example prepared by Aja

taxonID Family Genus Species taxonRank scientificName scientificNameAuthorship canonicalName verbatimScientificName references common name french german spanish taxonomy text Distribution Descriptive notes Habitat Food and Feeding Breeding Activity patterns Movement/Home range/Social organization Status and Conservation Bibliography DOI page number media
                Taphozous troughtoni http://treatment.plazi.org/id/03D587F2FFC94C03F8F13AECFBD8F765 Troughton’s Sheath-tailed Bat Taphien deTroughton Troughton-Grabfledermaus Tafozo de Troughton Taphozous troughtoni Tate, 1952 , “ Rifle Creek, Mt. Isa, northwest Queensland,” Australia . Taphozous troughtoni is in the subgenus Taphozous . It was considered ajunior synonym of T georgianus , but. T. Chimimba and D. J. Kitchener in 1991 raised it to a distinct species. Monotypic. NE Australia endemic, in WC, C & E Queensland. Head-body 79-4-86-3 mm, tail 31-5-36-9 mm, ear 22-4-27-1 mm, hindfoot 9-8-10-3 mm, forearm 73-76 mm; weight. 20-29 g. Dorsum of Troughton’s Sheath-tailed Bat is predominately olive­ brown, with pale mouse-gray guard hairs. Venter surface hairs are olive-brown from chin to shoulders and posteriorly dark yellow-brown, with guard hairs of pale mouse-gray. Uropatagium close to abdomen is heavily furred. Throat pouches are absent, and radio-metacarpal sacs are present in both sexes. Skin of rhinarium, wings, uropatagium, lips, face, and tragus are fuscous (pale yellow). Wide variety of habitats and bioregions of interior Queensland. Troughton’s Sheath-tailed Bats forage for insects well above tree canopies and high over open habitats. Large, high-flying grasshoppers are preferred food items and often taken back to cave roosts to eat. No information. Troughton’s Sheath-tailed Bat roosts in caves, mines and tunnels, rock crevices, and rocky escarpments. Echolocation call is less than 25 kHz and distinguishes it from the Common Sheath-tailed Bat (. georgianus ) where they co-occur. Movements, Home range and Social organization. Large colonies of Troughton’s Sheath-tailed Bat can be found in landscapes with abundant rocky outcrops, especially in tower karst. Colony size might be limited by roosting structures, especially in more arid areas where there are few caves deep enough to support large colonies.   Classified as Least Concern on TheIUCNRed List. 
Troughton’s Sheath-tailed Bat has a large distribution and presumably large and stable overall population, uses a wide variety of habitats, occurs in protected areas, and does not face significant threats. It was originally recorded only from a small area in the Mount Isa Inland bioregion of Queensland, but recent studies based on isozymes and echolocation calls extend distribution further east throughout much of interior and near coastal region of central Queensland, formerly attributed to the Common Sheathtailed Bat. Recent reports of absence of Troughton’s Sheath-tailed Bat in western parts of its distribution require additional verification, possibly leading to re-evaluation of its conservation status after taxonomic issues are clarified. Chimimba & Kitchener (1991), Hall (2008b), McKean & Price (1967), Reardon & Thomson (2002), Tate (1952),Thomson eta /. (2001), Woinarski eta/. (2014). http://doi.org/10.5281/zenodo.3740269 355 Fig 1, Fig 2
          Epomophorus intermedius (from DocTitle in xml)       http://treatment.plazi.org/id/03AD87FAFFEFF6018C6735CDF9D4F7E3                             http://doi.org/10.5281/zenodo.6448973 101  
jhpoelen commented 2 years ago

In a recent copy of treatments-xml, I was able to locate 461 treatments that mention Handbook of the Mammals of the World [something] Volume 9.

$ preston track  https://github.com/plazi/treatments-xml/archive/master.zip\
 | preston grep "Handbook of the Mammals of the World.*Volume 9"\
 | tee hbmw_9.nq.txt

with

$ cat hbmw_9.nq.txt | grep -o -E "[A-F0-9]+[.]xml" | sort | uniq | wc -l
461
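
For readers who prefer to post-process the n-quads file in a script, the `grep | sort | uniq | wc -l` pipeline above can be sketched in Python. This is only an illustration: the function name is made up, and the sample lines stand in for the real contents of hbmw_9.nq.txt. Only the regular expression is taken from the command above.

```python
import re

# Same pattern as the grep above: an upper-case hex document id followed by ".xml".
TREATMENT_FILE = re.compile(r"[A-F0-9]+[.]xml")


def count_unique_treatments(lines):
    """Count distinct treatment file names mentioned in an iterable of text lines."""
    names = set()
    for line in lines:
        # findall mirrors `grep -o`: every match on the line is collected.
        names.update(TREATMENT_FILE.findall(line))
    return len(names)


# Stand-in input; real input would be the lines of hbmw_9.nq.txt.
sample = [
    "https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F222097788B1EF4BAFCBDF99E.xml",
    "https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F222097788B1EF4BAFCBDF99E.xml",
    "https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F2221977F8ED7FFC8FBC1F320.xml",
]
print(count_unique_treatments(sample))
```

Using a set replaces the `sort | uniq` step, so the input does not need to be sorted first.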

hbmw_9_internet-aliases.txt hbmw_9_preston-coordinates.txt hbmw_9.nq.txt

for those of you who like clicking, here are the first 10 transient internet aliases (aka urls) for the treatments:

cat hbmw_9_internet-aliases.txt | head produced:

https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F222097788B1EF4BAFCBDF99E.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F2221977F8ED7FFC8FBC1F320.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F2223977C8F09F4DEF81EFB14.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F2224977B8E0DF4DFF949FD3B.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F222797788EC5F396F852FB4D.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/83/24/0383245F222797798BC0F1E7FB9BF74D.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/89/2A/03892A31FF8B145CE1C64F7DF800E39D.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/89/2A/03892A31FF8B145CE4C64BADFDFDE39B.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/89/2A/03892A31FF8C145AE1E74065F88AE79F.xml
https://github.com/plazi/treatments-xml/blob/main/data/03/89/2A/03892A31FF8D145AE1354082FC71E59C.xml

jhpoelen commented 2 years ago

fyi @myrmoteras @mguidoti - please confirm that 461 treatments were transcribed from Mammals of The World Volume 9. If not, please suggest alternate method to better select these treatments from the Plazi corpus.

jhpoelen commented 2 years ago

I think I am finding variations in how the title Handbook of the Mammals of the World is written -

e.g.,

$ preston cat 'zip:hash://sha256/28b14cb51da9669062e56b46baf94b928a4db800dac9dc598ef12e285b272ee5!/treatments-xml-main/data/03/D5/87/03D587F2FFC94C03F8F13AECFBD8F765.xml' | head -n1
[...] Handbook of the Mammals of the World, Vol. 9, Lyny Edicions [...]

whereas

$ preston cat 'zip:hash://sha256/28b14cb51da9669062e56b46baf94b928a4db800dac9dc598ef12e285b272ee5!/treatments-xml-main/data/03/83/24/0383245F222097788B1EF4BAFCBDF99E.xml' | head -n1
[...] Handbook of the Mammals of the World – Volume 9 Bats, Barcelona: Lynx Edicions [...] 

I am curious what accounts for these variations.
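
One way to cope with such variation is a more tolerant pattern. The sketch below is an assumption, not the pattern `preston grep` was run with: it accepts both `Vol. 9` and `Volume 9` after the title, with a few punctuation or space characters in between. The third sample string is made up to show a non-match.

```python
import re

# Hypothetical tolerant pattern: the title, up to five non-word characters
# (comma, dash, spaces), then "Vol." or "Volume", then the number 9.
HBMW_9 = re.compile(
    r"Handbook of the Mammals of the World\W{0,5}Vol(?:\.|ume)\s*9\b"
)

origins = [
    "Handbook of the Mammals of the World, Vol. 9, Lyny Edicions",
    "Handbook of the Mammals of the World – Volume 9 Bats, Barcelona: Lynx Edicions",
    "Handbook of the Mammals of the World, Vol. 8",  # made-up counter-example
]
for origin in origins:
    print(bool(HBMW_9.search(origin)), origin)
```

The trailing `\b` keeps the pattern from matching a volume number that merely starts with 9. Searching by ISBN avoids the spelling question altogether.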

jhpoelen commented 2 years ago

Would it be better to search for ISBN 978-84-16728-19-0 to point to https://www.lynxeds.com/product/handbook-of-the-mammals-of-the-world-volume-9/ ?

jhpoelen commented 2 years ago

A search for ISBN 978-84-16728-19-0 via:

$ preston track  https://github.com/plazi/treatments-xml/archive/master.zip\
 | preston grep "978-84-16728-19-0"\
 > hbmw_9_isbn.nq.txt

yields:

$ cat hbmw_9_isbn.nq.txt | grep -o -E "[A-F0-9]+[.]xml" | sort | uniq | wc -l
613

slightly more individual treatment references.

hbmw_9_isbn.nq.txt

myrmoteras commented 2 years ago

@jhpoelen here you get all the treatments of the HBMW-9 https://tb.plazi.org/GgServer/srsStats/stats?outputFields=doc.uuid+doc.name+doc.articleUuid+tax.name+tax.rank+tax.familyEpithet+tax.genusEpithet+tax.speciesEpithet&groupingFields=doc.uuid+doc.name+doc.articleUuid+tax.name+tax.rank+tax.familyEpithet+tax.genusEpithet+tax.speciesEpithet&FP-doc.name=%22hbmw_9%25%22%20%22hbmw-9%25%22&format=JSON

you can get all the taxpub versions of this by changing the treatment URL from https://tb.plazi.org/GgServer/html/E84887F9FFC4D64A0ACCF8EE14183AA4 to https://tb.plazi.org/GgServer/taxPubL1/E84887F9FFC4D64A0ACCF8EE14183AA4

ie /html/ to /taxPubL1/
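
The URL rewrite described above can be sketched as a small helper. The function name is illustrative, and the sketch assumes that every treatment URL contains the `/GgServer/html/` path segment seen in the example.

```python
def to_taxpub_url(treatment_url: str) -> str:
    """Turn a TreatmentBank /html/ treatment URL into its /taxPubL1/ counterpart."""
    marker = "/GgServer/html/"
    if marker not in treatment_url:
        # Fail loudly rather than return a URL that was not actually rewritten.
        raise ValueError(f"not an /html/ treatment URL: {treatment_url}")
    return treatment_url.replace(marker, "/GgServer/taxPubL1/", 1)


print(to_taxpub_url("https://tb.plazi.org/GgServer/html/E84887F9FFC4D64A0ACCF8EE14183AA4"))
```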

Eventually you should also be able to download the treatments from:

http://denver.hesge.ch:5601/s/ebiodiv user : eBioDiv pwd : eBioDiv2022

Switch to Lucene instead of KQL (to the right of the search bar) to use the Lucene query language, e.g. article-title:(bat OR bats). More information at https://www.lucenetutorial.com/lucene-query-syntax.html.

But, PLEASE note, this IS WORK IN PROGRESS, and we would appreciate it if you told us how you want to access the data, so that we can provide access that suits you.

flsimoes commented 2 years ago

Hi @jhpoelen

Donat asked me to help out with this. What's the current issue we need to solve or understand?

myrmoteras commented 2 years ago

@jhpoelen you are free to use whatever data format you like. However, we created the taxpub version described above specifically for data extraction and TDM, and we thus strongly recommend using it. We are also interested in getting feedback on it.

jhpoelen commented 2 years ago

@flsimoes thanks for your message and apologies for the delay.

I took another stab at discovering Handbook of the Mammals of the World related treatments in openly available Plazi resources.

To mine the Plazi corpus, I first got a recent copy of the https://github.com/plazi/treatments-xml using

$ mkdir preston-plazi
$ cd preston-plazi
$ preston track  https://github.com/plazi/treatments-xml/archive/master.zip

then, in the same directory, I extracted all xml blobs mentioning "Handbook of the Mammals of the World" by:

$ preston ls | preston plazi-stream | grep "Handbook of the Mammals of the World" > hmw.json

and because tables are friendly to spreadsheet programs, I converted the json to tsv using:

$ cat hmw.json\
|  jq -f schema.jq\
| mlr --ijson --otsvlite cat\
| tee hmw.tsv
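
For readers without jq and mlr installed, the JSON-to-TSV step can be approximated in Python. This is a rough sketch rather than an equivalent of the pipeline above: the column list is an illustrative subset, and flattening embedded whitespace is an assumption about what a spreadsheet-friendly table needs.

```python
import json

# Illustrative subset of the keys emitted by plazi-stream, in output order.
COLUMNS = ["docId", "docOrigin", "docISBN", "docName", "name", "habitat"]


def json_lines_to_tsv(lines, columns=COLUMNS):
    """Convert JSON-lines records to a TSV string, leaving missing keys empty."""
    rows = ["\t".join(columns)]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Collapse tabs and newlines inside values so they cannot break the table.
        rows.append(
            "\t".join(" ".join(str(record.get(c, "")).split()) for c in columns)
        )
    return "\n".join(rows) + "\n"


sample = ['{"docId": "03D587F2FFC94C03F8F13AECFBD8F765", "name": "Taphozous troughtoni"}']
print(json_lines_to_tsv(sample, columns=["docId", "name", "habitat"]))
```

To convert the real file, pass an open file handle for hmw.json as `lines`.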

Note that the generated json and related tsv are derived from a more loosely formatted plazi xml. Due to understandable variation in OCR (turning an image/pdf into text), the segmentation and parsing of the text into parts is variable. The example below is an attempt to work towards @ajacsherman's desired tabular form described earlier in https://github.com/bio-guoda/preston/issues/171#issuecomment-1117615287 .

Here is the first line from hmw.json, formatted by jq:

$ cat hmw.json | head -n1 | jq .
{
  "http://www.w3.org/ns/prov#wasDerivedFrom": "zip:hash://sha256/669b07bf81a1e35383e3d83458751684d7416b0b75f4f425f8476a44b1119f42!/03D587F2FFC94C03F8F13AECFBD8F765.xml",
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "application/plazi+xml",
  "docId": "03D587F2FFC94C03F8F13AECFBD8F765",
  "docName": "hbmw-9.emballorunidae.pdf.imd",
  "docOrigin": "Handbook of the Mammals of the World, Vol. 9, Lyny Edicions",
  "docISBN": "978-84-16728-19-0",
  "interpretedAuthorityName": "Tate",
  "interpretedAuthorityYear": "1952",
  "interpretedClass": "Mammalia",
  "interpretedFamily": "Emballonuridae",
  "interpretedGenus": "Taphozous",
  "interpretedKingdom": "Animalia",
  "interpretedOrder": "Chiroptera",
  "interpretedPageId": "6",
  "interpretedPageNumber": "355",
  "interpretedPhylum": "Chordata",
  "interpretedRank": "species",
  "interpretedSpecies": "troughtoni",
  "name": "Taphozous troughtoni",
  "distribution": "NE Australia endemic, in WC, C & E Queensland.",
  "distributionImageURL": "https://zenodo.org/record/3747930/files/figure.png",
  "eats": "Troughton’s Sheath-tailed Bats forage for insects well above tree canopies and high over open habitats. Large, high-flying grasshoppers are preferred food items and often taken back to cave roosts to eat.",
  "activity": "Troughton’s Sheath-tailed Bat roosts in caves, mines and tunnels, rock crevices, and rocky escarpments. Echolocation call is less than 25 kHz and distinguishes it from the Common Sheath-tailed Bat (. georgianus ) where they co-occur . Movements, Home range and Social organization. Large colonies of Troughton’s Sheath-tailed Bat can be found in landscapes with abundant rocky outcrops, especially in tower karst. Colony size might be limited by roosting structures, especially in more arid areas where there are few caves deep enough to support large colonies.",
  "bibliography": "Chimimba & Kitchener (1991), Hall (2008b), McKean & Price (1967), Reardon & Thomson (2002), Tate (1952),Thomson eta /. (2001), Woinarski eta/. (2014).",
  "habitat": "Wide variety of habitats and bioregions of interior Queensland."
}
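
The `http://www.w3.org/ns/prov#wasDerivedFrom` value in the record above appears to combine the content hash of a zip archive with the path of an entry inside it, separated by `!/`. The sketch below splits such a value into its parts. The layout is inferred from the examples in this thread, and the class and function names are made up for illustration. It is not an official parser.

```python
from typing import NamedTuple


class ZipCoordinate(NamedTuple):
    algorithm: str   # e.g. "sha256"
    digest: str      # hex digest identifying the zip archive
    entry_path: str  # path of the file inside the archive


def parse_zip_coordinate(uri: str) -> ZipCoordinate:
    """Split 'zip:hash://<algorithm>/<hex>!/<entry>' into its three parts."""
    prefix = "zip:hash://"
    if not uri.startswith(prefix) or "!/" not in uri:
        raise ValueError(f"unexpected coordinate: {uri}")
    archive, entry_path = uri[len(prefix):].split("!/", 1)
    algorithm, digest = archive.split("/", 1)
    return ZipCoordinate(algorithm, digest, entry_path)


coordinate = parse_zip_coordinate(
    "zip:hash://sha256/669b07bf81a1e35383e3d83458751684d7416b0b75f4f425f8476a44b1119f42"
    "!/03D587F2FFC94C03F8F13AECFBD8F765.xml"
)
print(coordinate.algorithm, coordinate.entry_path)
```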

and the corresponding tabular representation:

$ cat hmw.tsv | head -n2 
docId   docOrigin   docISBN docName derivedFrom name    habitat eats    distribution    distributionImageURL    activity    bibliography    interpretedGenus    interpretedSpecies
03D587F2FFC94C03F8F13AECFBD8F765    Handbook of the Mammals of the World, Vol. 9, Lyny Edicions 978-84-16728-19-0   hbmw-9.emballorunidae.pdf.imd   zip:hash://sha256/669b07bf81a1e35383e3d83458751684d7416b0b75f4f425f8476a44b1119f42!/03D587F2FFC94C03F8F13AECFBD8F765.xml    Taphozous troughtoni    Wide variety of habitats and bioregions of interior Queensland. Troughton’s Sheath-tailed Bats forage for insects well above tree canopies and high over open habitats. Large, high-flying grasshoppers are preferred food items and often taken back to cave roosts to eat.    NE Australia endemic, in WC, C & E Queensland.  https://zenodo.org/record/3747930/files/figure.png  Troughton’s Sheath-tailed Bat roosts in caves, mines and tunnels, rock crevices, and rocky escarpments. Echolocation call is less than 25 kHz and distinguishes it from the Common Sheath-tailed Bat (. georgianus ) where they co-occur . Movements, Home range and Social organization. Large colonies of Troughton’s Sheath-tailed Bat can be found in landscapes with abundant rocky outcrops, especially in tower karst. Colony size might be limited by roosting structures, especially in more arid areas where there are few caves deep enough to support large colonies.   Chimimba & Kitchener (1991), Hall (2008b), McKean & Price (1967), Reardon & Thomson (2002), Tate (1952),Thomson eta /. (2001), Woinarski eta/. (2014).  Taphozous   troughtoni

hmw.tsv.gz hmw.json.gz hmw.tsv.txt hmw.csv

More discussion likely to follow, thanks for being patient as I am trying to understand the structure and provenance of the plazi resources.

PS To help explore the transformed plazi data, please see also https://docs.google.com/spreadsheets/d/1op5c_J59F2jIoi9YAnfBrPy3B9rFJ338sH2zQwJeU9o/edit?usp=sharing that uses the =LOADDATA("https://github.com/bio-guoda/preston/files/8909933/hmw.csv") to load the csv version of the data into a sheet.

Screenshot from 2022-06-15 08-57-08

myrmoteras commented 2 years ago

@jhpoelen why do you use JSON? https://tb.plazi.org/GgServer/srsStats

just add the respective file format at the end, using format=

https://tb.plazi.org/GgServer/srsStats/stats?outputFields=bib.source&groupingFields=bib.source&FP-bib.source=%22Handbook%20of%25%22&format=HTML

instead of

https://tb.plazi.org/GgServer/srsStats/stats?outputFields=bib.source&groupingFields=bib.source&FP-bib.source=%22Handbook%20of%25%22&format=JSON

or for TSV

https://tb.plazi.org/GgServer/srsStats/stats?outputFields=bib.source&groupingFields=bib.source&FP-bib.source=%22Handbook%20of%25%22&format=TSV
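
Switching the output format as described above amounts to changing one query parameter. A minimal sketch using only the Python standard library follows. The helper name is an assumption, and the sketch does not check which `format` values the server actually accepts.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def with_format(stats_url: str, fmt: str) -> str:
    """Return stats_url with its 'format' query parameter set to fmt."""
    parts = urlsplit(stats_url)
    # Keep every parameter except an existing 'format', then append the new one.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    params.append(("format", fmt))
    # quote (rather than the default quote_plus) encodes spaces as %20.
    return urlunsplit(parts._replace(query=urlencode(params, quote_via=quote)))


html_url = (
    "https://tb.plazi.org/GgServer/srsStats/stats?outputFields=bib.source"
    "&groupingFields=bib.source&FP-bib.source=%22Handbook%20of%25%22&format=HTML"
)
print(with_format(html_url, "TSV"))
```

Parsing and re-encoding the query keeps the percent-encoded filter value (`%22Handbook%20of%25%22`) intact, which a naive string replacement on `HTML` would not guarantee for every URL.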

jhpoelen commented 2 years ago

hey @myrmoteras - thanks for asking. I am using versioned Plazi resources instead of the dynamic indexes and web services that Plazi provides. This way, I better understand the provenance or origin of the data. In addition, by looking at the building blocks (or knowledge atoms) of Plazi's treatment universe, I can likely better articulate which parts of the handled treatments I have questions about.

What is the origin of the dynamic queries that you mentioned above (e.g., https://tb.plazi.org/GgServer/srsStats/stats?outputFields=bib.source&groupingFields=bib.source&FP-bib.source=%22Handbook%20of%25%22&format=TSV) ?

myrmoteras commented 2 years ago

the origin of the queries is https://tb.plazi.org/GgServer/srsStats for data and statistics from treatments, and https://tb.plazi.org/GgServer/dioStats for anything related to article, including summaries of data therein, eg number of treatments, pages, etc.

jhpoelen commented 2 years ago

@myrmoteras thanks for clarifying. Does that mean that https://tb.plazi.org/GgServer/srsStats uses https://github.com/plazi/treatments-xml ?

jhpoelen commented 2 years ago

Some stats on the plazi treatment xml corpus -

$ du -d1 -h preston-plazi
760M    preston-plazi/data
769M    preston-plazi

or >750MB in compressed xml files. That is a lot! Close to the size of a single episode of Game of Thrones in High Definition.

myrmoteras commented 2 years ago

still not at the base yet - all the IMFs.... need to ask Guido how many TB.

and the 750 MB are only a minuscule part of what we know....

myrmoteras commented 2 years ago

> @myrmoteras thanks for clarifying. Does that mean that https://tb.plazi.org/GgServer/srsStats uses https://github.com/plazi/treatments-xml ?

no. there is an XSLT in between that posts into treatments-xml.

We are working on writing this up...

jhpoelen commented 2 years ago

@myrmoteras I appreciate your responses. I am learning a lot.

Yes, I was talking about the Plazi corpus of "annotated text" as highlighted in steps 3/4 in https://www.globalbioticinteractions.org/plazi-zenodo/#plazi-transcription-provenance . I would expect this corpus to be much smaller than the scanned pages and related raw OCR outputs. I'd be curious to learn about your storage requirements: what is the volume of the IMFs related to the treatments referenced via treatments-xml ?

Would it be fair to say that the IMFs are involved in steps 1/2 ? Where do you store these? How do you keep track of the versions of these IMFs ?

Neat to hear that you are using XSLT to process treatment XML . Do you have examples of these style sheets? Do you have different style sheets for different publications? E.g., does Handbook of the Mammals of the World have a different style sheet than, let's say, Zootaxa ?

jhpoelen commented 2 years ago

@ajacsherman shared another example of a bat treatment in desired tabular form

MOW Bible example - Sheet1.tsv.txt

see also:

docID docOrigin docISBN docName derivedFrom name interpretedGenus interpretedSpecies page number Other names taxonomy synonyms* distribution descriptive notes habitat food and feeding breeding activity patterns movements, home range, social organization status and conservation bibliography distributionImageURL comments
03BD87A2C660A212FF52F3F7F35547D7 Handbook of the Mammals of the World – Volume 9 Bats, Barcelona: Lynx Edicions 978-84-16728-19-0 hbmw_9_Hipposideridae_210.pdf.imf zip:hash://sha256/5146700132c798f057756c6fde84a3d4c426bdc372dbec6ba18ce4125aa8353b!/treatments-xml-main/data/03/BD/87/03BD87A2C660A212FF52F3F7F35547D7.xml Hipposideros tephrus Hipposideros tephrus 249 Maghreb Leaf-nosed bat, Phyllorhine cendree, Maghreb-Rundblattnase... Hipposideros [sic] tephrus Cabreera... Hipposideros caffer caffer Extent of this species' distribution is not yet known... Head-body 45-50... Inhabits riparian forest... The Maghreb Leaf-nosed Bat is likely to be insectivorous. Based on observations in northern Nigeria... The Maghreb Leaf-nosed Bat roosts in a variety of situations including caves, and holes in the ground. Echolocation call includes a F component at c.140-150 kHz. The Maghreb Leaf-nosed Bat may roost in large colonies of up to...   Aellen (1952), Bernard & Happold (2013b), Harrison & Bates (1991), Hill (1963a), Koopman (1989), Koopman et al. (1995), Nader (1982), Vallo et al. (2008), Van Cakenberghe et al. (2017).    
myrmoteras commented 2 years ago

> Would it be fair to say that the IMFs are involved in steps 1/2 ? Where do you store these? How do you keep track of the versions of these IMFs ?

IMFs come after the annotated text, and they are what the XML files are eventually produced from.

The IMFs are currently stored on two hard drives in our data center. Next week we will become part of an inter-institutional tape library that works like LOCKSS, with multiple copies at different, geographically widely separated locations.

Each IMF stores the annotation history and thus changes can be reversed. The changes are part of the IMF. http://plazi.org/data-apis-tools/image-markup-file/

jhpoelen commented 2 years ago

@myrmoteras good to know that your image markup files contain an annotation history and will be backed up on tape soon!

I am assuming that these IMF files are not openly available. Is that correct? If so, did you consider sharing the hash (or content Id) of the IMF file in addition to their filename so that a matching IMF file used to generate an xml can be reliably referenced?

Just curious - how big is your corpus of IMF files and how are you keeping track of them?

myrmoteras commented 2 years ago

55,800 IMF files; TreatmentBank administers them.

IMFs are not open because of copyright reasons. Otherwise, there would not be a problem.

for these technical questions regarding hashes etc., you need to talk to @gsautter

gsautter commented 2 years ago

> I am assuming that these IMF files are not openly available. Is that correct? If so, did you consider sharing the hash (or content Id) of the IMF file in addition to their filename so that a matching IMF file used to generate an xml can be reliably referenced?

In fact, we do share the (MD5) hash of the PDF that an IMF and its descendant XML were produced from: it doubles as the document UUID, which is an approach born out of the need to prevent duplicates. Said UUID becomes the masterDocId attribute in treatments, and is dubbed the "Article UUID" in the TreatmentBank statistics (https://tb.plazi.org/GgServer/srsStats)

jhpoelen commented 2 years ago

@gsautter - great to hear that you are using md5 hashes to identify the master pdf docs. I've added a notation to identify the master doc using: e.g., "docMasterId" : "hash://md5/ffecff8affcf4c04ffa53577fff8ffe9"
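
The docMasterId notation can be sketched in a few lines. The helper names are made up, and the sketch assumes that the master document id arrives as a 32-character hex string, in upper case like the document ids shown in this thread. The second helper shows how the same identifier could be computed from a locally held copy of the PDF.

```python
import hashlib


def master_doc_id_to_hash_uri(master_doc_id: str) -> str:
    """Rewrite a 32-character hex master document id as a hash://md5/ URI."""
    hex_digest = master_doc_id.strip().lower()
    if len(hex_digest) != 32 or any(c not in "0123456789abcdef" for c in hex_digest):
        raise ValueError(f"not an MD5 hex digest: {master_doc_id}")
    return f"hash://md5/{hex_digest}"


def md5_hash_uri(content: bytes) -> str:
    """Compute the hash://md5/ URI of some bytes, e.g. the bytes of a PDF."""
    return f"hash://md5/{hashlib.md5(content).hexdigest()}"


# Upper-case form assumed for illustration; the lower-case URI is the one quoted above.
print(master_doc_id_to_hash_uri("FFECFF8AFFCF4C04FFA53577FFF8FFE9"))
```

If the two helpers return the same URI for an id and a file, the file is a byte-for-byte match of the master document.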

With this, you can even use preston to try and find the associated pdf via zenodo or other content registries . . . if they were available.

e.g.,

$ preston cat --algo md5 --remote https://zenodo.org hash://md5/ffecff8affcf4c04ffa53577fff8ffe9

for the master pdf related to plazi doc id 03BD87A2C660A212FF52F3F7F35547D7 with content hash zip:hash://sha256/5146700132c798f057756c6fde84a3d4c426bdc372dbec6ba18ce4125aa8353b!/treatments-xml-main/data/03/BD/87/03BD87A2C660A212FF52F3F7F35547D7.xml .

But I guess the master pdf is either not on Zenodo, or Zenodo doesn't expose the md5 of "locked"/"closed" files. I wonder which one it is . . . it'd be fun to know that the file exists on Zenodo, but cannot be accessed somehow.

By the way, here's a working example of using Zenodo as a content-based repository -

$ preston cat --algo md5 --remote https://zenodo.org hash://md5/dc675166d4401cea591b61341da30fd4 > poster.pdf 

with poster.pdf being the poster related to:

Poelen, Jorrit H., Wommack, Elisabeth A., Doll, Andrew C., & Mayfield-Meyer, Teresa J. (2022). Biotic Interactions In Natural History Collections: Continuing to Extend Digital Records across Communities, Platforms, Collections, and Institutions (0.1). Zenodo. https://doi.org/10.5281/zenodo.6642868 .

A bit of a side topic, but why not use the same hash-based approach for intermediate files (e.g., IMF, descendant xml) ?

jhpoelen commented 2 years ago

@cboettig thought you might be interested in the proliferation of content ids in biodiversity data universe . . . ; )

jhpoelen commented 2 years ago

I've attempted to incorporate @ajacsherman's and Cullen G.'s feedback by adding fields to preston's plazi-stream

$ time preston ls | preston plazi-stream | pv -l | grep "Handbook of the Mammals of the World" > hmw.json
[Fatal Error] :1:1: Content is not allowed in prolog.                                                               ]
[Fatal Error] :1:1: Content is not allowed in prolog.
 163k 0:08:58 [ 303 /s] [                                                                 <=>                       ]

real    8m58.641s
user    9m13.648s
sys 0m19.384s

On my 2011 laptop, I was able to process a recent Plazi treatments-xml corpus in less than 20 minutes, generating the attached file.

hmw.json.gz

with the first record being -

$ cat hmw.json.gz | gunzip | head -n1 | jq .
{
  "http://www.w3.org/ns/prov#wasDerivedFrom": "zip:hash://sha256/669b07bf81a1e35383e3d83458751684d7416b0b75f4f425f8476a44b1119f42!/03D587F2FFC94C03F8F13AECFBD8F765.xml",
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "application/plazi+xml",
  "docId": "03D587F2FFC94C03F8F13AECFBD8F765",
  "docName": "hbmw-9.emballorunidae.pdf.imd",
  "docOrigin": "Handbook of the Mammals of the World, Vol. 9, Lyny Edicions",
  "docMasterId": "hash://md5/ffecff8affcf4c04ffa53577fff8ffe9",
  "docISBN": "978-84-16728-19-0",
  "docPageNumber": "355",
  "commonNames": "Troughton’s Sheath-tailed Bat @en | Taphien de Troughton @fr | Troughton-Grabfledermaus @de | Tafozo de Troughton @es | Troughton's Tomb Bat @en",
  "interpretedAuthorityName": "Tate",
  "interpretedAuthorityYear": "1952",
  "interpretedClass": "Mammalia",
  "interpretedFamily": "Emballonuridae",
  "interpretedGenus": "Taphozous",
  "interpretedKingdom": "Animalia",
  "interpretedOrder": "Chiroptera",
  "interpretedPageId": "6",
  "interpretedPageNumber": "355",
  "interpretedPhylum": "Chordata",
  "interpretedRank": "species",
  "interpretedSpecies": "troughtoni",
  "name": "Taphozous troughtoni",
  "taxonomy": "Taphozous troughtoni Tate, 1952 , “ Rifle Creek, Mt. Isa, northwest Queensland ,” Australia .Taphozous troughtoni is in the subgenus Taphozous . It was considered ajunior synonym of T georgianus , but. T. Chimimba and D. J. Kitchener in 1991 raised it to a distinct species. Monotypic.",
  "subspeciesAndDistribution": "NE Australia endemic, in WC, C & E Queensland.",
  "distributionImageURL": "https://zenodo.org/record/3747930/files/figure.png",
  "foodAndFeeding": "Troughton’s Sheath-tailed Bats forage for insects well above tree canopies and high over open habitats. Large, high-flying grasshoppers are preferred food items and often taken back to cave roosts to eat.",
  "breeding": "No information.",
  "activityPatterns": "Troughton’s Sheath-tailed Bat roosts in caves, mines and tunnels, rock crevices, and rocky escarpments. Echolocation call is less than 25 kHz and distinguishes it from the Common Sheath-tailed Bat (. georgianus ) where they co-occur . Movements, Home range and Social organization. Large colonies of Troughton’s Sheath-tailed Bat can be found in landscapes with abundant rocky outcrops, especially in tower karst. Colony size might be limited by roosting structures, especially in more arid areas where there are few caves deep enough to support large colonies.",
  "bibliography": "Chimimba & Kitchener (1991) | Hall (2008b) | McKean & Price (1967) | Reardon & Thomson (2002) | Tate (1952) | Thomson et al. (2001) | Woinarski et al. (2014)",
  "movementsHomeRangeAndSocialOrganization": "Large colonies of Troughton’s Sheath-tailed Bat can be found in landscapes with abundant rocky outcrops, especially in tower karst. Colony size might be limited by roosting structures, especially in more arid areas where there are few caves deep enough to support large colonies.",
  "habitat": "Wide variety of habitats and bioregions of interior Queensland.",
  "statusAndConservation": "Classified as Least Concern on TheIUCNRed List. Troughton’s Sheath-tailed Bat has a large distribution and presumably large and stable overall population, uses a wide variety of habitats, occurs in protected areas, and does not face significant threats. It was originally recorded only from a small area in the Mount Isa Inland bioregion of Queensland, but recent studies based on isozymes and echolocation calls extend distribution further east throughout much of interior and near coastal region of central Queensland, formerly attributed to the Common Sheathtailed Bat. Recent reports of absence of Troughton’s Sheath-tailed Bat in western parts of its distribution require additional verification, possibly leading to re-evaluation of its conservation status after taxonomic issues are clarified.",
  "descriptiveNotes": "Head-body 79-4-86-3 mm, tail 31-5-36-9 mm, ear 22-4-27-1 mm, hindfoot 9-8-10-3 mm, forearm 73-76 mm; weight. 20-29 g. Dorsum of Troughton’s Sheath-tailed Bat is predominately olive­ brown, with pale mouse-gray guard hairs. Venter surface hairs are olive-brown from chin to shoulders and posteriorly dark yellow-brown, with guard hairs of pale mouse-gray. Uropatagium close to abdomen is heavily furred. Throat pouches are absent, and radio-metacarpal sacs are present in both sexes. Skin of rhinarium, wings, uropatagium, lips, face, and tragus are fuscous (pale yellow)."
}
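
The `commonNames` value in the record above packs several names into one string, using ` | ` between entries and a trailing `@` language tag. A sketch of how a consumer might unpack it follows. The function name and the handling of untagged entries are assumptions, not part of plazi-stream.

```python
from collections import defaultdict


def parse_common_names(value: str) -> dict:
    """Group 'Name @lang | Name @lang' entries by language tag."""
    by_language = defaultdict(list)
    for entry in value.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last " @" so names containing "@" elsewhere stay whole.
        name, separator, language = entry.rpartition(" @")
        if not separator:
            # No language tag found: file the whole entry under an empty tag.
            name, language = entry, ""
        by_language[language.strip()].append(name.strip())
    return dict(by_language)


names = parse_common_names(
    "Troughton’s Sheath-tailed Bat @en | Taphien de Troughton @fr | "
    "Troughton-Grabfledermaus @de | Tafozo de Troughton @es | Troughton's Tomb Bat @en"
)
print(names["en"])
```

The same split on `|` applies to the `bibliography` field, which uses the same separator without language tags.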

and

$ zcat hmw.json.gz\
 |  jq -f schema.jq\
 | mlr --ijson --ocsv cat\
 > hmw.csv

with schema.jq -

{ 
  "docId" : .docId,
  "docOrigin" : .docOrigin,
  "docISBN" : .docISBN,
  "docName" : .docName,
  "docMasterId" : .docMasterId,
  "docPageNumer" : .docPageNumber,
  "derivedFrom" : .["http://www.w3.org/ns/prov#wasDerivedFrom"],
  "name" : .name,
  "interpretedGenus" : .interpretedGenus,
  "interpretedSpecies" : .interpretedSpecies,
  "commonNames" : .commonNames,
  "taxonomy" : .taxonomy,
  "subspeciesAndDistribution" : .subspeciesAndDistribution,
  "descriptiveNotes": .descriptiveNotes,
  "habitat" : .habitat,
  "foodAndFeeding" : .foodAndFeeding,
  "breeding" : .breeding,
  "activityPatterns" : .activityPatterns,
  "movementsHomeRangeAndSocialOrganization" : .movementsHomeRangeAndSocialOrganization,
  "statusAndConservation": .statusAndConservation,
  "bibliography" : .bibliography,
  "distributionImageURL" : .distributionImageURL
}
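For readers without jq or Miller at hand, the same projection can be sketched in Python. This is only a rough equivalent, not the pipeline used to build hmw.csv: it assumes hmw.json.gz holds one JSON object per line, the function names are made up for illustration, and Python's CSV quoting may differ in detail from Miller's output.

```python
import csv
import gzip
import io
import json

# Output columns in the order listed in schema.jq
# (the "docPageNumer" spelling is kept so headers match hmw.csv).
COLUMNS = [
    "docId", "docOrigin", "docISBN", "docName", "docMasterId",
    "docPageNumer", "derivedFrom", "name", "interpretedGenus",
    "interpretedSpecies", "commonNames", "taxonomy",
    "subspeciesAndDistribution", "descriptiveNotes", "habitat",
    "foodAndFeeding", "breeding", "activityPatterns",
    "movementsHomeRangeAndSocialOrganization", "statusAndConservation",
    "bibliography", "distributionImageURL",
]

# Columns whose input key differs from the output column name.
RENAMED = {
    "docPageNumer": "docPageNumber",
    "derivedFrom": "http://www.w3.org/ns/prov#wasDerivedFrom",
}


def project(record):
    """Pick the schema.jq fields from one treatment record; missing keys become None."""
    return {col: record.get(RENAMED.get(col, col)) for col in COLUMNS}


def json_lines_to_csv_text(lines):
    """Convert an iterable of JSON lines (assumed: one object per line) to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for line in lines:
        if line.strip():
            writer.writerow(project(json.loads(line)))  # None is written as an empty cell
    return out.getvalue()


def convert(src="hmw.json.gz", dst="hmw.csv"):
    """File-level wrapper: gzip'd JSON lines in, CSV out."""
    with gzip.open(src, "rt", encoding="utf-8") as lines:
        text = json_lines_to_csv_text(lines)
    with open(dst, "w", newline="", encoding="utf-8") as out:
        out.write(text)
```

Calling `convert()` in a directory that contains hmw.json.gz would write hmw.csv next to it.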

hmw.csv

also see first three rows at -

docId docOrigin docISBN docName docMasterId docPageNumer derivedFrom name interpretedGenus interpretedSpecies commonNames taxonomy subspeciesAndDistribution descriptiveNotes habitat foodAndFeeding breeding activityPatterns movementsHomeRangeAndSocialOrganization statusAndConservation bibliography distributionImageURL
03D587F2FFC94C03F8F13AECFBD8F765 Handbook of the Mammals of the World, Vol. 9, Lyny Edicions 978-84-16728-19-0 hbmw-9.emballorunidae.pdf.imd hash://md5/ffecff8affcf4c04ffa53577fff8ffe9 355 zip:hash://sha256/669b07bf81a1e35383e3d83458751684d7416b0b75f4f425f8476a44b1119f42!/03D587F2FFC94C03F8F13AECFBD8F765.xml Taphozous troughtoni Taphozous troughtoni Troughton’s Sheath-tailed Bat @en | Taphien de Troughton @fr | Troughton-Grabfledermaus @de | Tafozo de Troughton @es | Troughton's Tomb Bat @en Taphozous troughtoni Tate, 1952 , “ Rifle Creek, Mt. Isa, northwest Queensland ,” Australia .Taphozous troughtoni is in the subgenus Taphozous . It was considered ajunior synonym of T georgianus , but. T. Chimimba and D. J. Kitchener in 1991 raised it to a distinct species. Monotypic. NE Australia endemic, in WC, C & E Queensland. Head-body 79-4-86-3 mm, tail 31-5-36-9 mm, ear 22-4-27-1 mm, hindfoot 9-8-10-3 mm, forearm 73-76 mm; weight. 20-29 g. Dorsum of Troughton’s Sheath-tailed Bat is predominately olive­ brown, with pale mouse-gray guard hairs. Venter surface hairs are olive-brown from chin to shoulders and posteriorly dark yellow-brown, with guard hairs of pale mouse-gray. Uropatagium close to abdomen is heavily furred. Throat pouches are absent, and radio-metacarpal sacs are present in both sexes. Skin of rhinarium, wings, uropatagium, lips, face, and tragus are fuscous (pale yellow). Wide variety of habitats and bioregions of interior Queensland. Troughton’s Sheath-tailed Bats forage for insects well above tree canopies and high over open habitats. Large, high-flying grasshoppers are preferred food items and often taken back to cave roosts to eat. No information. Troughton’s Sheath-tailed Bat roosts in caves, mines and tunnels, rock crevices, and rocky escarpments. Echolocation call is less than 25 kHz and distinguishes it from the Common Sheath-tailed Bat (. georgianus ) where they co-occur . Movements, Home range and Social organization. 
Large colonies of Troughton’s Sheath-tailed Bat can be found in landscapes with abundant rocky outcrops, especially in tower karst. Colony size might be limited by roosting structures, especially in more arid areas where there are few caves deep enough to support large colonies. Large colonies of Troughton’s Sheath-tailed Bat can be found in landscapes with abundant rocky outcrops, especially in tower karst. Colony size might be limited by roosting structures, especially in more arid areas where there are few caves deep enough to support large colonies. Classified as Least Concern on TheIUCNRed List. Troughton’s Sheath-tailed Bat has a large distribution and presumably large and stable overall population, uses a wide variety of habitats, occurs in protected areas, and does not face significant threats. It was originally recorded only from a small area in the Mount Isa Inland bioregion of Queensland, but recent studies based on isozymes and echolocation calls extend distribution further east throughout much of interior and near coastal region of central Queensland, formerly attributed to the Common Sheathtailed Bat. Recent reports of absence of Troughton’s Sheath-tailed Bat in western parts of its distribution require additional verification, possibly leading to re-evaluation of its conservation status after taxonomic issues are clarified. Chimimba & Kitchener (1991) | Hall (2008b) | McKean & Price (1967) | Reardon & Thomson (2002) | Tate (1952) | Thomson et al. (2001) | Woinarski et al. (2014) https://zenodo.org/record/3747930/files/figure.png
0131878A0720FF8FFFAEF82F62E84EE9 Handbook of the Mammals of the World – Volume 6 Lagomorphs and Rodents I, Barcelona: Lynx Edicions 978-84-941892-3-4 hbmw_6_Geomyidae_0234.pdf.imf hash://md5/fd08fff2072cff83fff3fff96b0f4602 261 zip:hash://sha256/5146700132c798f057756c6fde84a3d4c426bdc372dbec6ba18ce4125aa8353b!/treatments-xml-main/data/01/31/87/0131878A0720FF8FFFAEF82F62E84EE9.xml Geomys pinetis Geomys pinetis Gaufre des pinédes @fr | Stidostliche Taschenratte @de | Tuza suroriental @es | Sandy Mounder; Colonial Pocket Gopher (colonus) , Cumberland @en | sland Pocket Gopher (cumberlandius) , Goff's Pocket Gopher (goffi) , Sherman's Pocket Gopher (fontanelus) @en       Sandy, well-drained soils in habitats dominated by longleaf pines (Pinus palustris), turkey oaks (Quercus laevis), or live oaks (Q. virginiana). There is no specific information available for this species, but the South-eastern Pocket Gopher probably feeds on roots, tubers, stems, and leaves of most plants available within the vicinity ofits burrow system. It readily invades cultivated fields and is considered an agricultural pest wherever it occurs in contact with humans. As in all other pocket gophers, the burrow system is a series of shallow feeding tunnels radiating spoke-like from a deeper, central network that contains one or more nest chambers and several smaller chambers for storage of food or fecal pellets. The South-eastern Pocket Gopher appears to breed throughout the year, with major peaks in February-March and June-August. Each female produces 1-2 litters/year, and litters have 1-3 young. Young are weaned in c.30 days and reach reproductive maturity in 4-6 months. There is no specific information available for this species, but the South-eastern Pocket Gopheris probably active at any hour of the day, with periods of peak activity around dawn and dusk. It does not hibernate and is active yearround.   Classified as Least Concern on The IUCN Red List. Chambers et al. 
(2009) | Harper (1952) | Linzey & NatureServe (Hammerson) (2008p) | Patton (2005b) | Pembleton & Williams (1978) | Sherman (1940 | 1944) | Sudman et al. (2006) | Williams | S.L. (1999c)  
0131878A0722FF8CFA91F446685A4FCD Handbook of the Mammals of the World – Volume 6 Lagomorphs and Rodents I, Barcelona: Lynx Edicions 978-84-941892-3-4 hbmw_6_Geomyidae_0234.pdf.imf hash://md5/fd08fff2072cff83fff3fff96b0f4602 263 zip:hash://sha256/5146700132c798f057756c6fde84a3d4c426bdc372dbec6ba18ce4125aa8353b!/treatments-xml-main/data/01/31/87/0131878A0722FF8CFA91F446685A4FCD.xml Heterogeomys hispidus Heterogeomys hispidus Gaufre hérissé @fr | Borstige Taschenratte @de | Tuza hirsuta @es       Well-drained soils in a wide variety of habitats ranging from perennial tropical forests at high elevations to arid thornscrub habitats at low elevations. Elevational range extends from near sea level to ¢.2500 m. There is no specific information available for this species, but the Hispid Pocket Gopher probably feeds on roots, tubers, stems, and leaves of most plants available within the vicinity of its burrow system. It readily invades cultivated fields and is considered an agricultural pest whereverit occurs in contact with humans. As in all other pocket gophers, the burrow system is a series of shallow feeding tunnels radiating spoke-like from a deeper, central network that contains one or more nest chambers and several smaller chambers for storage of food or fecal pellets. Burrow systems of the Hispid Pocket Gopher can be 60 m or more in length and exceed 1 m in depth. The Hispid Pocket Gopher breeds year-round, with increased activity in October—June. Most females have two young perlitter. There is no specific information available for this species, but the Hispid Pocket Gopher is probably active at any hour ofthe day, with periods of peak activity around dawn and dusk. It does not hibernate and is active year-round.   Classified as Least Concern on The [UCN Red List (as Orthogeomys hispidus ). Ceballos (2014) | Hafner (1983) | Hafner & Hafner (1987) | Merriam (1895) | Patton (2005b) | Spradling et al. (2016) | Vazquez | Emmons | Reid & Cuarén (2008d)  
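The derivedFrom values in the rows above combine an archive content hash with a path inside that archive. Below is a minimal sketch for splitting the two, assuming the `zip:<content hash>!/<entry path>` layout seen in the sample rows holds for the whole corpus; `ZipEntryRef` and `parse_derived_from` are hypothetical names, not part of preston.

```python
from typing import NamedTuple


class ZipEntryRef(NamedTuple):
    archive_hash: str  # e.g. "hash://sha256/<hex>" of the zip archive
    entry_path: str    # path of the treatment XML inside the archive


def parse_derived_from(value):
    """Split 'zip:<content hash>!/<entry path>' into its two parts."""
    prefix = "zip:"
    if not value.startswith(prefix) or "!/" not in value:
        raise ValueError("unexpected derivedFrom value: " + value)
    # Split on the first "!/" only, so entry paths may contain further slashes.
    archive_hash, entry_path = value[len(prefix):].split("!/", 1)
    return ZipEntryRef(archive_hash, entry_path)


ref = parse_derived_from(
    "zip:hash://sha256/669b07bf81a1e35383e3d83458751684d7416b0b75f4f425f8476a44b1119f42"
    "!/03D587F2FFC94C03F8F13AECFBD8F765.xml"
)
print(ref.archive_hash)
print(ref.entry_path)
```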
gsautter commented 2 years ago

for the master pdf related to plazi doc id 03BD87A2C660A212FF52F3F7F35547D7 with content hash zip:hash://sha256/5146700132c798f057756c6fde84a3d4c426bdc372dbec6ba18ce4125aa8353b!/treatments-xml-main/data/03/BD/87/03BD87A2C660A212FF52F3F7F35547D7.xml.

It could be (I think I observed it at some point) that Zenodo actually exposes a different hash than what I get from using the JRE-provided MD5 digest ... the latter is reproducible and stable across all GGI installations I'm aware of, though. However, there is another way: we also include the IMF ID from our end as an alternative identifier in the Zenodo depositions, in this case urn:lsid:plazi.org:pub:FFECFF8AFFCF4C04FFA53577FFF8FFE9 (as an LSID) and http://publication.plazi.org/id/FFECFF8AFFCF4C04FFA53577FFF8FFE9 (as an HTTP URI) ... hope that helps.
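In the first sample row earlier in this thread, the docMasterId (hash://md5/ffecff8affcf4c04ffa53577fff8ffe9) carries the same hex digits as the IMF ID quoted here. Assuming that correspondence holds for other documents too, the two alternative identifiers can be derived from a docMasterId roughly as follows; `plazi_identifiers` is a hypothetical helper, not a Plazi or Zenodo API.

```python
def plazi_identifiers(doc_master_id):
    """Derive the LSID and HTTP URI forms from a 'hash://md5/<hex>' docMasterId.

    Assumption: the IMF ID is the upper-cased MD5 hex of the master document,
    as in the one example in this thread.
    """
    prefix = "hash://md5/"
    if not doc_master_id.startswith(prefix):
        raise ValueError("expected a hash://md5/ identifier: " + doc_master_id)
    hex_id = doc_master_id[len(prefix):].upper()
    return {
        "lsid": "urn:lsid:plazi.org:pub:" + hex_id,
        "uri": "http://publication.plazi.org/id/" + hex_id,
    }


ids = plazi_identifiers("hash://md5/ffecff8affcf4c04ffa53577fff8ffe9")
print(ids["lsid"])
print(ids["uri"])
```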

But I guess the master pdf is either not on Zenodo, or Zenodo doesn't expose the md5 of "locked"/"closed" files. I wonder which one it is . . . it'd be fun to know that the file exists on Zenodo, but cannot be accessed somehow.

Not sure, but maybe you can search via the two "alternative identifiers" (Zenodo nomenclature) above ... here is the JSON of the respective deposition: https://zenodo.org/record/3740269/export/json#.Yqpvsp1BzQU (this JSON is always accessible, open deposition or not, as the latter only applies to the PDF proper)

jhpoelen commented 2 years ago

@ajacsherman @myrmoteras et al. - I've created a separate data publication for the hmw corpus as seen by Plazi via https://github.com/plazi/treatments-xml at https://github.com/jhpoelen/hmw . You can find a sample table at https://github.com/jhpoelen/hmw/blob/main/hmw-sample.csv with the first 10 records. See the GitHub repo for more data.

Am closing this issue because the first corpus of hmw has been built. Please open new issues that come up related to the hmw corpus at https://github.com/jhpoelen/hmw . Thanks for your feedback and input, and looking forward to continuing the discussion after your review.