#6 Data Hubs data coverage: NCBI, GGBN, GBIF, BHL, BOLDS, iNat

jhammock commented 4 months ago

The resource is at https://opendata.eol.org/dataset/metrics/resource/713e1aa9-c0d5-4a73-9cb9-bffc86f94a4d

I think it's been static since 2015, so the services to consult may be very different now.

The hubs to query still seem reasonable to me, though: NCBI, GGBN, GBIF, BHL, and BOLD. @KatjaSchulz does anything else come to mind? Should we include iNat? The other major sources of GBIF data (collections) are at least represented on their own through the "type specimen repository" resources. TreatmentBank? Maybe not! We do need to draw the line somewhere...

Finally, we did this at the family level originally. Is it time to consider going more granular?

Eli, this should probably go on a schedule once it's updated. I'd say quarterly or monthly, depending on how heavy it is to run.

jhammock commented 2 months ago

Let's ditch all the monomials, then. Any string containing no spaces should do, yes?

I think (Eli, correct me if I'm wrong!) that it would be quite a few extra steps to even just fish the titles out of the source links so we can process them. I'm inclined to say for this purpose it's not really worth it, and we can leave the remaining homonym disambiguation as an exercise for the user. Willing to be outvoted, though!
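The "no spaces" rule proposed above can be sketched in a few lines. This is only an illustration, not the connector's actual code: the function names and the sample names are made up, and it assumes each name arrives as a plain canonical string.

```python
def is_monomial(name: str) -> bool:
    """True when the name is a single token, i.e. has no internal whitespace."""
    return len(name.split()) == 1


def drop_monomials(names):
    """Keep only names made of two or more whitespace-separated tokens."""
    return [n for n in names if len(n.split()) > 1]


# Hypothetical sample; real input would be the canonical names from the BHL resource.
sample = ["Gadus morhua", "Gadus", "Felidae", "Panthera leo persica", "  Aves  "]
kept = drop_monomials(sample)
print(kept)
```

Splitting on whitespace rather than testing for a literal space means stray leading or trailing blanks do not let a monomial slip through.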

KatjaSchulz commented 2 months ago

I think it's fine to publish a first stab at this and to improve things as resources permit. I can do some homonym diagnostics on the resource once it is published, so we can get an idea of the scale of the problem and can think about ways to mitigate it.

eliagbayani commented 2 months ago

> The BHL resource looks good to me. I will be interested to see how well the names map with only their canonicals. @KatjaSchulz any general purpose precautions you want to take? I think we're currently including all names they made available. I wonder if we ought to filter out high-risk names: monomials, maybe? Or monomials without a family-indicating suffix?
>
> The count, it turns out, is the number of pages on which a name is mentioned, so it may include multiple pages in the same document. That metric is fine with me, but do we want to tweak the definition of this measurementType to make this clearer? I should adjust that definition anyway because it implies that we're aggregating, e.g., species into the counts for their genus, etc. Can you confirm, Eli, that's not true in this case? My first test or two suggest not, and it doesn't seem like it would be practical.
>
> Currently: "number of references to taxa in this clade in the Biodiversity Heritage Library"

@jhammock

Yes, the total count for a genus is NOT the sum of the total counts of its children.
Yes, the count is the number of pages on which a name is mentioned, so it may include multiple pages in the same document.
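To make the two confirmations concrete, here is a small sketch of the metric as described: distinct pages per name string, with no roll-up from species to genus. The input shape (pairs of name and page identifier) and the page identifiers are assumptions for illustration; the actual BHL data may be structured differently.

```python
from collections import defaultdict


def page_counts(mentions):
    """Count distinct pages per name; `mentions` is an iterable of (name, page_id)."""
    pages = defaultdict(set)
    for name, page_id in mentions:
        pages[name].add(page_id)
    return {name: len(ids) for name, ids in pages.items()}


# Hypothetical mentions: two documents for the species, one for the genus.
mentions = [
    ("Gadus morhua", "doc1-p10"),
    ("Gadus morhua", "doc1-p11"),
    ("Gadus morhua", "doc1-p11"),  # second mention on the same page, counted once
    ("Gadus morhua", "doc2-p3"),
    ("Gadus", "doc3-p7"),
]
counts = page_counts(mentions)
print(counts)
```

In this toy data the species gets 3 (two pages in one document plus one in another), while the genus gets only its own 1 page rather than 4, since nothing is aggregated upward.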

eliagbayani commented 2 months ago

> Let's ditch all the monomials, then. Any string containing no spaces should do, yes?
>
> I think (Eli, correct me if I'm wrong!) that it would be quite a few extra steps to even just fish the titles out of the source links so we can process them. I'm inclined to say for this purpose it's not really worth it, and we can leave the remaining homonym disambiguation as an exercise for the user. Willing to be outvoted, though!

@jhammock

It should be possible, but I have to check whether there is a dump that can be used to work out which titles/documents a name is used in. Doing this via the API (I will check if there is one) would take too many calls.

jhammock commented 2 months ago

Cool; let's follow Katja's approach and take a first stab with just filtering out the monomials, but if you can find a download with the title info, that'll be worth examining for the next level of homonym hunting.

KatjaSchulz commented 2 months ago

@eliagbayani Once we have the resource, I can pull out the names that are known homonyms and then get the title data for this much smaller sample via the API. I'll let you know if I need your help with that.

eliagbayani commented 2 months ago

BHL

Monomials removed. Now on OpenData for review. {"MoF.tab":3470737, "occurrence.tab":3470737, "taxon.tab":3470737}
Currently exploring whether we can get the titles in which a name is mentioned.

Thanks.

eliagbayani commented 2 months ago

@KatjaSchulz BHL: there is no file dump for getting the titles where a scientificName is mentioned, but there is an API call for it: https://www.biodiversitylibrary.org/api3?op=GetNameMetadata&name=gadus+morhua&format=xml&apikey=deabdd14-65fb-4cde-8c36-93dc2a5de1d8 Maybe we can just run this API call for the names suspected of having issues.
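For a short list of suspect names, the request URLs could be generated along these lines. This is a sketch that only builds URLs in the same shape as the one above (`op`, `name`, `format`, `apikey`); it makes no network calls, `YOUR_KEY` is a placeholder, and the helper name and the list of names are hypothetical.

```python
from urllib.parse import urlencode

BHL_API = "https://www.biodiversitylibrary.org/api3"


def name_metadata_url(name, api_key, fmt="xml"):
    """Build a GetNameMetadata request URL for one scientific name."""
    query = urlencode(
        {"op": "GetNameMetadata", "name": name, "format": fmt, "apikey": api_key}
    )
    return f"{BHL_API}?{query}"


# Hypothetical shortlist; in practice this would be the known homonyms only.
suspect_names = ["gadus morhua"]
urls = [name_metadata_url(n, "YOUR_KEY") for n in suspect_names]
print(urls[0])
```

`urlencode` encodes the space in the name as `+`, matching the form of the example URL, and restricting the list to suspect names keeps the number of calls small.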

KatjaSchulz commented 2 months ago

Yes, I think that's a good plan.

eliagbayani commented 2 months ago

NCBI: finally, DwCA generated and on OpenData for harvesting. {"MoF.tab":688905, "occurrence.tab":688905, "taxon.tab":706834}

jhammock commented 2 months ago

The NCBI resource looks lovely to me. What a nicely structured classification! ;)

eliagbayani commented 2 months ago

> The NCBI resource looks lovely to me. What a nicely structured classification! ;)

Yes, you need to link together two separate dump files (names, nodes) from their downloads to build the classification.
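Roughly, the linking works like this. The sketch below is not the connector's code: it assumes the standard NCBI taxdump layout (fields separated by tab-pipe-tab, rows ending in tab-pipe; nodes rows start with tax_id, parent tax_id, rank; names rows are tax_id, name, unique name, name class), and the taxa and IDs in the sample rows are made up and trimmed to the columns used.

```python
def row(*fields):
    """Build one taxdump-style row for the toy sample below."""
    return "\t|\t".join(fields) + "\t|\n"


def parse_dmp(lines):
    """Yield the field list for each taxdump row."""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]
        yield line.split("\t|\t")


def build_classification(nodes_lines, names_lines):
    """Join nodes (parent links, ranks) with names (scientific names only)."""
    parent, rank = {}, {}
    for f in parse_dmp(nodes_lines):
        parent[f[0]], rank[f[0]] = f[1], f[2]
    sci = {f[0]: f[1] for f in parse_dmp(names_lines) if f[3] == "scientific name"}
    return parent, rank, sci


def lineage(tax_id, parent, sci):
    """Walk parent links up to the root (whose parent is itself), root first."""
    path = []
    while True:
        path.append(sci.get(tax_id, tax_id))
        nxt = parent.get(tax_id)
        if nxt is None or nxt == tax_id:
            break
        tax_id = nxt
    return path[::-1]


# Made-up sample rows standing in for nodes.dmp and names.dmp.
nodes_lines = [row("1", "1", "no rank"), row("10", "1", "family"),
               row("20", "10", "genus"), row("30", "20", "species")]
names_lines = [row("1", "root", "", "scientific name"),
               row("10", "Examplidae", "", "scientific name"),
               row("20", "Exemplum", "", "scientific name"),
               row("30", "Exemplum primum", "", "scientific name"),
               row("30", "first example", "", "common name")]

parent, rank, sci = build_classification(nodes_lines, names_lines)
print(lineage("30", parent, sci))
```

Keeping only rows whose name class is "scientific name" gives one label per node; synonyms and common names in the names file are ignored here.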

EOL / ContentImport
