EOL / ContentImport

A placeholder for DATA tickets every time Jira is unavailable.

Data Hubs data coverage: NCBI, GGBN, GBIF, BHL, BOLDS, iNat #6

Open jhammock opened 2 months ago

jhammock commented 2 months ago

The resource is at https://opendata.eol.org/dataset/metrics/resource/713e1aa9-c0d5-4a73-9cb9-bffc86f94a4d

I think it's been static since 2015, so the services to consult may be very different now.

The hubs to query still seem reasonable to me, though: NCBI, GGBN, GBIF, BHL, and BOLD. @KatjaSchulz does anything else come to mind? Should we include iNat? The other major sources of GBIF data- collections- are at least represented on their own through the "type specimen repository" resources. TreatmentBank? Maybe not! We do need to draw the line somewhere...

Finally, we did this at the family level originally. Is it time to consider going more granular?

Eli, this should probably go on a schedule once it's updated. I'd say quarterly or monthly, depending on how heavy it is to run.

KatjaSchulz commented 2 months ago

Some of iNat is included in GBIF, but it would probably be interesting to see the total number of iNat observations for a taxon, at least all the verifiable ones. The observations/histogram API may be a good way to get these, see https://api.inaturalist.org/v1/docs/#!/Observations/get_observations_histogram

For example, this query gets you the observation counts for the family Muridae broken down by month of year: https://api.inaturalist.org/v1/observations/histogram?taxon_is_active=true&verifiable=true&taxon_id=44185&date_field=observed&interval=month_of_year

If you add up the months, you get the all time count for verifiable observations. It's possible there are better ways to do this. We'd have to poke around.
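The "add up the months" step can be sketched as below. This is only an illustration: it assumes the histogram response nests its buckets under `results` -> `month_of_year`, and the numbers in `sample` are made up rather than real Muridae counts.

```python
def all_time_count(histogram_response: dict) -> int:
    """Sum the twelve month_of_year buckets of a histogram response."""
    # Assumed response shape: {"results": {"month_of_year": {"1": n, ..., "12": n}}}
    buckets = histogram_response["results"]["month_of_year"]
    return sum(buckets.values())


# Made-up stand-in for a parsed JSON response (not real iNat data).
sample = {
    "total_results": 12,
    "results": {"month_of_year": {str(month): 10 * month for month in range(1, 13)}},
}

total = all_time_count(sample)
print(total)
```

Working on the parsed JSON rather than fetching inside the function keeps the arithmetic separate from the network call, so it can be checked offline.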

KatjaSchulz commented 2 months ago

We could go more granular, the question is whether it's worth it? How easy would it be to adapt our current process to include genera, for example?

eliagbayani commented 2 months ago

Background: The resource DwCA was last generated Aug 4, 2018. Here are the current measurementTypes we're getting from the respective databases. We used either an API service or a (scraped) webpage service, whichever was available.

BOLDS
"http://eol.org/schema/terms/NumberRecordsInBOLD"
"http://eol.org/schema/terms/RecordInBOLD" (boolean)
"http://eol.org/schema/terms/NumberPublicRecordsInBOLD"

BHL
"http://eol.org/schema/terms/NumberReferencesInBHL"
"http://eol.org/schema/terms/ReferenceInBHL" (boolean)

GBIF
"http://eol.org/schema/terms/NumberRecordsInGBIF"
"http://eol.org/schema/terms/RecordInGBIF" (boolean)

GGBN
"http://eol.org/schema/terms/NumberDNARecordsInGGBN"
"http://eol.org/schema/terms/NumberSpecimensInGGBN"
"http://eol.org/schema/terms/SpecimensInGGBN" (boolean)

NCBI
"http://eol.org/schema/terms/NumberOfSequencesInGenBank"
"http://eol.org/schema/terms/SequenceInGenBank" (boolean)

Next:

And yes, we can add more databases and more measurementTypes if available (e.g. iNat, as mentioned). And yes, we can be more granular (e.g. genera) if the databases provide service(s) for it, or if we are willing to massage/assemble parts to get the numbers we want. I'm almost sure we can go to species level for "NumberRecordsInGBIF"; just a thought. I will update as I continue. Thanks.

eliagbayani commented 2 months ago

For GGBN we're only getting "DNA" and "specimen" counts, but there is now also "tissue", e.g. https://data.ggbn.org/ggbn_portal/api/search?getSampletype&name=Ophiuridae https://data.ggbn.org/ggbn_portal/api/search?getSampletype&name=Asteriidae

Tissue has big coverage as well, see here. "DNA", "specimen" and "tissue" have the big numbers.

The genus level is not cumulative, not reliable. https://data.ggbn.org/ggbn_portal/api/search?getSampletype&name=Panthera The species level is more promising. https://data.ggbn.org/ggbn_portal/api/search?getSampletype&name=Panthera%20leo Thanks.

jhammock commented 2 months ago

I'm inclined to simplify our data types to just one per source, removing the boolean, which seems redundant, and... counting only the public records in BOLD, I think.

I'd quite like to do this at the species level if you think it's practical, @eliagbayani . That may vary among data hubs, of course. If we want to get fancy, we could roll up our own counts to family level, but frankly, a user could do that better on our data search page, so I'm not sure we even need to.

@KatjaSchulz thoughts on any of that?

KatjaSchulz commented 2 months ago

Sounds good to me.

eliagbayani commented 2 months ago

What is the measurementType to use for the total annual observation count mentioned here for iNat ?

KatjaSchulz commented 2 months ago

We'll have to create a new one. How about something like http://eol.org/schema/terms/NumberOfiNaturalistObservations?

@jhammock ?

jhammock commented 2 months ago

Sounds good to me! Will you do the honors, @KatjaSchulz ?

KatjaSchulz commented 2 months ago

Yes, I have some new terms I want to add for the nematode stuff anyway.

KatjaSchulz commented 2 months ago

The terms doc should be updated now, but Jeremy cautions that admin workflows are still a work in progress with the new code, so please let him know if there are any terms-related issues.

jhammock commented 2 months ago

Using archives rather than APIs sounds like a good idea; I expect many of them do publish snapshots of some kind.

eliagbayani commented 2 months ago

Update for iNat:

So we may need another measurement aside from http://eol.org/schema/terms/NumberOfiNaturalistObservations, perhaps one for their research-grade observation count? Thanks.

jhammock commented 2 months ago

I favor providing counts only of the research grade observations, but I defer to @KatjaSchulz, since she knows the community. I was also going to say that providing the genus and family level counts feels redundant, but on second thought, those counts would be good content for the genus and family pages, providing a sense of how... "familiar" a group is. Katja, what do you think?

KatjaSchulz commented 2 months ago

I could go either way. If it's easier to use the GBIF export, we can do just the research grade observations. This means we won't get records for rare species that have observations with only a single ID. This often happens in groups that have few specialist identifiers. Many of my fly observations are not research grade, because I am the only one on iNat who knows how to identify that species. On the other hand, we avoid counting a bunch of misidentifications, so it's a bit of a trade-off. Either way, we could use the http://eol.org/schema/terms/NumberOfiNaturalistObservations URI and adjust the definition to what we are actually counting.

jhammock commented 2 months ago

FTR, I have no objection to keeping each data hub in a separate resource. Whichever you prefer, @eliagbayani

eliagbayani commented 2 months ago

Update: We now have an iNaturalist data coverage resource. OpenData This is research-grade observation counts for species, genus and family level taxa. Thu 2024-05-02 {"MoF.tab":481602, "occurrence.tab":481602, "taxon.tab":481602} Thanks.

jhammock commented 2 months ago

The number of taxa and the range of values look about as I expected, and the source links look like the form we want. @KatjaSchulz anything amiss? Any other fields we ought to include?

KatjaSchulz commented 2 months ago

Unfortunately, these data are all significant undercounts of research grade observations. Sorry, I actually knew this but didn't think about it when you mentioned getting the data from the iNat for GBIF DwC-A. iNat only includes research grade observations in that archive where the observation data are released under a GBIF-compatible license, and apparently there are quite a few observations with incompatible licenses, for example:

| taxon | Eli's resource | research grade obs | link |
| --- | --- | --- | --- |
| Ophion | 987 | 1,334 | Ophion RG obs |
| Homalonychus | 345 | 481 | Homalonychus RG obs |
| Taricha | 66,061 | 78,614 | Taricha RG obs |
| Euphorbia | 222,132 | 306,993 | Euphorbia RG obs |

So I don't think we can get good data from the GBIF dump. Is getting the data from the API feasible?
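To put a number on the undercount, here is a quick sketch that computes, for the four example taxa above, what share of the research grade observations made it into the GBIF-derived resource. The function name `coverage` is just an illustrative helper; the counts are the ones quoted in this comment.

```python
# (taxon, count in the GBIF-derived resource, research grade count on iNat)
rows = [
    ("Ophion", 987, 1334),
    ("Homalonychus", 345, 481),
    ("Taricha", 66061, 78614),
    ("Euphorbia", 222132, 306993),
]


def coverage(resource_count: int, rg_count: int) -> float:
    """Fraction of research grade observations present in the resource."""
    return resource_count / rg_count


for taxon, in_resource, rg in rows:
    missing = rg - in_resource
    print(f"{taxon}: {coverage(in_resource, rg):.1%} covered, {missing} missing")
```

For these four examples the resource holds roughly 72% to 84% of the research grade observations.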

jhammock commented 2 months ago

I have no strong feeling about whether we should count all observations, research grade observations, or open access research grade observations, though I would prefer to count the same thing at each taxonomic rank. I defer to @KatjaSchulz 's preference (as well as what is practical).

eliagbayani commented 2 months ago

@KatjaSchulz, my first choice was also the API. But I was advised very early on by Ken-ichi to use the dumps instead, because I was making too many API calls. Ophion, e.g.: https://api.inaturalist.org/v1/observations?quality_grade=research&taxon_id=47993

KatjaSchulz commented 2 months ago

Hm, I guess their API is not designed to download large amounts of data. Let me look at their API documentation again to see if there is a way to ask for number of obs or RG obs without doing gazillions of individual calls. I didn't see any when I looked last time. If I don't find anything, I can also ask in the iNat forum. I really don't think the numbers we get from the GBIF archive are good enough. It's one thing if data are a little out of date, but if we start out with arbitrarily biased & incomplete data, that seriously decreases the usefulness of having those data.

jhammock commented 2 months ago

I'm happy to wait, but I wouldn't call the data arbitrarily biased or incomplete; I'd call it the number of open access research grade records. If it's what GBIF wants, it could be what some of our users want...


KatjaSchulz commented 2 months ago

Here's something that may work. Check out this query: https://api.inaturalist.org/v1/observations/species_counts?taxon_is_active=true&hrank=species&lrank=species&iconic_taxa=Amphibia&quality_grade=research

This has counts for the research grade observations of all species in Amphibia. Since I am just playing, I haven't tried what happens if you don't specify Amphibia and just ask for counts for all species. I bet there's a way to get the data we need from this API method. Either by running the full query or by chopping it up somehow, e.g., running separate queries for these iconic_taxa: Actinopterygii, Amphibia, Arachnida, Aves, Chromista, Fungi, Insecta (will be huge), Mammalia, Mollusca, Reptilia, Plantae (will be huge), Protozoa, unknown. If the Insecta & Plantae queries don't go through, we could try chopping them up using the d1 & d2 parameters. More info here: https://api.inaturalist.org/v1/docs/#!/Observations/get_observations_species_counts

Once we have put together the initial data set, we could pare things down and only ask for newly added observations when the resource is reharvested.

Also, the query above gives us all the RG observations. To also get the metric for verifiable observations, we could ask for a second dump with quality_grade=needs_id and then add the research + needs_id counts.
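A rough sketch of the "chop it up by iconic_taxa" idea: it only builds the query URLs, one per iconic taxon and quality grade, and does not call the API. The parameter names are the ones from the query above plus `page`; the helper name `species_counts_url` is made up for illustration.

```python
from urllib.parse import urlencode

BASE = "https://api.inaturalist.org/v1/observations/species_counts"

# The iconic_taxa values listed above.
ICONIC_TAXA = [
    "Actinopterygii", "Amphibia", "Arachnida", "Aves", "Chromista", "Fungi",
    "Insecta", "Mammalia", "Mollusca", "Reptilia", "Plantae", "Protozoa",
    "unknown",
]


def species_counts_url(iconic_taxon: str, quality_grade: str = "research", page: int = 1) -> str:
    """Build one species_counts query restricted to species-rank taxa."""
    params = {
        "taxon_is_active": "true",
        "hrank": "species",
        "lrank": "species",
        "iconic_taxa": iconic_taxon,
        "quality_grade": quality_grade,
        "page": page,
    }
    return f"{BASE}?{urlencode(params)}"


# One first-page URL per (grade, iconic taxon) pair.
urls = [
    species_counts_url(taxon, grade)
    for grade in ("research", "needs_id")
    for taxon in ICONIC_TAXA
]
print(len(urls))
```

If Insecta or Plantae turn out to be too big, the same helper could take extra `d1`/`d2` date parameters, though I haven't tried that.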

Could you have a look at this and let me know what you think?

eliagbayani commented 2 months ago

@KatjaSchulz Hi Katja, using your query, there was not much to change except for moving from page to page. I generated 3 files, one for each quality_grade ("research", "needs_id", "casual"). I added "casual" since I was already there and might as well generate a 3rd dump. Please extract: datahub_inat.zip. If these are good enough, which one should we use to make the final DwCA?

I'm not sure I understood the adding of "research" + "needs_id" counts. Thanks.
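For reference, the "moving from page to page" part can be sketched like this. It is not the connector's actual code; it assumes each response carries `total_results` and a `results` list, and `fake_fetch` is a stand-in for the real HTTP call so the loop can be tried offline.

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every result, requesting pages until total_results is reached."""
    page = 1
    seen = 0
    while True:
        body = fetch_page(page)
        results = body.get("results", [])
        if not results:
            break  # empty page: nothing more to read
        yield from results
        seen += len(results)
        if seen >= body.get("total_results", 0):
            break
        page += 1


def fake_fetch(page: int) -> dict:
    """Stand-in for an API call: serves 5 made-up rows, 2 per page."""
    data = [{"count": n} for n in range(5)]
    per_page = 2
    start = (page - 1) * per_page
    return {
        "total_results": len(data),
        "page": page,
        "per_page": per_page,
        "results": data[start:start + per_page],
    }


rows = list(iter_results(fake_fetch))
print(len(rows))
```

In a real run, `fetch_page` would wrap the HTTP request and should probably sleep between calls to stay within iNat's rate limits.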

KatjaSchulz commented 2 months ago

Hi Eli,

This looks great! The numbers don't add up exactly. For example, right now, iNat thinks it has 387,165 species with at least 1 research grade observation, but we only get 372,091 in the API response. But I think we don't have to worry too much about this discrepancy. We've got the vast majority of species.

Scratch the adding of "research" + "needs_id" counts. I suggest that we use two measurement types for iNat:

http://eol.org/schema/terms/NumberOfRGiNatObservations - for research grade observations, from the datahub_inat_grade_research.txt doc

http://eol.org/schema/terms/NumberOfiNaturalistObservations - for total observations, adding the counts from all three grades together, i.e., research + needs_id + casual

That's more straightforward. I'll add the second measurement type to the terms doc. Let me know if you have any questions.
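As a sketch of the second measurement type, assuming each of the three dumps has been read into a mapping from taxon ID to the count for that grade (the IDs and numbers below are made up):

```python
from collections import Counter


def total_observations(research: dict, needs_id: dict, casual: dict) -> dict:
    """Add the per-grade counts together for every taxon ID."""
    total = Counter()
    for per_grade in (research, needs_id, casual):
        total.update(per_grade)  # Counter.update adds counts, it does not overwrite
    return dict(total)


# Hypothetical taxon IDs and counts, for illustration only.
research = {101: 10, 102: 3}
needs_id = {101: 2}
casual = {103: 1}

totals = total_observations(research, needs_id, casual)
print(totals)
```

A taxon that is missing from one dump simply contributes nothing for that grade.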

eliagbayani commented 2 months ago

Thanks Katja. Will proceed with these 2 measurement types.

By the way, I've only included species-level taxa in the dumps I submitted. I will include family and genus level taxa in our final version.

eliagbayani commented 2 months ago

@KatjaSchulz DwCA generated, for review. OpenData {"MoF.tab":897218, "occurrences.tab":897218, "taxon.tab":560902} Has both measurement types: http://eol.org/schema/terms/NumberOfRGiNatObservations http://eol.org/schema/terms/NumberOfiNaturalistObservations

Taxa include: species, genus and family. Thanks.

KatjaSchulz commented 2 months ago

Hi Eli,

Something's not adding up. For example, for the species Megalodacne fasciata, we have the following data:

NumberOfRGiNatObservations 1531
NumberOfiNaturalistObservations 4594

However, that doesn't jibe with what's on the website, where I get 1500 research grade observations and 1531 total observations.

When I use the species_counts API and ask for research grade observations, I get the correct numbers (actually off by 1, but close enough): https://api.inaturalist.org/v1/observations/species_counts?taxon_id=83088&quality_grade=research

It looks like the "count":1501 variable gives me what I asked for (number of RG obs) and it also has the total number of obs in the response: "observations_count":1532.

When I ask for needs_id obs, I get "count": 11 (which is correct) & "observations_count":1532 https://api.inaturalist.org/v1/observations/species_counts?taxon_id=83088&quality_grade=needs_id

And when I ask for casual, I get "count": 20 (which is correct) & "observations_count":1532 https://api.inaturalist.org/v1/observations/species_counts?taxon_id=83088&quality_grade=casual

Based on these observations, we should be able to simplify the process and get all the data we need just from the research grade query using the "count" variable for NumberOfRGiNatObservations and the "observations_count" variable for NumberOfiNaturalistObservations.
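Here is a small sketch of that simplification: one research grade result yields both measurements. I am assuming `observations_count` sits inside the `taxon` object of each result; the sample below uses the Megalodacne fasciata numbers quoted above and only the fields needed here.

```python
RG_TERM = "http://eol.org/schema/terms/NumberOfRGiNatObservations"
TOTAL_TERM = "http://eol.org/schema/terms/NumberOfiNaturalistObservations"


def measurements_from_result(result: dict) -> dict:
    """Map one species_counts result (quality_grade=research) to both terms."""
    return {
        RG_TERM: result["count"],  # observations matching the query filter
        TOTAL_TERM: result["taxon"]["observations_count"],  # assumed location of the total
    }


# Trimmed stand-in for one entry of the "results" list.
sample_result = {
    "count": 1501,
    "taxon": {"id": 83088, "name": "Megalodacne fasciata", "observations_count": 1532},
}

print(measurements_from_result(sample_result))
```

Using the filtered `count` for the research grade term and the unfiltered `observations_count` for the total keeps the two values from being mixed up.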

It looks like this will work for the species, but the API won't give you any data on RG obs for genera and families. So we'll just have to make do with NumberOfiNaturalistObservations for the higher ranks. I think the easiest way to get them is to do a rank-based query like this: https://api.inaturalist.org/v1/observations/species_counts?rank=family and then get the NumberOfiNaturalistObservations from "observations_count". I'm not sure what exactly the "count" variable represents in this case. It's lower, so there must be some kind of filter in place that's not user-specified. In any case, the "observations_count" value is closer to the value you get on the website, although it's not exactly the same.

Does all of this make sense to you?

eliagbayani commented 2 months ago

@KatjaSchulz Take 2: DwCA updated, for review. OpenData {"MoF.tab":811041, "occurrences.tab":811041, "taxon.tab":485171} Note: This is generated today, fresh API calls. Slight difference between web interface numbers and API numbers. Thanks.

KatjaSchulz commented 2 months ago

Yes, we got it now. The numbers can change by the minute, especially for commonly observed taxa, but I think this is as close as we can get. It may be a good idea to provide the date of data capture in the metadata. @jhammock Do we have a metadata field that would be good for that?

eliagbayani commented 2 months ago

Update on the next database, GBIF: I'm using their download service with the SPECIES_LIST download format. It gives a CSV file with all the taxa in GBIF and their corresponding numberOfOccurrences.

The numberOfOccurrences value for species-level taxa is consistent with the API call we originally used for family and genus, e.g. http://api.gbif.org/v1/occurrence/count?taxonKey=8084280

For higher-level taxa (family, genus), the numberOfOccurrences is unreliable, so I will stick with the original API for the count, e.g. http://api.gbif.org/v1/occurrence/count?taxonKey=3701

eliagbayani commented 2 months ago

Update: GBIF data coverage resource. DwCA has finally been generated. For review. OpenData {"MoF.tab":2720477, "occurrence.tab":2720477, "taxon.tab":2725539}

measurement type = http://eol.org/schema/terms/NumberRecordsInGBIF

This includes taxa with the following ranks:
"species", "form", "variety", "subspecies", "unranked" -> counts from CSV dump
"family", "genus" -> counts from API call
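The rank split above could be expressed roughly as follows. The helper names are invented for illustration, and the lowercasing is an assumption to cope with however the CSV spells its ranks; the count URL is the same endpoint used in the earlier examples.

```python
from urllib.parse import urlencode

CSV_RANKS = {"species", "form", "variety", "subspecies", "unranked"}
API_RANKS = {"family", "genus"}


def count_source(rank: str) -> str:
    """Say where the occurrence count for a taxon of this rank comes from."""
    rank = rank.lower()
    if rank in CSV_RANKS:
        return "csv"  # numberOfOccurrences from the SPECIES_LIST download
    if rank in API_RANKS:
        return "api"  # occurrence/count call per taxonKey
    return "skip"  # other ranks are not part of the resource


def occurrence_count_url(taxon_key: int) -> str:
    """Build the per-taxon count query for family and genus taxa."""
    return "http://api.gbif.org/v1/occurrence/count?" + urlencode({"taxonKey": taxon_key})


print(count_source("genus"), occurrence_count_url(3701))
```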

jhammock commented 2 months ago

Thanks, @eliagbayani !

@KatjaSchulz do you think any revisions are needed to the definitions of these terms as we add the lower rank taxa? This one is " number of records for taxa in this clade in the Global Biodiversity Information Facility (GBIF) ". It seems ok to me; "clade" is technically correct for species...

The resource format looks good to me. I wonder if we should filter out the unranked taxa. The sample I looked at included mostly "sp.", "cf.", etc.

KatjaSchulz commented 2 months ago

If we were starting from scratch, I would probably argue for using "group" instead of "clade", because not all of our taxa are actually clades. Many families and genera are known to be not monophyletic, so referring to them as clades is not correct, strictly speaking. But I don't feel very strongly about this and I think it's ok to leave it as is. If you also prefer "group" we can change it the next time one of us gets into the terms doc. And yes, species are clades/groups, so there's no change needed due to the inclusion of species.

jhammock commented 2 months ago

Group sounds good; next time I have the file open I'll check on all the similar terms. Eli, let's exclude the unranked taxa; it looks like we want almost none of them.

eliagbayani commented 2 months ago

@jhammock GBIF resource updated. OpenData Excluded taxa with taxonRank equal to 'unranked' and their corresponding MoF records. {"MoF.tab":1901538, "occurrence.tab":1901538, "taxon.tab":1905511} Thank you.
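For anyone curious, the exclusion amounts to something like the sketch below. It assumes the three files are linked the usual way (MoF rows point to an occurrenceID, occurrence rows point to a taxonID) and that rows have been read into dicts; the sample rows are made up, and this is not the connector's actual code.

```python
def drop_unranked(taxa: list, occurrences: list, mof: list) -> tuple:
    """Remove 'unranked' taxa plus the occurrence and MoF rows that hang off them."""
    kept_taxa = [t for t in taxa if t["taxonRank"].lower() != "unranked"]
    kept_taxon_ids = {t["taxonID"] for t in kept_taxa}
    kept_occ = [o for o in occurrences if o["taxonID"] in kept_taxon_ids]
    kept_occ_ids = {o["occurrenceID"] for o in kept_occ}
    kept_mof = [m for m in mof if m["occurrenceID"] in kept_occ_ids]
    return kept_taxa, kept_occ, kept_mof


# Made-up rows for illustration.
taxa = [
    {"taxonID": "t1", "taxonRank": "species"},
    {"taxonID": "t2", "taxonRank": "unranked"},
]
occurrences = [
    {"occurrenceID": "o1", "taxonID": "t1"},
    {"occurrenceID": "o2", "taxonID": "t2"},
]
mof = [
    {"occurrenceID": "o1", "measurementValue": 42},
    {"occurrenceID": "o2", "measurementValue": 7},
]

kept_taxa, kept_occ, kept_mof = drop_unranked(taxa, occurrences, mof)
print(len(kept_taxa), len(kept_occ), len(kept_mof))
```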

jhammock commented 2 months ago

the GBIF resource should be good to go now, I think. I don't know what your current harvest conditions are, Eli. I won't call these urgent if it's not a good time :)

eliagbayani commented 1 month ago

@jhammock BOLDS

jhammock commented 1 month ago

That looks grand! No unusual ranks, source links make sense to me.

eliagbayani commented 1 month ago

First of the six databases: iNat is now published. https://www.eol.org/pages/163693/data?resource_id=1177 https://www.eol.org/pages/1045608/data?resource_id=1177

@JRice , the two consecutive republish works. Thanks.

jhammock commented 1 month ago

The records look good to me! 522k records (in the publishing resource page) squares with the data search result. The resource file appears to contain 811k MoF records, though.

eliagbayani commented 1 month ago

This is an example of taxon that didn't get published: Rhipidura dedemi No distinguishing criteria from resource file why it was excluded. It should have entries for:

Some warnings were reported in the logs.

eliagbayani commented 1 month ago

2nd of the six databases: GBIF is now published. https://eol.org/pages/1051117/data?resource_id=1178 https://eol.org/pages/328450/data?resource_id=1178 https://eol.org/pages/46564415/data?resource_id=1178

Weird numbers though in publishing page since DwCA only has

jhammock commented 1 month ago

Those look good to me! I'm not worried about the publish page. Those counts are known to sometimes exaggerate by ~x10.

eliagbayani commented 1 month ago

@jhammock BOLDS

  • Finally generated DwCA OpenData For review. {"MoF.tab":527491, "occurrences.tab":527491, "taxon.tab":622866} Only includes species-level taxa for now. Genus and family levels to follow.

BOLDS Resource now has genus and family levels in addition to the original species level taxa. {"MoF.tab":513111, "occurrence.tab":513111, "taxon.tab":524701} OpenData

jhammock commented 1 month ago

Looks grand!

Let's keep an eye on how this one maps when it's harvested:

1081891 440900 [myrmecia] bisecta species

It's the only one so formatted, and I don't want to load down the connector with a lot of format manipulations if it's not needed. If it comes in as a child of https://eol.org/pages/50788, I'll be proud of our names matching code and say let BOLD have diverse canonical formats if they like.

jhammock commented 1 month ago

FTR, I don't object to ditching counts and using boolean+link for NCBI species level. We've been successful with our granular ambitions so far but we haven't promised anyone anything!

jhammock commented 1 month ago

The BHL resource looks good to me. I will be interested to see how well the names map with only their canonicals. @KatjaSchulz any general purpose precautions you want to take? I think we're currently including all names they made available. I wonder if we ought to filter out high risk names: monomials, maybe? Or monomials without a family-indicating suffix?

The count, it turns out, is the number of pages on which a name is mentioned, so it may include multiple pages in the same document. That metric is fine with me, but do we want to tweak the definition of this measurementType to make this clearer? I should adjust that definition anyway because it implies that we're aggregating, e.g., species into the counts for their genus, etc. Can you confirm, Eli, that's not true in this case? My first test or two suggest not, and it doesn't seem like it would be practical.

Currently: "number of references to taxa in this clade in the Biodiversity Heritage Library"
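On the idea of filtering out high risk names, a minimal sketch of the "monomials without a family-indicating suffix" rule might look like this. The suffix list is an assumption (zoological -idae and botanical -aceae only), and the function names are invented for illustration.

```python
# Assumed family-indicating endings; a real list may need more (e.g. other codes).
FAMILY_SUFFIXES = ("idae", "aceae")


def is_monomial(name: str) -> bool:
    """True for one-word names such as genera and families."""
    return len(name.split()) == 1


def is_high_risk(name: str) -> bool:
    """Flag monomials that do not end in a family-indicating suffix."""
    name = name.strip()
    return is_monomial(name) and not name.lower().endswith(FAMILY_SUFFIXES)


for candidate in ("Panthera", "Muridae", "Euphorbiaceae", "Panthera leo"):
    print(candidate, is_high_risk(candidate))
```

Under this rule genus names like Panthera would be flagged, while family names and binomials pass through.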

KatjaSchulz commented 1 month ago

I wonder how useful anything other than species is for BHL counts? We could do some quality control on species homonyms, e.g., by looking at the work in which the name is used. But if we include genera, the number of homonyms gets to be quite huge.