BiologicalRecordsCentre / iRecord

Repository to store and track enhancements, issues and tasks regarding the iRecord website.
http://irecord.org.uk

Investigate options for handling aggregate species names from iNaturalist #1222

Open kitenetter opened 2 years ago

kitenetter commented 2 years ago

Taxa that are identified as aggregates on iNat are being shown as segregates on iRecord, because iNat handles aggregates differently from UKSI. One example:

Notes from John: "The record example for Complex Mesapamea secalis is actually reported by the API as taxon name Mesapamea secalis, with rank set to Complex. It doesn’t have agg. at the end of the name which is the UKSI standard, nor does it have Complex in the name. We don’t check the rank as part of the name. If we check the rank, then we’ll need a mapping table to map Complex (which doesn’t exist in UKSI) to agg, and perhaps automatically tolerate the absence of agg. at the end of species names. But the whole thing of mapping correctly from the iNat taxonomy to UKSI is a whole can of worms so there might be quite a lot to do once we look into all the issues."
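A minimal sketch of the rank check John describes, assuming the importer receives the iNat taxon name and rank as plain strings. The `RANK_SUFFIX` table and the `uksi_style_name` function are hypothetical names for illustration, not part of the existing importer, and whether the resulting "agg." name actually exists in UKSI would still need checking.

```python
# Hypothetical lookup from iNat rank values to the suffix UKSI uses for that concept.
RANK_SUFFIX = {"complex": "agg."}


def uksi_style_name(inat_name: str, inat_rank: str) -> str:
    """Build the UKSI-style candidate name for an iNat taxon name and rank."""
    suffix = RANK_SUFFIX.get(inat_rank.strip().lower())
    if suffix is None or inat_name.endswith(suffix):
        # The rank needs no suffix, or the name already carries it.
        return inat_name
    return f"{inat_name} {suffix}"


print(uksi_style_name("Mesapamea secalis", "Complex"))
print(uksi_style_name("Mesapamea secalis", "species"))
```

The candidate name would then be looked up in UKSI in the usual way, so a rank with no UKSI equivalent simply falls through unchanged.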

We need to find a resolution to this for verifiers. Options:

kitenetter commented 1 year ago

Related to this is the issue of mismatches between iNat and UKSI taxonomy, e.g. the robberfly that is Machimus atricapillus in the UKSI is called Tolmerus atricapillus in iNat, and consequently the records never reach iRecord. A dictionary mapping table may allow both the name issue and the aggregate issue to be addressed.

johnvanbreda commented 1 year ago

@kitenetter I agree - some sort of dictionary mapping table is needed to resolve these issues.
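As a rough illustration of the dictionary mapping table idea, the sketch below keys each row on the iNat name plus rank so that one table can cover both renamed taxa and aggregates. `INAT_TO_UKSI` and `lookup_uksi_name` are hypothetical names, and the two rows are examples taken from this thread rather than agreed mappings.

```python
from typing import Optional

# Illustrative rows only; a real table would hold the agreed iNat-to-UKSI mappings.
INAT_TO_UKSI = {
    ("tolmerus atricapillus", "species"): "Machimus atricapillus",
    ("mesapamea secalis", "complex"): "Mesapamea secalis agg.",
}


def lookup_uksi_name(inat_name: str, inat_rank: str) -> Optional[str]:
    """Return the mapped UKSI name, or None when the table has no row for it."""
    key = (inat_name.strip().lower(), inat_rank.strip().lower())
    return INAT_TO_UKSI.get(key)


print(lookup_uksi_name("Tolmerus atricapillus", "species"))
print(lookup_uksi_name("Mesapamea secalis", "Complex"))
```

Returning `None` for unknown names leaves the caller free to fall back to the existing direct name match.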

kitenetter commented 1 year ago

@robin-hutchinson I think the starting point here is to get a list of the iNat taxon names that have been recorded in the UK, and then do a comparison to see which ones have no match in the UKSI. That should be straightforward for synonyms, but may be harder for aggregates.
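The comparison could be sketched as a simple set difference, assuming both name lists are available as plain strings. The function name `unmatched_names` and the sample lists are illustrative only; this catches names with no exact (case-insensitive) match and does nothing clever for aggregates.

```python
from typing import Iterable, List


def unmatched_names(inat_names: Iterable[str], uksi_names: Iterable[str]) -> List[str]:
    """Return the iNat names with no case-insensitive exact match in UKSI."""
    uksi = {name.strip().lower() for name in uksi_names}
    missing = {name.strip() for name in inat_names if name.strip().lower() not in uksi}
    return sorted(missing)


inat = ["Tolmerus atricapillus", "Rubus idaeus", "Mesapamea secalis"]
uksi = ["Machimus atricapillus", "Rubus idaeus", "Mesapamea secalis agg."]
print(unmatched_names(inat, uksi))
```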

If you can see a way of extracting a UK recorded taxon list from iNat itself that would be fine, otherwise please can you contact Sophie Ratcliff at NBN and ask if she can provide a table of records per taxon from their iNat download.

robin-hutchinson commented 1 year ago

The full list of species recorded in the UK is available to download from the checklist pages: https://www.inaturalist.org/check_lists/7190-United-Kingdom-Check-List?view=plain. I am just downloading it to try the comparison now.

robin-hutchinson commented 1 year ago

I've spoken to Giselle and Sophie as the checklist did not have all of the taxa (only species level identifications). I've now downloaded all observations from iNat through the administrator dashboard so I can generate this list, and Sophie has also said this:

The UKSI is now part of the GBIF taxonomic backbone and the NHM in London are looking at mapping TVKs with other taxonomies, like iNaturalist using the GBIF taxonomic backbone. It might be worth talking to Chris Raper as the NHM might have something in place soon to help.

robin-hutchinson commented 1 year ago

Aggregate Matches.csv iNat UKSI Edited Match.csv iNat UKSI No Match.csv

I've looked at the iNat tables and found the aggregates that fulfil the requirements to be shared with iRecord (research grade and CC-BY, CC0 or CC-BY-NC). I pulled out everything with a rank of "complex" and found the aggregate match if there is one - if there was not a match, I gave the species name and then listed the species within the iNat complex if this was under ten species. There are some cases where we might be able to record the species within the UK, as the other species within the complex are not present here (e.g. Austropotamobius pallipes complex).

I also exported a list of all of the iNat ids that exist within iRecord (regardless of verification status) to filter out names that are successfully moving to iRecord, and then tried to match them using the UKSI code. I've attached the taxa that have a match if the names are slightly edited (e.g. Deroceras invadens -> Deroceras (Deroceras) invadens, Amanita excelsa excelsa -> Amanita excelsa var. excelsa, Choerades marginata -> Choerades marginatus, Bothria subalpina -> Botria subalpina, Actenicerus siaelandicus -> Actenicerus sjaelandicus) or where the subspecies wasn't known, so it matched to the species instead (e.g. Rubus idaeus idaeus -> Rubus idaeus). It uses fuzzy matching up to two letters different to deal with spelling differences - this works when it is a list that I check for errors before using, but wouldn't work for the iNat-iRecord link as there isn't a manual checking stage. We might be able to resolve some of these issues using a similar process to the UKSI name-matching process in Indicia (https://uksi-sandbox.nhm.ac.uk/taxonmatch.php):

exact - there is an exact match on the name & authority with a NULL attribute
name - there is an exact match on the name only
subgenus - there is a match by either removing a bracketed subgenus in your name OR by allowing for a bracketed subgenus in your name
gender - there is a match by looking for a different gender ending to the species epithet. We do this by chopping off 2 characters from the end of your name and adding a wildcard
none - no match was found
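The match types listed above could be tried in order, roughly as in this sketch. It is an approximation written for this thread, not the code behind the UKSI tool: `match_name` and `strip_subgenus` are hypothetical names, the "NULL attribute" condition on exact matches is not modelled, and the UKSI list is a small in-memory stand-in with most authorities left blank.

```python
import re
from typing import List, Optional, Tuple

_SUBGENUS = re.compile(r"\s*\([^)]*\)")


def strip_subgenus(name: str) -> str:
    """Drop a bracketed subgenus, e.g. 'Helix (Helix) lucorum' -> 'Helix lucorum'."""
    return _SUBGENUS.sub("", name, count=1).strip()


def match_name(
    name: str, authority: Optional[str], uksi: List[Tuple[str, str]]
) -> Tuple[str, Optional[str]]:
    """Try each rule in turn and return (match type, matched UKSI name)."""
    names = [n for n, _ in uksi]
    if authority is not None and (name, authority) in uksi:
        return "exact", name
    if name in names:
        return "name", name
    # Compare with bracketed subgenera removed from both sides.
    bare = strip_subgenus(name)
    for candidate in names:
        if strip_subgenus(candidate) == bare:
            return "subgenus", candidate
    # Chop two characters and treat the remainder as a prefix wildcard.
    stem = name[:-2]
    for candidate in names:
        if stem and candidate.startswith(stem):
            return "gender", candidate
    return "none", None


uksi = [
    ("Rubus idaeus", "L."),
    ("Deroceras (Deroceras) invadens", ""),
    ("Choerades marginatus", ""),
    ("Machimus atricapillus", ""),
]
for query in ["Deroceras invadens", "Choerades marginata", "Tolmerus atricapillus"]:
    print(query, "->", match_name(query, None, uksi))
```

The prefix wildcard in the gender rule is deliberately loose, so short names could produce false matches; that is one reason a manual check or a curated mapping table is still needed.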

There is also a spreadsheet for the taxa that there wasn't a match for. I will go through this final spreadsheet to separate out species with unknown synonyms (e.g. Pseudochorthippus parallelus) from species that are not in iRecord, and will pull out the incorrect matches from the edited spreadsheet to see if I can find a better match for those as well.

We could also incorporate the rank columns in the UKSI and Indicia to make sure that we only try to match on the same rank level? This would help to distinguish between the two Bombus lucorum entries in the UKSI (species and sensu lato, with different TVKs).

johnvanbreda commented 1 year ago

Thanks @robin-hutchinson. I think that it probably is a useful thing to restrict matches on rank, as there are a few cases of duplicate names across different ranks that I've come across.
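Restricting matches on rank might look something like the sketch below. The rank labels and TVK values are placeholders invented for illustration (the real UKSI rank names and TVKs for the two Bombus lucorum entries are not given in this thread), and `match_on_rank` is a hypothetical helper.

```python
from typing import Dict, List

# Placeholder rows: rank labels and TVKs are NOT real UKSI values.
UKSI_ROWS = [
    {"name": "Bombus lucorum", "rank": "species", "tvk": "TVK-PLACEHOLDER-1"},
    {"name": "Bombus lucorum", "rank": "species sensu lato", "tvk": "TVK-PLACEHOLDER-2"},
]


def match_on_rank(name: str, rank: str, rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the rows whose name and rank both match."""
    return [row for row in rows if row["name"] == name and row["rank"] == rank]


print(match_on_rank("Bombus lucorum", "species", UKSI_ROWS))
```

Matching on name alone would return both rows here; adding the rank narrows it to one.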

From my perspective there are 2 tasks that need to come out of this issue - firstly to get a list of mappings from iNat names to UKSI names (or TVKs/organism keys) that is acceptable to the community. Your proposed name matching rules sound good from my perspective but I'm probably not the right person to decide what is acceptable. The 2nd task will be to then integrate this list into the existing iNat importer and use it to ensure that we are able to import all records rather than a sub-selection. It may then be necessary to re-import the data (at least unverified records) to ensure we pick up the missing stuff.

For the Saint Helena iNat link, I have been thinking that I should add code to the sync which catches unmapped names and adds them to a sandbox list, so the records are still importable and the sandbox list can be merged into the main list later. This might be useful here after the work on the mappings is done, as the iNat taxonomy may change in future so we'd want to be able to resolve future mapping requirements.
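The sandbox idea could be sketched as follows, assuming the mappings are held as a simple dictionary and the sandbox as a list. `resolve_taxon` is a hypothetical name and this is not the actual sync code; it only shows the control flow of catching an unmapped name so the record stays importable.

```python
from typing import Dict, List, Tuple


def resolve_taxon(
    inat_name: str, mappings: Dict[str, str], sandbox: List[str]
) -> Tuple[str, str]:
    """Return (list used, name to import against), queuing unmapped names."""
    mapped = mappings.get(inat_name)
    if mapped is not None:
        return "main", mapped
    # Unmapped: remember the name once so it can be reviewed and merged later.
    if inat_name not in sandbox:
        sandbox.append(inat_name)
    return "sandbox", inat_name


mappings = {"Tolmerus atricapillus": "Machimus atricapillus"}
sandbox: List[str] = []
print(resolve_taxon("Tolmerus atricapillus", mappings, sandbox))
print(resolve_taxon("Pseudochorthippus parallelus", mappings, sandbox))
print(sandbox)
```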

robin-hutchinson commented 1 year ago

Thanks, I'll keep working on suggested mappings. Looking at this today, I found that https://jumear.github.io/stirfry/iNatAPIv1_taxa?q=chorthippus&is_active=false&per_page=500&rank=species gives synonyms - I haven't yet found a way to search the taxon tables on synonym id (synonym_id=501625 doesn't produce the correct result) but if I figured out how to do this, could we add code to search for the synonyms in the UKSI if a match for the recommended name isn't found, or download and use a dictionary table for junior synonyms? I can only download 10000 at a time which doesn't include everything.

johnvanbreda commented 1 year ago

If you can work out a mapping from the accepted taxon name on iNat to a list of the iNat synonyms then we could use that to attempt matches for records that fail on the first attempt. But, as the iNat API for an observation does not include the synonyms for the record (at least not as far as I can see from the examples), we would have to map from the iNat accepted name to the iNat synonym then see if that matches one of our names. That seems like an extra complexity vs just having a simple mapping table that lists iNat names to the UKSI equivalent.

robin-hutchinson commented 1 year ago

Thanks John - I think you are right - I was just worried about iNaturalist updating in future without us being aware so the new iNat to UKSI would not be included, but this would be covered by the sandbox list that we can periodically check for missing synonyms?

johnvanbreda commented 1 year ago

Yes, there is already a log table that lists the failures. We can easily extract the list of non-matched names from this log.
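Extracting the non-matched names might be as simple as counting failures per name. The structure of the log table is not described here, so the `taxon_name` field and the `failed_name_counts` function below are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def failed_name_counts(log_rows: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Count failed imports per taxon name, most frequent first."""
    counts = Counter(row["taxon_name"] for row in log_rows)
    return counts.most_common()


# Hypothetical log rows, one per failed record.
log = [
    {"taxon_name": "Tolmerus atricapillus"},
    {"taxon_name": "Pseudochorthippus parallelus"},
    {"taxon_name": "Tolmerus atricapillus"},
]
print(failed_name_counts(log))
```

Sorting by frequency puts the names that block the most records at the top of the review list.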

robin-hutchinson commented 1 year ago

Brilliant, thank you, that will be perfect then!

kitenetter commented 1 year ago

@robin-hutchinson and @kitenetter to review taxon-match table following update of UKSI within Indicia, then provide table to John.

kitenetter commented 1 year ago

@robin-hutchinson can you run a check against the updated Indicia table and let me see the results?

robin-hutchinson commented 1 year ago

No problem - the code is running now, I'll email you the generated files

kitenetter commented 1 year ago

@robin-hutchinson I started looking at the tables of un-matched names but it looks like the totals given under "observations_count" are for iNaturalist globally. Is it possible to re-run the query for the UK?

robin-hutchinson commented 1 year ago

Of course - I've got that column now, I'll just email them over.

kitenetter commented 1 year ago

@robin-hutchinson this version takes in the plant names that @Sam-Amy matched, plus a few more from the UKSI matching tool and other detective work. We now have matches for 163 of the 851 names on the unmatched list, including almost all of the taxa with over 100 records. Most of the missing names have fewer than 10 records each.

A few matched names appear to be identical to the iNat originals so I'm not sure why they weren't being recognised on import to iRecord.

Suggest we go with the 163 names that we can deal with, and some of the others may get picked up in future UKSI updates.

iNat UKSI No Match_MH.xlsx

johnvanbreda commented 1 year ago

Is this ready for me to work on setting up the matching?

kitenetter commented 1 year ago

@robin-hutchinson can you combine the 163 names from the latest spreadsheet into your longer list of matched names and pass that on to @johnvanbreda?

robin-hutchinson commented 1 year ago

All done - I'll reassign this to John

johnvanbreda commented 1 year ago

I've added code for a new mappings table to the develop branch. The table will need to be populated manually with the contents of the spreadsheet once it is deployed to live.

johnvanbreda commented 11 months ago

I'm having a few issues with the mappings table @robin-hutchinson. I need to map each row in the database to a single taxon name on the UKSI list. Ideally I don't want to just use the preferred name for the rTVK, as that loses detail on what name was actually used. I have a name_mapped field, so I can combine that with the rTVK field to select a name from UKSI to use. I sometimes get duplicates, mostly due to variations in authority strings, but I can get round that by just picking one and preferring names that are accepted. However, I also have names that don't match at all, e.g. for Taraxacum officinale the mapped name is given as "Taraxacum officinale F.H.Wigg. s.s." but we only have "Taraxacum officinale agg." with or without an authority of Weber. There are approx. 100 names that don't easily match up.

Is it possible to add a TVK for the mapped name to the spreadsheet? That would make the mapping much simpler.
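The duplicate handling described above (pick one candidate, preferring accepted names) could be sketched like this. `pick_name` is a hypothetical helper, and the `accepted` flags in the sample rows are invented for illustration; only the name and the Weber authority come from this thread.

```python
from typing import Dict, List, Optional


def pick_name(candidates: List[Dict[str, object]]) -> Optional[Dict[str, object]]:
    """Pick one candidate, preferring accepted names and otherwise keeping input order."""
    if not candidates:
        return None
    # sorted() is stable, so ties keep their original order.
    return sorted(candidates, key=lambda c: not c["accepted"])[0]


# The accepted flags below are made up to show the behaviour.
candidates = [
    {"name": "Taraxacum officinale agg.", "authority": "", "accepted": False},
    {"name": "Taraxacum officinale agg.", "authority": "Weber", "accepted": True},
]
print(pick_name(candidates))
```

With a TVK for the mapped name in the spreadsheet, this step would only be needed as a fallback.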

robin-hutchinson commented 11 months ago

Hi John - sure I will work on that and post the updated table here, thanks!

Sam-Amy commented 11 months ago

That sounds suspiciously like the number of names I matched using the BSBI DDb taxon parser, so perhaps it doesn't work quite as I thought. I will have a look too @robin-hutchinson if you could give me a copy of the final spreadsheet please?

kitenetter commented 10 months ago

@robin-hutchinson can you check whether the translation table includes iNat's "Helix lucorum", which needs to be matched to the UKSI "Helix (Helix) lucorum". (We've had an enquiry from Conch Soc.)

kitenetter commented 10 months ago

@robin-hutchinson another query relating to Conch Soc - please can you check that the translation table is handling slug names correctly as detailed in #1154