EnquistLab / RTNRS

R package for the (plant) Taxonomic Name Resolution Service
https://bien.nceas.ucsb.edu/bien/tools/tnrs/

Undocumented behaviour: duplicated input names have their IDs lumped together #15

Open Rekyt opened 1 year ago

Rekyt commented 1 year ago

I found an undocumented behavior of the package that I wasn't expecting. I understand why, from an API standpoint, it is important to avoid unnecessary duplicate queries and save resources.

However, TNRS doesn't document that identical input names are lumped together in the query. It would be nice to document this behavior to avoid bad surprises when using the ID column to make joins after matching names.

reprex:

# Submit the same name twice with different IDs
taxa_frame = data.frame(
  ID = paste0("test-", 1:2),
  name = c("Helianthus", "Helianthus")
)

matched = TNRS::TNRS(taxa_frame)

# IDs are mixed
matched[, 1:5]
#>              ID Name_submitted Overall_score Name_matched_id Name_matched
#> 1 test-2,test-1     Helianthus             1          668749   Helianthus

# It's the same for sequential match
seq_match = TNRS::TNRS(taxa_frame$name)
seq_match[, 1:5]
#>    ID Name_submitted Overall_score Name_matched_id Name_matched
#> 1 2,1     Helianthus             1          668749   Helianthus

Created on 2023-02-14 with reprex v2.0.2

ojalaquellueva commented 1 year ago

Hi @Rekyt. This is known behavior, related to how the Perl parallelization module combines duplicate names in the same batch. I've never been a fan of this feature myself. The rather minor performance gains of combining duplicate names into a single request are outweighed by the complexity of dealing with concatenated IDs.

When querying the API directly, I add a bit of post-processing code which detects and splits responses with concatenated IDs into separate records. I realize this is a bit complex for many users, but I am hesitant to make fundamental changes to the behavior of the API itself, which would be disruptive for applications which hit the API directly (including some internal BIEN applications).
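A minimal sketch of what such post-processing could look like in base R. This is not the code used for the BIEN applications: `split_concatenated_ids()` is a hypothetical helper, the data frame is a hand-built mock of the output columns from the reprex above (the second row is made up for illustration), and it assumes the user-supplied IDs never contain a comma themselves.

```r
# Mock of a few TNRS() output columns; no API call is made here.
# Row 1 mirrors the reprex above, row 2 is an illustrative extra row.
matched <- data.frame(
  ID = c("test-2,test-1", "test-3"),
  Name_submitted = c("Helianthus", "Quercus"),
  Name_matched = c("Helianthus", "Quercus"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return one row per original ID.
# Assumes IDs are joined with `sep` and do not contain `sep` themselves.
split_concatenated_ids <- function(res, id_col = "ID", sep = ",") {
  # Split each ID string into its component IDs
  ids <- strsplit(as.character(res[[id_col]]), sep, fixed = TRUE)
  # Repeat each row once per component ID
  out <- res[rep(seq_len(nrow(res)), times = lengths(ids)), , drop = FALSE]
  # Overwrite the concatenated IDs with the individual ones
  out[[id_col]] <- trimws(unlist(ids))
  rownames(out) <- NULL
  out
}

split_matched <- split_concatenated_ids(matched)
```

After splitting, the ID column holds one ID per row again, so it can be used as a join key. The rows follow the order of the IDs inside each concatenated string, which is not necessarily the input order.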

@bmaitner what do you think about adding post-processing code, similar to what I describe above, to the R package to split concatenated IDs? Perhaps controllable via a parameter for users who have already written code to handle concatenated IDs and wish to keep the original behavior? Alternatively, you could simply mention this behavior in the RBIEN documentation and let users handle it themselves. Let me know what you think.

Rekyt commented 1 year ago

Thanks for the quick answer @ojalaquellueva! I understand why it's the case now, and for the sake of simplicity I would advise just documenting the behavior.

I actually didn't realize until today (!) that the IDs were pasted together. And it got me when trying to merge submitted names to output through the ID column.

It took me a few lines of code to handle the concatenated IDs, so it shouldn't be a major problem. Maybe provide a specific function to use after TNRS()?
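For illustration, one way to sidestep the concatenated IDs entirely is to join on the submitted name instead of the ID. The sketch below uses a hand-built mock of the reprex output rather than a live `TNRS()` call, and it assumes `Name_submitted` is returned exactly as it was submitted, which holds in the reprex but is not confirmed in general.

```r
# Input from the reprex: the same name submitted under two IDs
taxa_frame <- data.frame(
  ID = paste0("test-", 1:2),
  name = c("Helianthus", "Helianthus"),
  stringsAsFactors = FALSE
)

# Mock of the TNRS() output shown in the reprex (no API call)
matched <- data.frame(
  ID = "test-2,test-1",
  Name_submitted = "Helianthus",
  Name_matched_id = 668749,
  Name_matched = "Helianthus",
  stringsAsFactors = FALSE
)

# Look up each input name in the output; duplicated names map to the same row.
# Assumption: Name_submitted matches the input name character for character.
idx <- match(taxa_frame$name, matched$Name_submitted)

# Attach the matched columns to the input, keeping input order and input IDs
joined <- cbind(taxa_frame, matched[idx, c("Name_matched_id", "Name_matched")])
rownames(joined) <- NULL
```

Here `joined` keeps one row per input ID, in the original order. Names with no counterpart in the output get `NA` in the matched columns.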

alrichardbollans commented 10 months ago

Slightly unrelated, but where do the Name_matched_ids come from? They don't seem to be from the source (e.g. WCVP). Is there a way to extract the source ID (e.g. Rosa inermis Turra is given a Name_matched_id of 100178, but its ID in WCVP is 2970427)?

ojalaquellueva commented 10 months ago

@alrichardbollans Name_matched_id is an internal integer identifier, guaranteed to be unique within the TNRS database. This is a helpful comment for two reasons. One, it reminds me that we need to make the original source database identifiers available in all cases (we do not do this consistently). Two, it provides a clue that the faulty identifiers in the POWO hyperlinks may be due to a mix-up between the source-specific (POWO) identifiers and the TNRS identifiers.