Open Rekyt opened 1 year ago
Hi @Rekyt. This is known behavior, related to how the Perl parallelization module combines duplicate names in the same batch. I've never been a fan of this feature myself. The rather minor performance gains of combining duplicate names into a single request are outweighed by the complexity of dealing with concatenated IDs.
When querying the API directly, I add a bit of post-processing code which detects and splits responses with concatenated IDs into separate records. I realize this is a bit complex for many users, but I am hesitant to make fundamental changes to the behavior of the API itself, which would be disruptive for applications which hit the API directly (including some internal BIEN applications).
@bmaitner what do you think about adding post-processing code, similar to what I describe above, to the R package to split concatenated IDs? Perhaps controllable via a parameter for users who have already written code to handle concatenated IDs and wish to keep the original behavior? Alternatively, you could simply mention this behavior in the RBIEN documentation and let users handle it themselves. Let me know what you think.
Thanks for the quick answer @ojalaquellueva! I understand why it's the case now, and would advise, for the sake of simplicity, to simply document the behavior.
I actually didn't realize until today (!) that the IDs were pasted together. And it got me when trying to merge submitted names to output through the ID column.
It took me a few lines of code to handle the concatenated IDs, so it shouldn't be a major problem. Maybe provide a specific function to use after TNRS()?
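For anyone landing here, a minimal sketch of that kind of post-processing in base R. Everything in it is an assumption for illustration: the helper name split_concatenated_ids() is hypothetical (it is not part of the TNRS package), the separator between pasted IDs is assumed to be a comma (check your own output and adjust sep), and the example data frame is made up rather than real TNRS output.

```r
# Hypothetical helper (not part of the TNRS package): expand rows whose ID
# column holds several input IDs pasted together into one row per ID.
# `sep` is an ASSUMPTION; inspect your own results to find the real separator.
split_concatenated_ids <- function(results, id_col = "ID", sep = ",") {
  # Split each ID string on the assumed separator
  ids <- strsplit(as.character(results[[id_col]]), sep, fixed = TRUE)
  # Repeat each row once per ID it contains
  out <- results[rep(seq_len(nrow(results)), times = lengths(ids)), , drop = FALSE]
  # Overwrite the ID column with the individual (whitespace-trimmed) IDs
  out[[id_col]] <- trimws(unlist(ids))
  rownames(out) <- NULL
  out
}

# Made-up example: inputs 1 and 3 were the same name and came back lumped together
results <- data.frame(
  ID = c("1,3", "2"),
  Name_submitted = c("Rosa inermis", "Quercus alba"),
  stringsAsFactors = FALSE
)
split_results <- split_concatenated_ids(results)
```

After this, split_results has one row per submitted ID, so a join back to the input table on the ID column should work as expected. Note that the IDs come back as character strings, so convert them with as.integer() if the input IDs are numeric.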
Slightly unrelated, but where do the Name_matched_ids come from? They don't seem to be from the source (e.g. WCVP). Is there a way to extract the source ID? (E.g. Rosa inermis Turra is given a Name_matched_id of 100178, but its ID in WCVP is 2970427.)
@alrichardbollans Name_matched_id is an internal integer identifier, guaranteed to be unique within the TNRS database. This is a helpful comment for two reasons. One, it reminds me that we need to make the original source database identifiers available in all cases (we do not do this consistently), and two, it provides a clue that the faulty identifiers in the POWO hyperlinks may be due to a mix-up between the source-specific (POWO) identifiers and the TNRS identifiers.
I found an undocumented behavior of the package, and as such, I wasn't expecting it. I understand from an API standpoint why it is important to avoid unnecessary duplicate queries and save resources. However, TNRS() doesn't document the fact that identical input names are going to be lumped together in the query. It would be nice to document this behavior to avoid any bad surprises when using the ID columns to make joins after matching names.
reprex:
Created on 2023-02-14 with reprex v2.0.2