RNAcentral / rnacentral-webcode

RNAcentral website source code
https://rnacentral.org
Apache License 2.0

Duplicate entries in text search #557

Closed AntonPetrov closed 2 years ago

AntonPetrov commented 2 years ago

While investigating a text search results export issue reported by a user, I discovered that some entries exist in the production text search index twice, for example:

https://www.ebi.ac.uk/ebisearch/ws/rest/rnacentral?query=URS00003F222E_9606
https://rnacentral.org/search?q=URS00003F222E_9606

I confirmed that the entry is found in the XML dumps twice:

[nightly]$ zgrep URS00003F222E_9606 *.xml.gz
xml4dbdumps__8550001__8600001.xml.gz
xml4dbdumps__8600001__8650001.xml.gz
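The same check can be generalised from one ID to the whole dump directory. The following is a minimal sketch, not the pipeline's actual tooling: it assumes each record in the dumps opens with an `<entry id="...">` tag (an assumption about the XML layout), and the function name `find_duplicate_entries` is hypothetical. It keeps every ID in memory, so a full production dump may need a disk-backed approach instead.

```python
import gzip
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# Assumption: every record starts with a tag of the form <entry id="URS..._taxid">.
ENTRY_ID = re.compile(r'<entry\s+id="([^"]+)"')


def find_duplicate_entries(dump_dir: str, pattern: str = "*.xml.gz") -> Dict[str, List[str]]:
    """Map each entry ID that occurs more than once to the files containing it."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(Path(dump_dir).glob(pattern)):
        # Read the gzip file as text, line by line, without unpacking it to disk.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                for entry_id in ENTRY_ID.findall(line):
                    seen[entry_id].append(path.name)
    # Keep only IDs recorded more than once (across files or within one file).
    return {entry_id: files for entry_id, files in seen.items() if len(files) > 1}


if __name__ == "__main__":
    for entry_id, files in find_duplicate_entries(".").items():
        print(entry_id, *files)
```

Run from the directory holding the dumps, it prints one line per duplicated ID followed by the files in which it was found.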

In the dev search index the entry occurs only once: https://wwwdev.ebi.ac.uk/ebisearch/ws/rest/rnacentral?query=URS00003F222E_9606

Because this URS ID appears more than once in the EBI text search results, it is listed twice in the ID file that esl-sfetch uses to extract FASTA sequences. esl-sfetch extracts only the sequences that occur before the duplicate ID and then exits without crashing. As a result, an incomplete FASTA file is served to the user.

I will modify the webcode to deduplicate the IDs so that the export results are more accurate. The text search index still needs to be fixed, though, as there are presumably other duplicates.
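As an illustration of the webcode-side workaround, here is a minimal sketch of order-preserving deduplication applied to the ID list before it is written to the file read by esl-sfetch. The function name `deduplicate_ids` is hypothetical and this is not the actual rnacentral-webcode change.

```python
from typing import Iterable, List


def deduplicate_ids(ids: Iterable[str]) -> List[str]:
    """Return the IDs in first-seen order, dropping blank lines and repeats."""
    # Strip surrounding whitespace so "ID" and "ID\n" count as the same entry.
    cleaned = (raw.strip() for raw in ids)
    # dict preserves insertion order (Python 3.7+), so the first occurrence wins.
    return list(dict.fromkeys(entry for entry in cleaned if entry))


if __name__ == "__main__":
    hits = ["URS00003F222E_9606", "URS0000000001_9606", "URS00003F222E_9606"]
    print(deduplicate_ids(hits))
```

Keeping the original order means the exported FASTA file follows the same ranking as the search results.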

blakesweeney commented 2 years ago

Appears to be fixed in the latest release.