EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code

primary keys not unique: 'AABF01000026.1/2165-2294' occurs more than once #67

Closed arslan9732 closed 1 year ago

arslan9732 commented 1 year ago

Hi, I'm trying to index the Rfam.fa file with the following command: esl-sfetch --index Rfam.fa

but I got this error:

Failed to write keys to ssi file Rfam.fa.ssi:
  primary keys not unique: 'AABF01000026.1/2165-2294' occurs more than once

Can you please help me to resolve this issue?

traviswheeler commented 1 year ago

The error message is telling you that your fasta file contains at least two sequences with the name "AABF01000026.1/2165-2294". Names need to be unique in order for esl-sfetch to be able to index/search the file.
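To see which names are repeated before indexing, something like the following Python sketch may help. It assumes the name is the first whitespace-delimited token after ">" on each header line; the function names are made up for illustration and are not part of Easel.

```python
from collections import Counter


def fasta_names(lines):
    """Yield the name of each FASTA record.

    Assumption: the name is the first whitespace-delimited token
    after '>' on the header line.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def duplicate_names(lines):
    """Return {name: count} for every name that occurs more than once."""
    counts = Counter(fasta_names(lines))
    return {name: n for name, n in counts.items() if n > 1}
```

Both functions take any iterable of lines, so an open file handle works: `with open("Rfam.fa") as handle: print(duplicate_names(handle))`. An empty result means every name is unique under the assumption above.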

arslan9732 commented 1 year ago

But I am using the Rfam.fa file from the current release of Rfam. Does the Rfam database contain redundant sequence names? If so, more than one name may be duplicated.

traviswheeler commented 1 year ago

Ah - I'll give another quick response, but maybe Eddy lab folks will have a different take:

I can't be sure what's in the file you're trying to search (Rfam.fa).

In any case, it looks like some RNA sequences appear in multiple Rfam families. For example, "AF311056.1/10510-10592" is found in both RF03536 and RF03547. That seems undesirable to me ... but maybe there's some reason that makes sense to Rfam developers? What this means is that some sequences will appear more than once in the .seed file, or in a file made by concatenating all Rfam .fasta files. The result: esl-sfetch can't index such a file, because not all sequences in it have a unique name.

I don't know your use case, so I'm not sure exactly what steps you should take ... but I do know that you'll need to somehow remove replicates if you're going to use indexing/search tools on the sequence set.
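One possible way to remove replicates is sketched below in Python: keep the first record seen for each name and drop later records with the same name. This is only one policy, and it assumes the name is the first whitespace-delimited token after ">". It does not check whether the dropped records carry the same sequence, and the function names are hypothetical.

```python
def dedupe_fasta(lines):
    """Yield FASTA lines, skipping records whose name was already seen.

    Assumption: the record name is the first whitespace-delimited
    token after '>' on the header line. The first record with a
    given name is kept; later ones are dropped entirely.
    """
    seen = set()
    keep = True  # lines before the first header are passed through
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = name not in seen
            seen.add(name)
        if keep:
            yield line


def dedupe_file(src_path, dst_path):
    """Write a deduplicated copy of src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(dedupe_fasta(src))
```

For example, `dedupe_file("Rfam.fa", "Rfam.dedup.fa")` (the output filename is arbitrary) would write a copy with unique names, which could then be passed to esl-sfetch --index. If the family membership of each copy matters for your use case, dropping later copies may not be the right choice.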

ppgardne commented 1 year ago

Looks like those families should either be merged, or in a clan (probably merged based on the names). In other news, is the Rfam website super slow for anyone else, or is that just because I'm on the other side of the planet?

AntonPetrov commented 1 year ago

Tagging @blakesweeney and @emmaco from @RfamDB as they are best placed to comment about duplicates in Rfam.fa.

blakesweeney commented 1 year ago

Hi, thank you for reaching out about this. We are currently regenerating the Rfam.fa file without any duplicates and will let you know once it is ready.

arslan9732 commented 1 year ago

@AntonPetrov @blakesweeney Thank you. I will be waiting.

emmaco commented 1 year ago

Hi there! The updated file is available now http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/Rfam.fa.gz Please let me know if you have any further issues.

arslan9732 commented 1 year ago

Yeah, it works. Thank you.

blakesweeney commented 1 year ago

Just a quick update: we noticed the Rfam.fa file was incorrect after deduplication, but we have since fixed it. Please use the latest version.