RNAcentral / rnacentral-sequence-search

RNAcentral sequence search cloud infrastructure
https://rnacentral.org/sequence-search
Apache License 2.0

Set the Z value dynamically according to the database used #103

Open carlosribas opened 4 years ago

carlosribas commented 4 years ago

If someone searches only in miRBase, the Z value should be miRBase-specific.

carlosribas commented 4 years ago

Hi @blakesweeney. Just for the record, I added the esl-seqstat command to rnacentral-import-pipeline. The idea is to put this file somewhere where I can download and parse the results.
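A minimal sketch of the parsing step mentioned above, assuming the summary output of esl-seqstat contains a `Total # residues:` line as in the sample below. The function names, the sample numbers, and the choice to express Z in millions of residues are assumptions for illustration, not the code used in rnacentral-import-pipeline.

```python
import re

# Assumed shape of an esl-seqstat summary; the numbers are made up.
SAMPLE = """\
Format:              FASTA
Alphabet type:       RNA
Number of sequences: 1500
Total # residues:    3000000
Smallest:            18
Largest:             2900
Average length:      2000.0
"""


def total_residues(seqstat_output: str) -> int:
    """Return the count on the 'Total # residues:' line."""
    match = re.search(r"^Total # residues:\s+(\d+)", seqstat_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Total # residues:' line found")
    return int(match.group(1))


def z_value(residues_by_db: dict, selected: list) -> float:
    """Sum residues over the selected databases, in millions of residues.

    The unit is an assumption; adjust it to whatever the search tool expects.
    """
    return sum(residues_by_db[name] for name in selected) / 1_000_000
```

With per-database counts stored in a dictionary, a search restricted to miRBase would call something like `z_value(counts, ["mirbase"])`, so the Z value follows the databases the user selected.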

carlosribas commented 4 years ago

Hey @blakesweeney! There is a problem running the esl-seqstat command on the PDBe file:

$ esl-seqstat pdbe-0.fasta
Parse failed (sequence file pdbe-0.fasta): Line 6316: illegal character F

The same F character also appears on lines 7466 and 12603. Any suggestions on how to solve this without fixing the file manually?

blakesweeney commented 4 years ago

Without looking at those sequences, I'm betting they are tRNAs and the F character is the amino acid attached to them. There are likely other cases with different characters as well. The easiest thing to do would be to exclude those sequences from search, but I'm not sure that is a good idea. Another choice is to strip those characters off the sequence, which has other possible issues. I'd lean toward doing a very crude modification of the sequences to strip off things that are not ACGU, from the end of tRNA sequences only, but that is something that @AntonPetrov would need to weigh in on.
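The crude modification described above could look roughly like this. It is only a sketch: the function name is hypothetical, it assumes the sequence has already been identified as a tRNA, and it removes only a trailing run of characters outside ACGU (either case), leaving anything in the middle of the sequence untouched.

```python
import re

# Matches one or more characters that are not A, C, G or U at the very end.
_TRAILING_NON_ACGU = re.compile(r"[^ACGUacgu]+$")


def strip_trailing_non_acgu(sequence: str) -> str:
    """Drop a trailing run of non-ACGU characters from a tRNA sequence."""
    return _TRAILING_NON_ACGU.sub("", sequence)
```

For example, `strip_trailing_non_acgu("GCGGAUUUAGCUCACCAF")` drops the final `F`, while a sequence with an `F` in the middle is returned unchanged and would still need to be excluded or handled some other way.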

AntonPetrov commented 4 years ago

This is not a new problem: in previous releases we generated a special fasta file for the old search (the _excluded file contained all the exceptional sequences): http://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/13.0/sequences/.internal/

Is it possible to continue excluding some sequences from sequence search as before?

blakesweeney commented 4 years ago

Sure, we can exclude them like we do currently. I'll add that filtering step to this export as well.
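For reference, a filtering step of this kind might be sketched as below. This is not the actual export code: the function names are hypothetical, and the allowed alphabet (ACGUT plus the IUPAC ambiguity codes) is an assumption about what the search tools accept.

```python
# Assumed set of acceptable nucleotide characters (IUPAC codes, upper case).
ALLOWED = set("ACGUTRYSWKMBDHVN")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records):
    """Split records into (kept, excluded) by whether every character is allowed."""
    kept, excluded = [], []
    for header, sequence in records:
        target = kept if set(sequence.upper()) <= ALLOWED else excluded
        target.append((header, sequence))
    return kept, excluded
```

The `excluded` list plays the role of the `_excluded` file from earlier releases, and `kept` is what would be passed on to esl-seqstat and the search.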