Once all genomes corresponding to reference proteomes are stored on disk (see #87), they need to be processed into the Rfamseq file structure that is expected by the Rfam pipeline.
To check the current location of Rfamseq, look in the $RFAM_CONFIG file as the rfamprod user. The data is stored in several files:
rfamseq_14_3.fa - all sequences concatenated in 1 giant file
rfamseq_14_3.fa.ssi - a binary index file to speed up sequence retrieval using esl-sfetch
100 files named r100_rfamseq14_3_XXX.fa.gz - sequences split into 100 roughly equally sized files to parallelise searches
10 files named rev-rfamseq14_3_1.fa.gz - 10% of randomly selected sequences from rfamseq that are reversed. They act as a negative control because reversed sequences are effectively random and are not expected to contain real RNA sequences (unless they are palindromic).
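The reversed negative-control set could be produced roughly as in the sketch below. It is only an illustration, not the existing Rfam code: the function names, the `rev-` header prefix and the fixed seed are assumptions, and it holds all records in memory, which would not scale to the full Rfamseq without streaming or an index.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reversed_sample(records, fraction=0.1, seed=0):
    """Pick a random fraction of records and reverse each sequence.

    The sequences are reversed only, not reverse-complemented.
    The "rev-" header prefix is a hypothetical naming choice.
    """
    records = list(records)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(records), max(1, round(len(records) * fraction)))
    chosen = rng.sample(records, k)
    return [(f"rev-{header}", seq[::-1]) for header, seq in chosen]
```

The selected records would then be written out and gzipped into the rev-rfamseq files.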
We need a new Nextflow pipeline that will take the input from #87 and generate these files automatically. See previous documentation for specific commands.
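As one possible starting point for the 100-way split, the sketch below assigns sequences to chunks so that the total number of residues per chunk is roughly equal. It uses a greedy longest-first heuristic; whether the existing files were balanced by residue count or by number of sequences is not stated here, so treat that as an assumption. `split_by_size` is a hypothetical name.

```python
import heapq


def split_by_size(lengths, n_bins=100):
    """Group sequence names into n_bins with roughly equal total length.

    lengths: dict mapping sequence name -> sequence length.
    Returns a list of n_bins lists of names.
    """
    # Min-heap of (current total residues, bin index).
    heap = [(0, i) for i in range(n_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(n_bins)]
    # Place the longest sequences first, always into the lightest bin.
    for name, length in sorted(lengths.items(), key=lambda kv: kv[1], reverse=True):
        total, i = heapq.heappop(heap)
        bins[i].append(name)
        heapq.heappush(heap, (total + length, i))
    return bins
```

Each resulting list of names could then be fetched from the indexed rfamseq file and written to its own r100 chunk.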
Update the seqdb section of $RFAM_CONFIG.
dbsize is very important - if it is calculated incorrectly, all E-values reported to curators and users will be wrong
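A minimal sketch of a dbsize calculation follows. It assumes that dbsize is expressed in megabases and that both strands are counted, which matches how Infernal's `cmsearch -Z` takes the search space size in Mb; this assumption should be checked against the previous documentation before the value is written to the config. The function names are hypothetical.

```python
def count_residues(fasta_lines):
    """Count residues in FASTA lines, skipping header lines."""
    return sum(len(line.strip()) for line in fasta_lines if not line.startswith(">"))


def dbsize_mb(total_residues, both_strands=True):
    """Search-space size in megabases (Mb).

    Assumption: both strands are searched, so the residue count is doubled.
    """
    strands = 2 if both_strands else 1
    return total_residues * strands / 1_000_000
```

For example, a database with 1,500,000 residues would give a dbsize of 3.0 under the both-strands assumption and 1.5 otherwise, so an error in this choice alone changes every E-value by a factor of two.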
Relevant existing code: