Rfam / rfam-production

Rfam production pipeline

Apache License 2.0

Create a concatenated file and split it into parts #88

Open AntonPetrov opened 2 years ago

AntonPetrov commented 2 years ago

Once all genomes corresponding to reference proteomes are stored on disk (see #87), they need to be processed into the Rfamseq file structure that is expected by the Rfam pipeline.

To check the current location of Rfamseq, look in the $RFAM_CONFIG file as the rfamprod user. The data is stored in several files:

- rfamseq_14_3.fa - all sequences concatenated into one giant file
- rfamseq_14_3.fa.ssi - a binary index file that speeds up sequence retrieval with esl-sfetch
- 100 files named r100_rfamseq14_3_XXX.fa.gz - the sequences split into 100 roughly equal-sized files to parallelise searches
- 10 files named rev-rfamseq14_3_1.fa.gz - 10% of the sequences from rfamseq, selected at random and reversed. They act as a negative control because reversed sequences are not expected to contain real RNA sequences (unless they are palindromic).
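The split and reversal steps could be sketched roughly as follows. This is only an illustration of the logic, not the existing Rfam code: the function names (`read_fasta`, `split_balanced`, `reversed_sample`) are hypothetical, and the greedy "largest sequence into the smallest part" strategy is an assumption about how "roughly equally sized" might be achieved.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_balanced(records: Iterable[Record], n_parts: int) -> List[List[Record]]:
    """Greedy split: longest sequences first, each into the currently smallest part."""
    parts: List[List[Record]] = [[] for _ in range(n_parts)]
    sizes = [0] * n_parts
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = sizes.index(min(sizes))  # index of the part with the fewest residues so far
        parts[i].append(rec)
        sizes[i] += len(rec[1])
    return parts


def reversed_sample(records: List[Record], fraction: float, seed: int = 0) -> List[Record]:
    """Pick `fraction` of the records at random and reverse (not complement) each sequence."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    k = min(len(records), max(1, round(len(records) * fraction)))
    return [(header, seq[::-1]) for header, seq in rng.sample(records, k)]
```

For example, `split_balanced(read_fasta(open("rfamseq_14_3.fa")), 100)` would produce the 100 parts and `reversed_sample(records, 0.10)` the reversed negative-control set. Note that this sketch holds every sequence in memory, so a real pipeline working on the full Rfamseq would need a streaming approach.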

We need a new Nextflow pipeline that takes the output of #87 as its input and generates these files automatically. See previous documentation for specific commands.

Update the seqdb section of $RFAM_CONFIG:
- dbsize is very important - if it is calculated incorrectly, all E-values reported to curators and users will be wrong
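As a rough illustration of the dbsize calculation, the sketch below counts residues in a FASTA file and converts the total to megabases. It assumes that dbsize follows the same convention as Infernal's `cmsearch -Z` option (search space in Mb) and that both strands are counted; both assumptions should be checked against the previous documentation before the value is written to $RFAM_CONFIG. The function names are hypothetical.

```python
from typing import Iterable


def total_residues(lines: Iterable[str]) -> int:
    """Count sequence residues in FASTA-formatted lines, skipping header lines."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def dbsize_mb(n_residues: int, both_strands: bool = True) -> float:
    """Convert a residue count to megabases.

    ASSUMPTION: the count is doubled when both strands are searched.
    """
    strands = 2 if both_strands else 1
    return n_residues * strands / 1_000_000
```

Under these assumptions, a database of 1,500,000 residues would give `dbsize_mb(1_500_000) == 3.0`. Easel's `esl-seqstat` also reports the total residue count and may be a quicker way to obtain the input number for a file as large as Rfamseq.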

Relevant existing code: