Rfam / rfam-production

Rfam production pipeline

Create a concatenated file and split it into parts #88

Open AntonPetrov opened 2 years ago

AntonPetrov commented 2 years ago

Once all genomes corresponding to reference proteomes are stored on disk (see #87), they need to be processed into the Rfamseq file structure that is expected by the Rfam pipeline.

To check the current location of Rfamseq, look in the $RFAM_CONFIG file as the rfamprod user. The data is stored in several files:

We need a new Nextflow pipeline that takes the output of #87 as its input and generates these files automatically. See the previous documentation for the specific commands.
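As a starting point for discussion, here is a minimal Python sketch of the "concatenate and split" step named in the title. It is not the existing Rfam code: the function names (`read_fasta_records`, `concatenate_and_split`), the output file names, and the choice to split by a fixed number of sequences per part are all assumptions made for illustration. The actual Rfamseq file layout and chunking rule should come from the previous documentation, and in the pipeline this logic would sit inside a Nextflow process.

```python
import gzip
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def read_fasta_records(paths: Iterable[Path]) -> Iterator[str]:
    """Yield one FASTA record (header line plus sequence lines) at a time."""
    for path in paths:
        # Assumption: genome files are plain or gzip-compressed FASTA.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt") as handle:
            record: List[str] = []
            for line in handle:
                if line.startswith(">") and record:
                    yield "".join(record)
                    record = []
                if line.strip():
                    # Make sure every line ends with a newline so that
                    # records from different files never run together.
                    record.append(line if line.endswith("\n") else line + "\n")
            if record:
                yield "".join(record)


def concatenate_and_split(
    paths: Iterable[Path],
    out_dir: Path,
    records_per_part: int = 1000,  # hypothetical chunking rule
    prefix: str = "rfamseq",       # hypothetical file name prefix
) -> Tuple[Path, List[Path]]:
    """Write one concatenated FASTA file plus numbered part files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    combined = out_dir / f"{prefix}_all.fa"
    part_paths: List[Path] = []
    part_handle = None
    try:
        with open(combined, "w") as all_handle:
            for index, record in enumerate(read_fasta_records(paths)):
                # Start a new part every `records_per_part` sequences.
                if index % records_per_part == 0:
                    if part_handle is not None:
                        part_handle.close()
                    part_path = out_dir / f"{prefix}_{len(part_paths) + 1:03d}.fa"
                    part_handle = open(part_path, "w")
                    part_paths.append(part_path)
                all_handle.write(record)
                part_handle.write(record)
    finally:
        if part_handle is not None:
            part_handle.close()
    return combined, part_paths
```

Streaming record by record keeps memory use flat regardless of genome size, which matters if the input covers all reference proteomes. Splitting by total sequence length instead of record count would only require changing the condition that starts a new part.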

Relevant existing code: