Rfam uses the UniProt collection of reference proteomes to get a set of complete, non-redundant, and representative genomes to analyse. Together, these genomes form Rfamseq, the Rfam sequence database that is searched when we make families.
For each proteome, there is an XML file with a link to the genome assembly hosted at ENA, Ensembl, or NCBI. For example, UP000318527 corresponds to the GCA_007827375.1 assembly.
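As a minimal sketch of the lookup described above, assembly accessions can be pulled out of a proteome XML file with a schema-agnostic pattern match. The function name `find_assembly_accessions` is hypothetical, and the sketch assumes only that the accession appears somewhere in the file as text in the usual GCA_/GCF_ format. It does not rely on any particular UniProt XML element name.

```python
import re

# Matches GenBank (GCA_) and RefSeq (GCF_) assembly accessions,
# for example GCA_007827375.1. The pattern is an assumption about
# how the accession is written in the XML, not a UniProt schema rule.
ASSEMBLY_RE = re.compile(r"\bGC[AF]_\d{9}\.\d+\b")


def find_assembly_accessions(xml_text: str) -> list[str]:
    """Return unique assembly accessions in order of first appearance."""
    seen: list[str] = []
    for accession in ASSEMBLY_RE.findall(xml_text):
        if accession not in seen:
            seen.append(accession)
    return seen


if __name__ == "__main__":
    # Invented fragment for illustration only; real files will look different.
    fragment = "<assembly>GCA_007827375.1</assembly>"
    print(find_assembly_accessions(fragment))
```

A proper XML parser would be more robust once the exact element holding the accession has been confirmed.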
We need to create a new Nextflow pipeline to perform the following steps:
Download the corresponding genome for each proteome
I suggest using the NCBI APIs for sequence retrieval (see the efetch documentation). Use API_KEY to get higher API access limits (10 requests per second instead of 3).
In the past we used the official enaBrowserTools with mixed success
Note that WGS records may require special handling
The resulting data will comprise tens of thousands of files and will be quite large (~0.5 TB?), so it needs to be stored efficiently on disk. For example, the FASTA file for UP000318527 can be stored under the folder structure UP00/03/18/527. Consider storing the files gzipped?
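The download step could be sketched roughly as follows. The helper names `efetch_fasta_url` and `proteome_dir` are hypothetical. The sketch assumes that sequences are fetched one accession at a time from the nuccore database, and that the assembly has already been resolved to its individual sequence accessions, which is not shown here. The sharding scheme simply follows the UP00/03/18/527 example and assumes proteome IDs are "UP" followed by nine digits.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, api_key: Optional[str] = None) -> str:
    """Build an efetch URL that returns one nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    if api_key:
        # The API key raises the request rate limit.
        params["api_key"] = api_key
    return f"{EFETCH_URL}?{urlencode(params)}"


def proteome_dir(upid: str) -> PurePosixPath:
    """Map a proteome ID to a sharded directory, e.g. UP000318527 -> UP00/03/18/527."""
    # Assumption: IDs are 'UP' plus nine digits.
    if len(upid) != 11 or not upid.startswith("UP") or not upid[2:].isdigit():
        raise ValueError(f"Unexpected proteome ID: {upid!r}")
    return PurePosixPath(upid[:4], upid[4:6], upid[6:8], upid[8:])


if __name__ == "__main__":
    # NC_000913.3 is used purely as an illustrative sequence accession.
    print(efetch_fasta_url("NC_000913.3"))
    print(proteome_dir("UP000318527"))
```

Splitting the ID into short path components keeps the number of entries per directory small, which tends to matter on shared file systems. WGS records would likely need a separate code path.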
Run validation
Some sequences may not be fetched or may be fetched only partially. The pipeline should check the total length of each downloaded genome and compare it with the expected genome length (we used esl-seqstat in the past).
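The length check could look something like the sketch below. It is a plain Python stand-in for illustration, not a replacement for esl-seqstat. The function names are hypothetical, and the source of the expected length (for example, assembly metadata) is an assumption left to the caller.

```python
import gzip


def total_sequence_length(fasta_path: str) -> int:
    """Sum the residue counts of all records in a FASTA file (plain or .gz)."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    total = 0
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            # Header lines start with '>'; every other line is sequence.
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def is_complete(fasta_path: str, expected_length: int) -> bool:
    """Return True when the downloaded genome matches the expected total length."""
    return total_sequence_length(fasta_path) == expected_length
```

A genome that fails the check could be queued for a retry instead of failing the whole run.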
It may be possible to re-use some existing Python code (look for files with "genome" in the file name).
Currently UniProt contains just over 20K reference proteomes: https://www.uniprot.org/proteomes/?query=*&fil=reference%3Ayes
Bonus points: Use pre-release UniProt reference proteomes to get early access to the latest set of proteomes/genomes. Check with Dushi how to do it.