Download genomes based on UniProt reference proteomes

Rfam uses the UniProt collection of reference proteomes to get a set of complete, non-redundant, and representative genomes to analyse. Together, these genomes form Rfamseq, the Rfam sequence database that is searched when we make families.

Currently UniProt contains just over 20K reference proteomes: https://www.uniprot.org/proteomes/?query=*&fil=reference%3Ayes

For each proteome, there is an XML file with a link to the genome assembly hosted at ENA, Ensembl, or NCBI. For example, UP000318527 corresponds to the GCA_007827375.1 assembly.
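
As a rough starting point, the assembly accession could be pulled out of a proteome XML file without depending on the exact schema. The sketch below assumes only that the accession appears as element text or as an attribute value and follows the usual GCA_/GCF_ pattern. The function name `find_assembly_accessions` and the simplified `EXAMPLE_XML` are made up for illustration and are not the real UniProt layout.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: assembly accessions look like GCA_007827375.1 or GCF_000000000.1
ASSEMBLY_RE = re.compile(r"\bGC[AF]_\d{9}\.\d+\b")


def find_assembly_accessions(xml_text):
    """Return unique assembly accessions found anywhere in the XML, in document order."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        # Look at the element text and at every attribute value
        candidates = [element.text or ""] + list(element.attrib.values())
        for value in candidates:
            for accession in ASSEMBLY_RE.findall(value):
                if accession not in found:
                    found.append(accession)
    return found


# Hypothetical, simplified XML used only to exercise the function
EXAMPLE_XML = """<proteome>
  <upid>UP000318527</upid>
  <genomeAssembly>
    <genomeAssembly>GCA_007827375.1</genomeAssembly>
  </genomeAssembly>
</proteome>"""

print(find_assembly_accessions(EXAMPLE_XML))
```

A schema-aware parser would be stricter, but a pattern search like this keeps working if element names or namespaces change between UniProt releases.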

We need to create a new Nextflow pipeline to perform the following steps:

Fetch a list of reference proteomes from UniProt
Download the corresponding genome for each proteome
- I suggest using NCBI APIs for sequence retrieval - see efetch documentation. Use API_KEY to get higher API access limits.
- In the past we used the official enaBrowserTools with mixed success
- Note that WGS records may require special handling
- The resulting data will comprise tens of thousands of files and will be quite large (~0.5 TB?), so it needs to be stored efficiently on disk. For example, the FASTA file for UP000318527 can be stored under the folder structure UP00/03/18/527. Consider storing the files gzipped?
Run validation
- Some sequences may not be fetched or may be fetched only partially. The pipeline should check the total length of each downloaded genome and compare it with the expected genome length (we used esl-seqstat in the past).
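
To make the steps above more concrete, here is a minimal Python sketch of three helpers the pipeline could call: building an efetch URL, mapping a proteome ID to its folder, and checking the downloaded length. All function names are hypothetical. It assumes that the genome has already been resolved to nucleotide sequence accessions (efetch against `nuccore` does not take an assembly accession directly), that UPIDs are always `UP` followed by nine digits, and that a pure-Python length count is an acceptable stand-in for esl-seqstat.

```python
import gzip
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions, api_key=None):
    """Build an efetch URL returning FASTA for nucleotide accessions."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    if api_key:
        # An API key raises the allowed request rate
        params["api_key"] = api_key
    return EFETCH_URL + "?" + urlencode(params)


def proteome_dir(upid):
    """Map a proteome ID to a nested folder, e.g. UP000318527 -> UP00/03/18/527."""
    if len(upid) != 11 or not upid.startswith("UP") or not upid[2:].isdigit():
        raise ValueError("unexpected proteome ID: %r" % upid)
    return "/".join([upid[:4], upid[4:6], upid[6:8], upid[8:]])


def fasta_total_length(path):
    """Sum the sequence lengths in a FASTA file (plain or gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    total = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def is_complete(path, expected_length, tolerance=0.0):
    """True if the observed length is within `tolerance` (a fraction) of the expected one."""
    observed = fasta_total_length(path)
    return abs(observed - expected_length) <= tolerance * expected_length
```

In a Nextflow process these would be called once per proteome: download to `proteome_dir(upid)`, then fail or retry the task when `is_complete` returns False. WGS records would probably still need a separate code path.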

It may be possible to re-use some existing Python code:

genome_validation.py
genome_search_utils.py
genome_size_calculator.py
code under https://github.com/Rfam/rfam-production/tree/master/pipelines
any other code with genome in the file name

Bonus points: Use pre-release UniProt reference proteomes to get early access to the latest set of proteomes/genomes. Check with Dushi how to do it.
