Rfam / rfam-production

Rfam production pipeline
Apache License 2.0

Download genomes based on UniProt reference proteomes #87

Open AntonPetrov opened 2 years ago

AntonPetrov commented 2 years ago

Rfam uses the UniProt collection of reference proteomes to get a set of complete, non-redundant, and representative genomes to analyse. Together, these genomes form Rfamseq, the Rfam sequence database that is searched when we make families.

Currently UniProt contains just over 20K reference proteomes: https://www.uniprot.org/proteomes/?query=*&fil=reference%3Ayes

For each proteome, there is an XML file with a link to the genome assembly hosted at ENA, Ensembl, or NCBI. For example, UP000318527 corresponds to the GCA_007827375.1 assembly.
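As a rough illustration of the lookup described above, the sketch below pulls assembly accessions out of a proteome XML document. It does not assume any particular UniProt schema: it scans every element's text and attributes for strings shaped like GenBank/RefSeq assembly accessions (`GCA_`/`GCF_`, nine digits, a version). The function name `find_assembly_accessions` and the `SAMPLE_XML` snippet are hypothetical, made up for this example; the real XML layout should be checked against an actual download.

```python
import re
import xml.etree.ElementTree as ET

# Assembly accessions look like GCA_007827375.1 (GenBank) or GCF_... (RefSeq).
ASSEMBLY_RE = re.compile(r"\bGC[AF]_\d{9}\.\d+\b")


def find_assembly_accessions(xml_text):
    """Return unique assembly accessions found in the XML, in document order.

    Scans element text and attribute values rather than relying on
    specific tag names, since the exact schema is an assumption here.
    """
    root = ET.fromstring(xml_text)
    seen = []
    for element in root.iter():
        candidates = [element.text or ""] + list(element.attrib.values())
        for value in candidates:
            for accession in ASSEMBLY_RE.findall(value):
                if accession not in seen:
                    seen.append(accession)
    return seen


# Hypothetical, heavily simplified stand-in for a proteome XML file.
SAMPLE_XML = """<proteome>
  <upid>UP000318527</upid>
  <genomeAssembly>
    <genomeAssembly>GCA_007827375.1</genomeAssembly>
  </genomeAssembly>
</proteome>"""

if __name__ == "__main__":
    print(find_assembly_accessions(SAMPLE_XML))
```

Pattern matching keeps the sketch independent of tag names, at the cost of also picking up any unrelated accession that happens to appear in the file. A production step would likely read the specific assembly element once the schema is confirmed.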

We need to create a new Nextflow pipeline to perform the following steps:

It may be possible to re-use some existing Python code:

Bonus points: Use pre-release UniProt reference proteomes to get early access to the latest set of proteomes/genomes. Check with Dushi how to do it.