Refactor genomes.py duplication

BurkovBA commented 6 years ago

Currently, genome handling logic is scattered all over the codebase and we rely on crutches to do format conversions. This chaos needs to go.

I suggest the following roadmap:

I'll look into the genome-related code in:
- database
- python and django models and serializers methods logic
- data import pipelines
- urls
- Genoverse genome-browser
- text search and Lucene index
- user-readable representations on website (backend and frontend-generated)
- hyperlinks generation for external resources (E!, UCSC, ...)

I'll create a github issue with hyperlinks for Anton and Blake to quickly recap.

Anton and Blake, using the hyperlinks I provide, refresh their memory of the whole problem and come up with their visions of:
- how this should be done
- how to get from where we are to where we need to be ASAP
We do a short meeting and agree on what formats we're using for genome names in each part of our site. I create meeting notes that will serve as a documentation prototype.
Using the meeting notes, I document the formats used to store data and the pipelines used to transfer it. I make this documentation available and we keep it up to date.
Following the documentation, we create a single data flow with well-defined interfaces and adapter functions for conversions between formats. This pipeline is used:

by data import pipelines to transport data from external sources to the database and python code
by backend code to retrieve data from DB to python/django models
by various frontend modules to request genomes from the backend
by various frontend modules to display data
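
To make the "adapter functions" idea concrete, here is a minimal sketch, not existing RNAcentral code. The `Genome` class, its fields and the function names (`to_url_slug`, `from_url_slug`, `to_display_name`) are all hypothetical, as is the choice of formats; the real ones would come out of the meeting notes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Genome:
    """Hypothetical canonical representation; every other format is derived from it."""
    species: str   # scientific name, e.g. "Homo sapiens"
    assembly: str  # assembly name, e.g. "GRCh38"
    taxid: int     # NCBI taxonomy id


def to_url_slug(genome: Genome) -> str:
    """Adapter: canonical record -> lowercase, underscore-separated URL form."""
    return genome.species.lower().replace(" ", "_")


def from_url_slug(slug: str, registry: Dict[str, Genome]) -> Genome:
    """Adapter: URL form -> canonical record, looked up in a slug-keyed registry."""
    return registry[slug]


def to_display_name(genome: Genome) -> str:
    """Adapter: canonical record -> user-readable label for the website."""
    return f"{genome.species} ({genome.assembly})"


# Usage: build the registry once, then convert only through the adapters.
human = Genome(species="Homo sapiens", assembly="GRCh38", taxid=9606)
registry = {to_url_slug(human): human}
```

The point of the design is that each consumer (urls, search, Genoverse, templates) calls one adapter instead of re-implementing string manipulation locally.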

We rewrite our code to use this pipeline and remove any duplications of logic and ad-hoc code.

We can download all the available genomes from the E! public MySQL database into our own database table.

Then we can get rid of config/genomes.py and similar code on the frontend, and expose genomes through a REST API endpoint.

This script is an example of how to retrieve genome information from the E! public MySQL database: https://github.com/RNAcentral/rnacentral-webcode/blob/master/rnacentral/portal/management/commands/update_ensembl_genome_mapping.py
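
As a rough illustration of one step such an import could involve (this is not taken from the linked script), the sketch below parses Ensembl core database names of the form `<species>_core_<release>_<assembly version>`, for example `homo_sapiens_core_94_38`, as they would appear in a `SHOW DATABASES` listing. The `CoreDb` type and `parse_core_db_name` function are hypothetical names, and the regex deliberately covers only this simple three-part pattern.

```python
import re
from typing import NamedTuple, Optional


class CoreDb(NamedTuple):
    """Hypothetical record for one Ensembl core database."""
    species: str           # e.g. "homo_sapiens"
    release: int           # Ensembl release number, e.g. 94
    assembly_version: str  # trailing component of the name, e.g. "38"


# Assumption: only names shaped like <species>_core_<release>_<version> are handled.
_CORE_DB_RE = re.compile(
    r"^(?P<species>[a-z0-9_]+)_core_(?P<release>\d+)_(?P<version>[0-9a-z]+)$"
)


def parse_core_db_name(name: str) -> Optional[CoreDb]:
    """Return the parsed parts of a core database name, or None if it does not match."""
    match = _CORE_DB_RE.match(name)
    if match is None:
        return None
    return CoreDb(
        species=match.group("species"),
        release=int(match.group("release")),
        assembly_version=match.group("version"),
    )


# Usage: filter a database listing down to core databases.
listing = ["homo_sapiens_core_94_38", "information_schema", "mus_musculus_core_94_38"]
core_dbs = [db for db in map(parse_core_db_name, listing) if db is not None]
```

Rows parsed this way could then be written into our own genomes table instead of being hard-coded in config/genomes.py.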

We also have multiple functions tied to genomes, such as Xref.get_ucsc_db_id(), Xref.get_ensembl_division() and Accession.get_ensembl_species_url().
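
One possible way to fold such helpers into a single source of truth is sketched below. It does not reflect how the existing methods are implemented; `GenomeRecord`, `GENOMES`, `DIVISION_HOSTS` and the two functions are hypothetical, and the table holds hand-written sample rows for illustration where the real data would come from the database.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class GenomeRecord:
    """Hypothetical per-genome row holding everything the helpers need."""
    ensembl_url_name: str      # species name as used in Ensembl URLs
    division: str              # e.g. "Ensembl" or "EnsemblPlants"
    ucsc_db_id: Optional[str]  # None when no UCSC assembly is recorded


# Sample rows keyed by NCBI taxonomy id (illustrative values only).
GENOMES: Dict[int, GenomeRecord] = {
    9606: GenomeRecord("Homo_sapiens", "Ensembl", "hg38"),
    3702: GenomeRecord("Arabidopsis_thaliana", "EnsemblPlants", None),
}

# Division -> website host; only two divisions are listed in this sketch.
DIVISION_HOSTS: Dict[str, str] = {
    "Ensembl": "www.ensembl.org",
    "EnsemblPlants": "plants.ensembl.org",
}


def ensembl_species_url(taxid: int) -> str:
    """Build the species page URL from the shared table."""
    record = GENOMES[taxid]
    return f"https://{DIVISION_HOSTS[record.division]}/{record.ensembl_url_name}"


def ucsc_db_id(taxid: int) -> Optional[str]:
    """Look up the UCSC database id from the same table."""
    return GENOMES[taxid].ucsc_db_id
```

With a table like this, the model methods become thin wrappers around one lookup rather than separate pieces of genome logic.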

blakesweeney commented 6 years ago

How often do we need to run this? If it is something we should run when we import E! data, I would prefer to add it to the pipeline as part of the Ensembl update. pgloader supports pulling from a MySQL database into a Postgres one: http://pgloader.readthedocs.io/en/latest/ref/mysql.html.

AntonPetrov commented 6 years ago

This would need to run every time Ensembl is updated so it's a good idea to merge this script with the Ensembl import pipeline.

Not sure if pgloader can help here because we need to pull data from several tables across multiple Ensembl databases.

blakesweeney commented 6 years ago

Ok, I can work on adding it as part of the import pipeline later then. I'll aim for after I update Ensembl data for this release.

RNAcentral / rnacentral-webcode

Refactor genomes.py duplication #321