Closed BurkovBA closed 6 years ago
How often do we need to run this? If it is something we should run whenever we import E! data, I would prefer to add it to the pipeline as part of the Ensembl update. pgloader supports pulling from a MySQL database into a Postgres one: http://pgloader.readthedocs.io/en/latest/ref/mysql.html
This would need to run every time Ensembl is updated so it's a good idea to merge this script with the Ensembl import pipeline.
Not sure if pgloader can help here because we need to pull data from several tables across multiple Ensembl databases.
Ok, I can work on adding it as part of the import pipeline later then. I'll aim for after I update Ensembl data for this release.
Currently, genome handling logic is scattered all over the codebase and we rely on workarounds to convert between formats. This chaos needs to go.
I suggest the following roadmap:
I'll create a github issue with hyperlinks for Anton and Blake to quickly recap.
Anton and Blake use the hyperlinks I provided to refresh their memory of the whole problem and come up with their visions of:
We do a short meeting and agree on what formats we're using for genome names in each part of our site. I create meeting notes that will serve as a documentation prototype.
Using the meeting notes, I document the formats used to store data and the pipelines used to transfer it. I make this documentation available and we keep it up to date.
Following the documentation, we create a single data flow with well-defined interfaces and adapter functions for conversions between formats. This pipeline is used:
We can download all the available genomes from the E! public MySQL database into our own database table.
Then we can get rid of `config/genomes.py` and similar code on the frontend, and expose genomes through a REST API endpoint. This script is an example of how to retrieve genome information from the E! public MySQL database: https://github.com/RNAcentral/rnacentral-webcode/blob/master/rnacentral/portal/management/commands/update_ensembl_genome_mapping.py
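To make the "download genomes into our own table" step a bit more concrete, here is a rough sketch of how the list of genomes could be derived from the database names on the E! public MySQL server (for example from the output of `SHOW DATABASES`). It assumes the vertebrate naming convention `<species>_core_<release>_<assembly version>`, e.g. `homo_sapiens_core_95_38`. The names `CoreDatabase`, `parse_core_db_name` and `latest_core_databases` are hypothetical and are not taken from the linked script.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional

# Assumed naming convention: <species>_core_<release>_<assembly version>,
# e.g. "homo_sapiens_core_95_38". Other Ensembl divisions may differ.
CORE_DB_PATTERN = re.compile(
    r"^(?P<species>[a-z0-9_]+?)_core_(?P<release>\d+)_(?P<assembly>[0-9a-z]+)$"
)


class CoreDatabase(NamedTuple):
    """Hypothetical record describing one Ensembl core database."""
    species: str
    release: int
    assembly_version: str


def parse_core_db_name(name: str) -> Optional[CoreDatabase]:
    """Split a core database name into its parts; return None for other databases."""
    match = CORE_DB_PATTERN.match(name)
    if match is None:
        return None
    return CoreDatabase(match["species"], int(match["release"]), match["assembly"])


def latest_core_databases(names: Iterable[str]) -> Dict[str, CoreDatabase]:
    """Keep only the most recent release for each species."""
    latest: Dict[str, CoreDatabase] = {}
    for name in names:
        parsed = parse_core_db_name(name)
        if parsed is None:
            continue  # skip variation, funcgen and other non-core databases
        current = latest.get(parsed.species)
        if current is None or parsed.release > current.release:
            latest[parsed.species] = parsed
    return latest
```

The resulting records could then be written to our own genomes table and served from the REST endpoint, so the frontend no longer needs its own hard-coded copy.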
We also have multiple functions tied to genomes, such as `Xref.get_ucsc_db_id`, `Xref.get_ensembl_division()` and `Accession.get_ensembl_species_url()`.
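One possible direction, sketched below, is to move these per-model helpers onto a single genome record, so that every format conversion lives in one place. Everything here is hypothetical: the `Genome` class, its fields, the `GENOMES` lookup and `get_genome` are illustrative only, and the assumption that the Ensembl URL form is the lowercased species name with underscores should be checked against the existing methods.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Genome:
    """Hypothetical single source of truth for one genome."""
    species: str     # display name, e.g. "Homo sapiens"
    assembly: str    # assembly name, e.g. "GRCh38"
    ucsc_db_id: str  # UCSC database id, e.g. "hg38"
    division: str    # Ensembl division label (illustrative value below)

    @property
    def ensembl_url_name(self) -> str:
        # Assumption: the URL form is the species name in lowercase,
        # with spaces replaced by underscores.
        return self.species.strip().lower().replace(" ", "_")


# Illustrative entries only; in practice these rows would come from the
# genomes table populated by the Ensembl import pipeline.
GENOMES: Dict[str, Genome] = {
    genome.ensembl_url_name: genome
    for genome in (
        Genome("Homo sapiens", "GRCh38", "hg38", "Ensembl"),
        Genome("Mus musculus", "GRCm38", "mm10", "Ensembl"),
    )
}


def get_genome(species: str) -> Genome:
    """Look up a genome by species name in either display or URL form."""
    key = species.strip().lower().replace(" ", "_")
    return GENOMES[key]  # raises KeyError for unknown species
```

With something like this, `Xref` and `Accession` would only need to resolve their species to a `Genome` and read the attribute they need, instead of each carrying its own conversion logic.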