RNAcentral / rnacentral-webcode

RNAcentral website source code
https://rnacentral.org
Apache License 2.0

Refactor genomes.py duplication #321

Closed BurkovBA closed 6 years ago

BurkovBA commented 6 years ago

Currently, genome handling logic is scattered all over the codebase and we rely on ad-hoc workarounds to convert between formats. This chaos needs to go.

I suggest the following roadmap:

  1. I'll look into the genome-related code in:
    • database
    • python and django models and serializers methods logic
    • data import pipelines
    • urls
    • Genoverse genome-browser
    • text search and Lucene index
    • user-readable representations on website (backend and frontend-generated)
    • hyperlinks generation for external resources (E!, UCSC, ...)

I'll create a GitHub issue with hyperlinks so that Anton and Blake can quickly recap.

  2. Anton and Blake, using the hyperlinks I provided, refresh their memory of the whole problem and come up with their vision of:

    • how this should be done
    • how to get from where we are to where we need to be ASAP
  3. We hold a short meeting and agree on which formats we use for genome names in each part of our site. I write up meeting notes that will serve as a documentation prototype.

  4. Using the meeting notes, I document the formats used to store data and the pipelines that transfer it. I make this documentation available and we keep it up to date.

  5. Following the documentation, we create a single data flow with well-defined interfaces and adapter functions for conversions between formats. This pipeline describes is used:

  6. We rewrite our code to use this pipeline and remove any duplicated logic and ad-hoc code.
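To make the "adapter functions" idea concrete, here is a minimal Python sketch, not the actual RNAcentral code. It assumes a single canonical genome record and small functions that convert it to and from the other representations. The `Genome` class, its fields, and all function names are hypothetical, and it assumes Python 3.7+ for `dataclasses`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Genome:
    """Canonical in-memory record; every field name here is hypothetical."""
    scientific_name: str  # e.g. "Homo sapiens"
    assembly: str         # e.g. "GRCh38"
    taxid: int            # NCBI taxonomy id


def to_url_name(genome: Genome) -> str:
    """Adapter: canonical record -> lowercase, underscore-separated URL form."""
    return genome.scientific_name.lower().replace(" ", "_")


def from_url_name(url_name: str, assembly: str, taxid: int) -> Genome:
    """Adapter: URL form -> canonical record (only the genus is capitalised)."""
    words = url_name.split("_")
    name = " ".join([words[0].capitalize()] + words[1:])
    return Genome(scientific_name=name, assembly=assembly, taxid=taxid)


def to_display_label(genome: Genome) -> str:
    """Adapter: canonical record -> user-readable label for the website."""
    return f"{genome.scientific_name} ({genome.assembly})"


human = Genome("Homo sapiens", "GRCh38", 9606)
```

With something like this, every part of the site would call an adapter instead of re-implementing string munging locally, so a format change happens in one place.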

We can download all the available genomes from the Ensembl (E!) public MySQL database into our own database table.

Then we can get rid of config/genomes.py and the similar code on the frontend, and expose genomes through a REST API endpoint.
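A rough sketch of what "one table, one endpoint" could look like, using only the Python standard library. This is an assumption-laden illustration: the `genome` table, its columns, and `list_genomes_json` are hypothetical names, SQLite stands in for our real database, and in the actual site the payload would be served by the Django REST layer rather than built by hand.

```python
import json
import sqlite3


def create_genome_table(conn):
    """Create a hypothetical genome table (a stand-in for the real schema)."""
    conn.execute(
        "CREATE TABLE genome ("
        " url_name TEXT PRIMARY KEY,"
        " scientific_name TEXT NOT NULL,"
        " assembly TEXT NOT NULL,"
        " taxid INTEGER NOT NULL)"
    )


def list_genomes_json(conn):
    """Build the JSON body that a genomes list endpoint could return."""
    rows = conn.execute(
        "SELECT url_name, scientific_name, assembly, taxid "
        "FROM genome ORDER BY url_name"
    ).fetchall()
    payload = [
        {"url_name": r[0], "scientific_name": r[1], "assembly": r[2], "taxid": r[3]}
        for r in rows
    ]
    return json.dumps(payload)


conn = sqlite3.connect(":memory:")
create_genome_table(conn)
conn.executemany(
    "INSERT INTO genome VALUES (?, ?, ?, ?)",
    [
        ("mus_musculus", "Mus musculus", "GRCm38", 10090),
        ("homo_sapiens", "Homo sapiens", "GRCh38", 9606),
    ],
)
```

The frontend would then fetch this list instead of keeping its own copy of the genome definitions.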

This script is an example of how to retrieve genome information from the E! public MySQL database: https://github.com/RNAcentral/rnacentral-webcode/blob/master/rnacentral/portal/management/commands/update_ensembl_genome_mapping.py
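For orientation, here is a small sketch of the kind of processing such an import needs. It is not taken from the linked script. It assumes that Ensembl core databases are named `<species>_core_<release>_<assembly version>` and that each has a `meta` table of `meta_key`/`meta_value` rows; the specific meta keys in the query are assumptions to verify against the live schema, and the helper names are hypothetical.

```python
import re

# Assumed naming convention for core databases, e.g. "homo_sapiens_core_90_38".
CORE_DB_PATTERN = re.compile(
    r"^(?P<species>[a-z0-9_]+)_core_(?P<release>\d+)_(?P<assembly_version>\w+)$"
)

# Query to run inside each core database; the key names are assumptions.
META_QUERY = (
    "SELECT meta_key, meta_value FROM meta WHERE meta_key IN ("
    "'species.production_name', 'species.scientific_name', "
    "'species.taxonomy_id', 'assembly.default')"
)


def parse_core_db_name(db_name):
    """Return (species, release, assembly_version), or None for non-core names."""
    match = CORE_DB_PATTERN.match(db_name)
    if match is None:
        return None
    return (
        match.group("species"),
        int(match.group("release")),
        match.group("assembly_version"),
    )


def fold_meta(rows):
    """Collapse (meta_key, meta_value) rows into a dict, keeping the first value per key."""
    meta = {}
    for key, value in rows:
        meta.setdefault(key, value)
    return meta
```

In use, one would list the databases on the server, keep those for which `parse_core_db_name` returns a result, run `META_QUERY` in each, and pass the cursor rows to `fold_meta` before writing to our own table.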

We also have multiple functions tied to genomes, such as Xref.get_ucsc_db_id(), Xref.get_ensembl_division(), and Accession.get_ensembl_species_url().

blakesweeney commented 6 years ago

How often do we need to run this? If it is something we should run whenever we import E! data, I would prefer to add it to the pipeline as part of the Ensembl update. pgloader supports pulling from a MySQL database into a Postgres one: http://pgloader.readthedocs.io/en/latest/ref/mysql.html.

AntonPetrov commented 6 years ago

This would need to run every time Ensembl is updated, so it's a good idea to merge this script with the Ensembl import pipeline.

Not sure if pgloader can help here because we need to pull data from several tables across multiple Ensembl databases.

blakesweeney commented 6 years ago

Ok, I can work on adding it as part of the import pipeline later then. I'll aim for after I update Ensembl data for this release.