Replace reference genome field with TaxID organism

afrubin commented 2 years ago

We should remove the reference genome field and replace it with an entry from the NCBI Taxonomy resource.

The relevant information to store in MaveDB is:

Taxonomy ID
Species name
Common name
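
As a minimal sketch of the record described above, the three fields could be held in a small immutable structure. The class name `TargetOrganism` and the field names are assumptions for illustration (the name borrows from the "Target Organism model" task in this issue), not the actual MaveDB schema. Making the common name optional is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TargetOrganism:
    """Hypothetical record for one NCBI Taxonomy entry."""

    tax_id: int                        # NCBI Taxonomy ID, e.g. 9606
    species_name: str                  # e.g. "Homo sapiens"
    common_name: Optional[str] = None  # assumed optional in this sketch

    def __post_init__(self) -> None:
        # Taxonomy IDs are positive integers; reject anything else early.
        if self.tax_id <= 0:
            raise ValueError(f"invalid Taxonomy ID: {self.tax_id}")


human = TargetOrganism(tax_id=9606, species_name="Homo sapiens", common_name="human")
```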

We will want to prepopulate the database with relevant Taxonomy IDs and also allow users to enter a different valid Taxonomy ID if needed.

When a new Taxonomy ID is entered, the backend should fetch the details using an eutils API call and display the organism name in the form preview. This information can also be retrieved using the NCBI Datasets REST API. We can create the new entry in the database once the dataset is submitted.
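
The preview-then-save flow described here could look roughly like the sketch below. Everything in it is hypothetical: `OrganismRegistry` is an in-memory stand-in for the database, and `fake_fetch` replaces the remote lookup so the example runs offline. The point is only that a preview does not persist anything, while a submit does.

```python
from typing import Callable, Dict, Optional

# An organism is a plain dict with "tax_id", "organism_name", and
# "common_name" keys (an assumed shape for this sketch).
Organism = Dict[str, object]


class OrganismRegistry:
    """In-memory stand-in for the organism table (hypothetical)."""

    def __init__(self, fetch: Callable[[int], Optional[Organism]]):
        self._fetch = fetch  # e.g. a wrapper around the remote API call
        self._stored: Dict[int, Organism] = {}

    def preview(self, tax_id: int) -> Optional[Organism]:
        """Return details for the form preview without saving anything."""
        if tax_id in self._stored:
            return self._stored[tax_id]
        return self._fetch(tax_id)  # None when the lookup finds nothing

    def submit(self, tax_id: int) -> Organism:
        """Create the entry only when the dataset is submitted."""
        organism = self.preview(tax_id)
        if organism is None:
            raise ValueError(f"unknown Taxonomy ID: {tax_id}")
        return self._stored.setdefault(tax_id, organism)

    def is_stored(self, tax_id: int) -> bool:
        return tax_id in self._stored


def fake_fetch(tax_id: int) -> Optional[Organism]:
    # Offline stand-in for the remote lookup.
    known = {
        7227: {
            "tax_id": 7227,
            "organism_name": "Drosophila melanogaster",
            "common_name": "fruit fly",
        }
    }
    return known.get(tax_id)


registry = OrganismRegistry(fake_fetch)
```

Passing the fetch function in as a parameter keeps the sketch testable without network access. A real backend would plug in its own lookup and database session instead.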

The field in the website UI should be able to autocomplete based on the current name, common name, or TaxID. The API should accept the current name, common name, or TaxID, each as an exact match. See the entry for human as an example: an API call should be able to provide "human" instead of "9606".
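
To make the two matching behaviours concrete, here is a small sketch: exact matching for the API and prefix matching for the UI autocomplete. The function names, the in-memory `ORGANISMS` list, and the choice to match case-insensitively are all assumptions for illustration. The issue only specifies which three values should be accepted.

```python
from typing import Dict, List, Optional, Union

# Small sample of prepopulated organisms (hypothetical table contents).
ORGANISMS: List[Dict[str, object]] = [
    {"tax_id": 9606, "organism_name": "Homo sapiens", "common_name": "human"},
    {"tax_id": 7227, "organism_name": "Drosophila melanogaster", "common_name": "fruit fly"},
]

FIELDS = ("tax_id", "organism_name", "common_name")


def resolve(term: Union[str, int]) -> Optional[Dict[str, object]]:
    """Exact match on TaxID, current name, or common name (API behaviour)."""
    key = str(term).strip().lower()
    for org in ORGANISMS:
        if key in (str(org[field]).lower() for field in FIELDS):
            return org
    return None


def autocomplete(prefix: str) -> List[Dict[str, object]]:
    """Prefix match over the same three values (UI behaviour)."""
    key = prefix.strip().lower()
    if not key:
        return []
    return [
        org
        for org in ORGANISMS
        if any(str(org[field]).lower().startswith(key) for field in FIELDS)
    ]
```

With this sample data, `resolve("human")` and `resolve(9606)` return the same record, while `resolve("hum")` returns nothing because the API path only accepts exact matches.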

For synthetic sequences, we need to find out if Taxonomy has a suitable term for this, or if we need to define our own special term that's MaveDB-specific.

We can handle this issue in two steps:

[ ] Replace the Reference Genome model with a Target Organism model
[ ] Add support for adding new Target Organisms

afrubin commented 2 years ago

Here is the example API output from requesting TaxID 7227. The query URL was https://api.ncbi.nlm.nih.gov/datasets/v2alpha/taxonomy/taxon/7227. As we can see, this is fairly straightforward: the three fields we need are tax_id, organism_name, and common_name.

{
  "taxonomy_nodes": [
    {
      "query": [
        "7227"
      ],
      "taxonomy": {
        "tax_id": 7227,
        "organism_name": "Drosophila melanogaster",
        "common_name": "fruit fly",
        "lineage": [
          1,
          131567,
          2759,
          33154,
          33208,
          6072,
          33213,
          33317,
          1206794,
          88770,
          6656,
          197563,
          197562,
          6960,
          50557,
          85512,
          7496,
          33340,
          33392,
          7147,
          7203,
          43733,
          480118,
          480117,
          43738,
          43741,
          43746,
          7214,
          43845,
          46877,
          7215,
          32341,
          32346,
          32351
        ],
        "rank": "SPECIES",
        "has_described_species_name": true,
        "counts": [
          {
            "type": "COUNT_TYPE_GENE",
            "count": 17868
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 13968
          },
          {
            "type": "COUNT_TYPE_PSEUDO",
            "count": 310
          },
          {
            "type": "COUNT_TYPE_miscRNA",
            "count": 5
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 3132
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 134
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 319
          },
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 67
          }
        ]
      }
    }
  ]
}
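
A minimal sketch of pulling the three fields of interest out of a response shaped like the one above. The function names are hypothetical. The URL prefix is the one quoted in this comment and may change in later API versions. How the service responds to an invalid TaxID is not shown in this thread, so the parser simply returns `None` when the expected keys are missing.

```python
import json
import urllib.request
from typing import Dict, Optional

# URL prefix taken from the query quoted in this comment.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/taxonomy/taxon/"


def parse_taxon(payload: Dict) -> Optional[Dict[str, object]]:
    """Extract tax_id, organism_name, and common_name from a response dict."""
    nodes = payload.get("taxonomy_nodes") or []
    if not nodes:
        return None
    taxonomy = nodes[0].get("taxonomy")
    if not taxonomy or "tax_id" not in taxonomy:
        return None
    return {
        "tax_id": taxonomy["tax_id"],
        "organism_name": taxonomy.get("organism_name"),
        "common_name": taxonomy.get("common_name"),  # may be absent
    }


def fetch_taxon(tax_id: int, timeout: float = 10.0) -> Optional[Dict[str, object]]:
    """Request one TaxID and parse the result (requires network access)."""
    url = f"{BASE_URL}{int(tax_id)}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_taxon(json.load(response))


# Trimmed copy of the response above, used to exercise the parser offline.
sample = {
    "taxonomy_nodes": [
        {
            "query": ["7227"],
            "taxonomy": {
                "tax_id": 7227,
                "organism_name": "Drosophila melanogaster",
                "common_name": "fruit fly",
                "rank": "SPECIES",
            },
        }
    ]
}
record = parse_taxon(sample)
```

Keeping the parsing separate from the HTTP call means the lineage and counts can be ignored without touching the network code, and the parser can be tested against saved responses.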

afrubin commented 1 year ago

An additional clarification here is that this change should also get rid of the 'reference_maps' list, since that's quite clumsy and each target should only have one TaxID.

afrubin commented 1 year ago

@EstelleDa Here are files with the tax_id structures we want to store in the database (as JSON) and the tax_ids for each published score set target in the database. Note that many of the tax_ids we want to pre-load are not in published score sets, but are found in datasets that will be uploaded soon.

Attachments: published_score_set_tax_ids.csv, mavedb_taxon_objects.json.txt

jstone-dev commented 6 months ago

Released in version 2024.0.0 on 2024-04-15.

VariantEffect / mavedb-api

Replace reference genome field with TaxID organism #10