Rendering filament pages using NCBI dataset API

nekrut commented 2 weeks ago

This issue illustrates how the NCBI Datasets API can be used to generate the JSON blobs necessary for rendering filament pages (https://github.com/galaxyproject/brc-analytics/issues/130).

Linked Tickets

[ ] Doing API script in #159
[ ] Genomes list exploration in #177

Non-viral data

The initial set of taxa will be limited to these species: https://docs.google.com/spreadsheets/d/1Gg9sw2Qw765tOx2To53XkTAn-RAMiBtqYrfItlLXXrc/edit?usp=sharing

List view

The following API call is used:

curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2/taxonomy/dataset_report" \
 -H 'accept: application/json' \
 -H 'content-type: application/json' \
 -d '{"taxons":["Plasmodium falciparum","Plasmodium vivax","Plasmodium yoelii","Plasmodium vinckei","Culex pipiens","Anopheles gambiae","Toxoplasma gondii","Mycobacterium tuberculosis","Coccidioides posadasii","Coccidioides immitis"],"children":false,"ranks":["genus"]}'
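
For a scripted version of the same call, here is a minimal Python sketch using only the standard library. The function names `build_taxonomy_request` and `fetch_taxonomy_report` and the `timeout` default are illustrative assumptions, not part of any existing script; the URL, headers, and payload are copied from the curl call above.

```python
import json
import urllib.request

TAXONOMY_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/taxonomy/dataset_report"


def build_taxonomy_request(taxons):
    """Build (but do not send) the POST request from the curl call above."""
    # Same body as the curl -d argument; only the taxon list varies.
    payload = {"taxons": list(taxons), "children": False, "ranks": ["genus"]}
    return urllib.request.Request(
        TAXONOMY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "accept": "application/json",
            "content-type": "application/json",
        },
        method="POST",
    )


def fetch_taxonomy_report(taxons, timeout=30):
    """Send the request and decode the JSON body (requires network access)."""
    request = build_taxonomy_request(taxons)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect without touching the network.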

The API call generates the following response:

{
  "reports": [
    {
      "taxonomy": {
        "tax_id": 7165,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Anopheles gambiae",
          "authority": "Giles, 1902"
        },
        "curator_common_name": "African malaria mosquito",
        "group_name": "mosquitos",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Metazoa",
            "id": 33208
          },
          "phylum": {
            "name": "Arthropoda",
            "id": 6656
          },
          "class": {
            "name": "Insecta",
            "id": 50557
          },
          "order": {
            "name": "Diptera",
            "id": 7147
          },
          "family": {
            "name": "Culicidae",
            "id": 7157
          },
          "genus": {
            "name": "Anopheles",
            "id": 7164
          },
          "species": {
            "name": "Anopheles gambiae",
            "id": 7165
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          33208,
          6072,
          33213,
          33317,
          1206794,
          88770,
          6656,
          197563,
          197562,
          6960,
          50557,
          85512,
          7496,
          33340,
          33392,
          7147,
          7148,
          43786,
          41827,
          7157,
          43816,
          7164,
          44534,
          44537,
          44542
        ],
        "children": [
          180454
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 7
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 15164
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 422
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 615
          },
          {
            "type": "COUNT_TYPE_snRNA",
            "count": 27
          },
          {
            "type": "COUNT_TYPE_snoRNA",
            "count": 11
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 12518
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1209
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Anopheles gambiae"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5501,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Coccidioides immitis",
          "authority": "G.W. Stiles, 1896"
        },
        "group_name": "ascomycete fungi",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Fungi",
            "id": 4751
          },
          "phylum": {
            "name": "Ascomycota",
            "id": 4890
          },
          "class": {
            "name": "Eurotiomycetes",
            "id": 147545
          },
          "order": {
            "name": "Onygenales",
            "id": 33183
          },
          "family": {
            "name": "Onygenaceae",
            "id": 33184
          },
          "genus": {
            "name": "Coccidioides",
            "id": 5500
          },
          "species": {
            "name": "Coccidioides immitis",
            "id": 5501
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          4751,
          451864,
          4890,
          716545,
          147538,
          716546,
          147545,
          451871,
          33183,
          33184,
          5500
        ],
        "children": [
          246410,
          454286,
          404692,
          396776
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 5
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 9974
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 147
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 29
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 9797
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Coccidioides immitis"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 199306,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Coccidioides posadasii",
          "authority": "M.C. Fisher, G.L. Koenig, T.J. White & J.W. Taylor, 2002"
        },
        "group_name": "ascomycete fungi",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Fungi",
            "id": 4751
          },
          "phylum": {
            "name": "Ascomycota",
            "id": 4890
          },
          "class": {
            "name": "Eurotiomycetes",
            "id": 147545
          },
          "order": {
            "name": "Onygenales",
            "id": 33183
          },
          "family": {
            "name": "Onygenaceae",
            "id": 33184
          },
          "genus": {
            "name": "Coccidioides",
            "id": 5500
          },
          "species": {
            "name": "Coccidioides posadasii",
            "id": 199306
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          4751,
          451864,
          4890,
          716545,
          147538,
          716546,
          147545,
          451871,
          33183,
          33184,
          5500
        ],
        "children": [
          443226,
          469471
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 13
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 8510
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 163
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 2
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 8342
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Coccidioides posadasii"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 7175,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Culex pipiens",
          "authority": "Linnaeus, 1758"
        },
        "curator_common_name": "northern house mosquito",
        "group_name": "mosquitos",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Metazoa",
            "id": 33208
          },
          "phylum": {
            "name": "Arthropoda",
            "id": 6656
          },
          "class": {
            "name": "Insecta",
            "id": 50557
          },
          "order": {
            "name": "Diptera",
            "id": 7147
          },
          "family": {
            "name": "Culicidae",
            "id": 7157
          },
          "genus": {
            "name": "Culex",
            "id": 7174
          },
          "species": {
            "name": "Culex pipiens",
            "id": 7175
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          33208,
          6072,
          33213,
          33317,
          1206794,
          88770,
          6656,
          197563,
          197562,
          6960,
          50557,
          85512,
          7496,
          33340,
          33392,
          7147,
          7148,
          43786,
          41827,
          7157,
          43817,
          53550,
          7174,
          53527,
          518105
        ],
        "children": [
          1833972,
          38569,
          42434,
          233155
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 5
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 19673
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 686
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 155
          },
          {
            "type": "COUNT_TYPE_snRNA",
            "count": 58
          },
          {
            "type": "COUNT_TYPE_snoRNA",
            "count": 9
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 16298
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1620
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Culex pipiens"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 1773,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Mycobacterium tuberculosis",
          "authority": "(Zopf 1883) Lehmann and Neumann 1896 (Approved Lists 1980)",
          "basionym": {
            "name": "\"Bacterium tuberculosis\"",
            "authority": "Zopf 1883",
            "notes": [
              {
                "name": "Effective Name",
                "note": "This is an effectively published name.",
                "note_classifier": "effective_name"
              }
            ]
          }
        },
        "group_name": "high G+C Gram-positive bacteria",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Bacteria",
            "id": 2
          },
          "kingdom": {
            "name": "Bacillati",
            "id": 1783272
          },
          "phylum": {
            "name": "Actinomycetota",
            "id": 201174
          },
          "class": {
            "name": "Actinomycetes",
            "id": 1760
          },
          "order": {
            "name": "Mycobacteriales",
            "id": 85007
          },
          "family": {
            "name": "Mycobacteriaceae",
            "id": 1762
          },
          "genus": {
            "name": "Mycobacterium",
            "id": 1763
          },
          "species": {
            "name": "Mycobacterium tuberculosis",
            "id": 1773
          }
        },
        "parents": [
          1,
          131567,
          2,
          1783272,
          201174,
          1760,
          85007,
          1762,
          1763,
          77643
        ],
        "children": [
          1427330,
          1427329
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 7819
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 4008
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 45
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 3
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 3906
          },
          {
            "type": "COUNT_TYPE_miscRNA",
            "count": 2
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 20
          },
          {
            "type": "COUNT_TYPE_OTHER",
            "count": 2
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Mycobacterium tuberculosis"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5833,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium falciparum"
        },
        "curator_common_name": "malaria parasite P. falciparum",
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium falciparum",
            "id": 5833
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418107
        ],
        "children": [
          478864,
          1036723
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 67
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5618
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 45
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 28
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5285
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 102
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium falciparum"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5860,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium vinckei",
          "authority": "(Rodhain, 1952)"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium vinckei",
            "id": 5860
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418101
        ],
        "children": [
          54757,
          138298,
          138297,
          119398
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 10
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5147
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 67
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 11
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5050
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium vinckei"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5855,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium vivax",
          "authority": "(Grassi & Feletti, 1890)"
        },
        "curator_common_name": "malaria parasite P. vivax",
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium vivax",
            "id": 5855
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418103
        ],
        "children": [
          31273,
          126793,
          1035514,
          1035515,
          882766,
          1077284,
          1033975
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 19
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5513
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 44
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 22
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5395
          },
          {
            "type": "COUNT_TYPE_miscRNA",
            "count": 10
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium vivax"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5861,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium yoelii"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium yoelii",
            "id": 5861
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418101
        ],
        "children": [
          73239,
          1050261,
          31274,
          283801,
          1323249,
          1050262
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 15
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 6233
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 52
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 39
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 6037
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 47
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium yoelii"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5811,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Toxoplasma gondii"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Conoidasida",
            "id": 1280412
          },
          "order": {
            "name": "Eucoccidiorida",
            "id": 75739
          },
          "family": {
            "name": "Sarcocystidae",
            "id": 5809
          },
          "genus": {
            "name": "Toxoplasma",
            "id": 5810
          },
          "species": {
            "name": "Toxoplasma gondii",
            "id": 5811
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          1280412,
          5796,
          75739,
          423054,
          5809,
          5810
        ],
        "children": [
          933077,
          398031
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 29
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 8925
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 183
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 424
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 8318
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Toxoplasma gondii"
      ]
    }
  ],
  "total_count": 10
}

From this response, we would like to render the following fields on a page (only two rows shown):

[ ]	Taxon	TaxId	# Assemblies	Tags
[ ]	Anopheles gambiae	7165	7	Vector
[ ]	Coccidioides immitis	5501	5	Fungi

These are populated from:

Taxon = (reports -> taxonomy -> current_scientific_name -> name)
TaxId = (reports -> taxonomy -> tax_id)
# Assemblies = (reports -> taxonomy -> counts -> count of the entry with type COUNT_TYPE_ASSEMBLY; this is counts[0] in the response above)
Tags = custom, added by us
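
The mapping above could be applied to the response roughly as follows. This is a sketch, not the actual rendering code: `assembly_count`, `list_view_rows`, and the `tags_by_taxid` lookup are hypothetical names, and the tags are assumed to be supplied by us as a dictionary keyed by tax id. The assembly count is looked up by its `type` rather than by position, on the assumption that the order of `counts` is not guaranteed.

```python
def assembly_count(taxonomy):
    """Return the COUNT_TYPE_ASSEMBLY count, or None if it is absent."""
    for entry in taxonomy.get("counts", []):
        if entry.get("type") == "COUNT_TYPE_ASSEMBLY":
            return entry.get("count")
    return None


def list_view_rows(response, tags_by_taxid=None):
    """Flatten a taxonomy dataset_report response into list-view rows."""
    tags_by_taxid = tags_by_taxid or {}
    rows = []
    for report in response.get("reports", []):
        taxonomy = report["taxonomy"]
        tax_id = taxonomy["tax_id"]
        rows.append(
            {
                "Taxon": taxonomy["current_scientific_name"]["name"],
                "TaxId": tax_id,
                "# Assemblies": assembly_count(taxonomy),
                # Tags are not in the NCBI response; they come from our own lookup.
                "Tags": tags_by_taxid.get(tax_id, ""),
            }
        )
    return rows
```

Calling `list_view_rows(response, {7165: "Vector", 5501: "Fungi"})` on the response above would produce one dictionary per species, with the two tagged rows matching the table.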

Genomes page

Now let's suppose that on the previous page a user checked both the Anopheles gambiae and Coccidioides immitis checkboxes and clicked the "Go to Genomes" button.

This is equivalent to issuing the following GET request:

https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/7165%2C5501/dataset_report?filters.assembly_source=refseq&filters.has_annotation=true&filters.exclude_paired_reports=true&filters.exclude_atypical=true&filters.assembly_level=scaffold&filters.assembly_level=chromosome&filters.assembly_level=complete_genome
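
A small sketch of how that URL could be assembled from the selected tax ids. The names `GENOME_BASE`, `DEFAULT_FILTERS`, and `genome_report_url` are assumptions made for illustration; the filter names and values are taken verbatim from the request above. The filters are kept as a list of pairs because `filters.assembly_level` repeats.

```python
from urllib.parse import quote, urlencode

GENOME_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon"

# Filters copied from the GET request above; assembly_level appears three times.
DEFAULT_FILTERS = [
    ("filters.assembly_source", "refseq"),
    ("filters.has_annotation", "true"),
    ("filters.exclude_paired_reports", "true"),
    ("filters.exclude_atypical", "true"),
    ("filters.assembly_level", "scaffold"),
    ("filters.assembly_level", "chromosome"),
    ("filters.assembly_level", "complete_genome"),
]


def genome_report_url(tax_ids, filters=DEFAULT_FILTERS):
    """Build the genome dataset_report URL for the selected tax ids."""
    # Tax ids are comma-joined and percent-encoded, so "," becomes "%2C".
    taxon_path = quote(",".join(str(tax_id) for tax_id in tax_ids), safe="")
    return f"{GENOME_BASE}/{taxon_path}/dataset_report?{urlencode(filters)}"
```

`genome_report_url([7165, 5501])` reproduces the request above for the two checked species.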

The response will be rendered as the following genome page:

[ ]	Taxon	TaxId	Accession	IsRef	Level	# Chr	Len	# Scaffolds	Scaffold N50	Scaffold L50	Coverage	GC%	Ann Status
[ ]	Anopheles gambiae	7165	GCF_943734735.2	Yes	Chromosome	3	264451381	190	99149756	2	54.0x	44.5	Full annotation

Taxon = organism -> organism_name
TaxId = organism -> tax_id
Accession = accession
IsRef = assembly_info -> refseq_category
Level = assembly_info -> assembly_level
# Chr = assembly_stats -> total_number_of_chromosomes
Len = assembly_stats -> total_sequence_length
# Scaffolds = assembly_stats -> number_of_scaffolds
Scaffold N50 = assembly_stats -> scaffold_n50
Scaffold L50 = assembly_stats -> scaffold_l50
GC% = assembly_stats -> gc_percent
Ann Status = annotation_info -> status
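
The column mapping above can be expressed as a lookup table of key paths. This is only a sketch under stated assumptions: `dig`, `GENOME_COLUMNS`, and `genome_row` are hypothetical names, each genome report is assumed to be a nested dictionary with the keys listed above, and missing keys yield `None` instead of raising. Coverage is left out because the mapping above does not name a field for it, and the raw `refseq_category` value is passed through without converting it to Yes/No.

```python
def dig(record, *keys):
    """Follow a path of keys through nested dicts; None if any step is missing."""
    for key in keys:
        if not isinstance(record, dict):
            return None
        record = record.get(key)
    return record


# Column name -> path inside one genome report, as listed in the mapping above.
GENOME_COLUMNS = {
    "Taxon": ("organism", "organism_name"),
    "TaxId": ("organism", "tax_id"),
    "Accession": ("accession",),
    "IsRef": ("assembly_info", "refseq_category"),
    "Level": ("assembly_info", "assembly_level"),
    "# Chr": ("assembly_stats", "total_number_of_chromosomes"),
    "Len": ("assembly_stats", "total_sequence_length"),
    "# Scaffolds": ("assembly_stats", "number_of_scaffolds"),
    "Scaffold N50": ("assembly_stats", "scaffold_n50"),
    "Scaffold L50": ("assembly_stats", "scaffold_l50"),
    "GC%": ("assembly_stats", "gc_percent"),
    "Ann Status": ("annotation_info", "status"),
}


def genome_row(report):
    """Flatten one genome report into a row for the genomes page."""
    return {column: dig(report, *path) for column, path in GENOME_COLUMNS.items()}
```

Keeping the paths in one table means adding or renaming a column only touches `GENOME_COLUMNS`.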

NoopDog commented 2 weeks ago

Ok thx @nekrut we will start on this and collect the tables from NCBI...

NoopDog commented 2 weeks ago

Also link to UCSC genome browser in the genome file.

d-callan commented 1 week ago

Sry, not sure this is the right place for this comment.. but were it me I'd seriously consider adding some kinetoplastids to that list of initial taxa.

d-callan commented 1 week ago

T. Cruzi T. Brucei Leish major Leish donovoni Leish brazilensis

Those are the ones coming to me off the top of my head, though I feel like that's maybe missing a big leish species or two. I might not have the spelling quite right either.. it'd give you Chagas, African sleeping sickness and iirc all three forms of leish though I need to double check that. Considering the popularity of tritrypdb and the impact of these diseases, these species would be a very notable omission.

Also, pretty sure we now have a few locally acquired cases of mucosal leish in Texas, as the sandfly habitat expands, so there's 'local' relevance.. thanks global warming

nekrut commented 6 days ago

Here is the initial set of species: https://docs.google.com/spreadsheets/d/1Gg9sw2Qw765tOx2To53XkTAn-RAMiBtqYrfItlLXXrc/edit?usp=sharing

(replaces #153 )

hunterckx commented 6 days ago

@nekrut Question -- how can we map the genomes returned by NCBI to the UCSC Browser URLs specified in assemblyList.json? Previously we matched Genome Version/Assembly ID from this genomes spreadsheet with either genBank or refSeq from the assembly list, but I'm not familiar enough with what the fields mean to determine which ID(s) from the NCBI API would be necessary to match with the ones in the assembly list.

Thanks!

nekrut commented 6 days ago

Good point. They need to be built first. I will initiate the process over the weekend. This can happen very quickly, but for now let's not link them to UCSC.