galaxyproject / brc-analytics


Rendering filament pages using NCBI dataset API #157

Open nekrut opened 2 weeks ago

nekrut commented 2 weeks ago

This issue illustrates how the NCBI Datasets API can be used to generate the JSON blobs necessary for rendering filament pages (https://github.com/galaxyproject/brc-analytics/issues/130).

Non viral data

The initial set of taxa will be limited to these species: https://docs.google.com/spreadsheets/d/1Gg9sw2Qw765tOx2To53XkTAn-RAMiBtqYrfItlLXXrc/edit?usp=sharing

List view

Image

The following API call is used:

curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2/taxonomy/dataset_report" \
 -H 'accept: application/json' \
 -H 'content-type: application/json' \
 -d '{"taxons":["Plasmodium falciparum","Plasmodium vivax","Plasmodium yoelii","Plasmodium vinckei","Culex pipiens","Anopheles gambiae","Toxoplasma gondii","Mycobacterium tuberculosis","Coccidioides posadasii","Coccidioides immitis"],"children":false,"ranks":["genus"]}' 
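For reference, the same call can be sketched in Python using only the standard library. The function names `build_taxonomy_request` and `fetch_taxonomy_report` are illustrative, not part of any NCBI client; the URL, headers, and payload are copied from the curl command above.

```python
import json
import urllib.request

TAXONOMY_REPORT_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/taxonomy/dataset_report"

# Species names taken verbatim from the curl payload above.
TAXA = [
    "Plasmodium falciparum",
    "Plasmodium vivax",
    "Plasmodium yoelii",
    "Plasmodium vinckei",
    "Culex pipiens",
    "Anopheles gambiae",
    "Toxoplasma gondii",
    "Mycobacterium tuberculosis",
    "Coccidioides posadasii",
    "Coccidioides immitis",
]


def build_taxonomy_request(taxa):
    """Build (but do not send) the POST request equivalent to the curl call."""
    payload = {"taxons": list(taxa), "children": False, "ranks": ["genus"]}
    return urllib.request.Request(
        TAXONOMY_REPORT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "content-type": "application/json"},
        method="POST",
    )


def fetch_taxonomy_report(taxa, timeout=30):
    """Send the request and decode the JSON response (needs network access)."""
    request = build_taxonomy_request(taxa)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_taxonomy_report(TAXA)` should return a dictionary with `reports` and `total_count` keys. Building the request separately from sending it keeps the payload easy to inspect without touching the network.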

This generates the following response:

{
  "reports": [
    {
      "taxonomy": {
        "tax_id": 7165,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Anopheles gambiae",
          "authority": "Giles, 1902"
        },
        "curator_common_name": "African malaria mosquito",
        "group_name": "mosquitos",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Metazoa",
            "id": 33208
          },
          "phylum": {
            "name": "Arthropoda",
            "id": 6656
          },
          "class": {
            "name": "Insecta",
            "id": 50557
          },
          "order": {
            "name": "Diptera",
            "id": 7147
          },
          "family": {
            "name": "Culicidae",
            "id": 7157
          },
          "genus": {
            "name": "Anopheles",
            "id": 7164
          },
          "species": {
            "name": "Anopheles gambiae",
            "id": 7165
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          33208,
          6072,
          33213,
          33317,
          1206794,
          88770,
          6656,
          197563,
          197562,
          6960,
          50557,
          85512,
          7496,
          33340,
          33392,
          7147,
          7148,
          43786,
          41827,
          7157,
          43816,
          7164,
          44534,
          44537,
          44542
        ],
        "children": [
          180454
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 7
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 15164
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 422
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 615
          },
          {
            "type": "COUNT_TYPE_snRNA",
            "count": 27
          },
          {
            "type": "COUNT_TYPE_snoRNA",
            "count": 11
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 12518
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1209
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Anopheles gambiae"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5501,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Coccidioides immitis",
          "authority": "G.W. Stiles, 1896"
        },
        "group_name": "ascomycete fungi",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Fungi",
            "id": 4751
          },
          "phylum": {
            "name": "Ascomycota",
            "id": 4890
          },
          "class": {
            "name": "Eurotiomycetes",
            "id": 147545
          },
          "order": {
            "name": "Onygenales",
            "id": 33183
          },
          "family": {
            "name": "Onygenaceae",
            "id": 33184
          },
          "genus": {
            "name": "Coccidioides",
            "id": 5500
          },
          "species": {
            "name": "Coccidioides immitis",
            "id": 5501
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          4751,
          451864,
          4890,
          716545,
          147538,
          716546,
          147545,
          451871,
          33183,
          33184,
          5500
        ],
        "children": [
          246410,
          454286,
          404692,
          396776
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 5
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 9974
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 147
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 29
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 9797
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Coccidioides immitis"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 199306,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Coccidioides posadasii",
          "authority": "M.C. Fisher, G.L. Koenig, T.J. White & J.W. Taylor, 2002"
        },
        "group_name": "ascomycete fungi",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Fungi",
            "id": 4751
          },
          "phylum": {
            "name": "Ascomycota",
            "id": 4890
          },
          "class": {
            "name": "Eurotiomycetes",
            "id": 147545
          },
          "order": {
            "name": "Onygenales",
            "id": 33183
          },
          "family": {
            "name": "Onygenaceae",
            "id": 33184
          },
          "genus": {
            "name": "Coccidioides",
            "id": 5500
          },
          "species": {
            "name": "Coccidioides posadasii",
            "id": 199306
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          4751,
          451864,
          4890,
          716545,
          147538,
          716546,
          147545,
          451871,
          33183,
          33184,
          5500
        ],
        "children": [
          443226,
          469471
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 13
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 8510
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 163
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 2
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 8342
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Coccidioides posadasii"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 7175,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Culex pipiens",
          "authority": "Linnaeus, 1758"
        },
        "curator_common_name": "northern house mosquito",
        "group_name": "mosquitos",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "kingdom": {
            "name": "Metazoa",
            "id": 33208
          },
          "phylum": {
            "name": "Arthropoda",
            "id": 6656
          },
          "class": {
            "name": "Insecta",
            "id": 50557
          },
          "order": {
            "name": "Diptera",
            "id": 7147
          },
          "family": {
            "name": "Culicidae",
            "id": 7157
          },
          "genus": {
            "name": "Culex",
            "id": 7174
          },
          "species": {
            "name": "Culex pipiens",
            "id": 7175
          }
        },
        "parents": [
          1,
          131567,
          2759,
          33154,
          33208,
          6072,
          33213,
          33317,
          1206794,
          88770,
          6656,
          197563,
          197562,
          6960,
          50557,
          85512,
          7496,
          33340,
          33392,
          7147,
          7148,
          43786,
          41827,
          7157,
          43817,
          53550,
          7174,
          53527,
          518105
        ],
        "children": [
          1833972,
          38569,
          42434,
          233155
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 5
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 19673
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 686
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 155
          },
          {
            "type": "COUNT_TYPE_snRNA",
            "count": 58
          },
          {
            "type": "COUNT_TYPE_snoRNA",
            "count": 9
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 16298
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 1620
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Culex pipiens"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 1773,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Mycobacterium tuberculosis",
          "authority": "(Zopf 1883) Lehmann and Neumann 1896 (Approved Lists 1980)",
          "basionym": {
            "name": "\"Bacterium tuberculosis\"",
            "authority": "Zopf 1883",
            "notes": [
              {
                "name": "Effective Name",
                "note": "This is an effectively published name.",
                "note_classifier": "effective_name"
              }
            ]
          }
        },
        "group_name": "high G+C Gram-positive bacteria",
        "has_type_material": true,
        "classification": {
          "superkingdom": {
            "name": "Bacteria",
            "id": 2
          },
          "kingdom": {
            "name": "Bacillati",
            "id": 1783272
          },
          "phylum": {
            "name": "Actinomycetota",
            "id": 201174
          },
          "class": {
            "name": "Actinomycetes",
            "id": 1760
          },
          "order": {
            "name": "Mycobacteriales",
            "id": 85007
          },
          "family": {
            "name": "Mycobacteriaceae",
            "id": 1762
          },
          "genus": {
            "name": "Mycobacterium",
            "id": 1763
          },
          "species": {
            "name": "Mycobacterium tuberculosis",
            "id": 1773
          }
        },
        "parents": [
          1,
          131567,
          2,
          1783272,
          201174,
          1760,
          85007,
          1762,
          1763,
          77643
        ],
        "children": [
          1427330,
          1427329
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 7819
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 4008
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 45
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 3
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 3906
          },
          {
            "type": "COUNT_TYPE_miscRNA",
            "count": 2
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 20
          },
          {
            "type": "COUNT_TYPE_OTHER",
            "count": 2
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Mycobacterium tuberculosis"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5833,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium falciparum"
        },
        "curator_common_name": "malaria parasite P. falciparum",
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium falciparum",
            "id": 5833
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418107
        ],
        "children": [
          478864,
          1036723
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 67
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5618
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 45
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 28
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5285
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 102
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium falciparum"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5860,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium vinckei",
          "authority": "(Rodhain, 1952)"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium vinckei",
            "id": 5860
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418101
        ],
        "children": [
          54757,
          138298,
          138297,
          119398
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 10
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5147
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 67
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 11
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5050
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium vinckei"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5855,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium vivax",
          "authority": "(Grassi & Feletti, 1890)"
        },
        "curator_common_name": "malaria parasite P. vivax",
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium vivax",
            "id": 5855
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418103
        ],
        "children": [
          31273,
          126793,
          1035514,
          1035515,
          882766,
          1077284,
          1033975
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 19
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 5513
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 44
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 22
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 5395
          },
          {
            "type": "COUNT_TYPE_miscRNA",
            "count": 10
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium vivax"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5861,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Plasmodium yoelii"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Aconoidasida",
            "id": 422676
          },
          "order": {
            "name": "Haemosporida",
            "id": 5819
          },
          "family": {
            "name": "Plasmodiidae",
            "id": 1639119
          },
          "genus": {
            "name": "Plasmodium",
            "id": 5820
          },
          "species": {
            "name": "Plasmodium yoelii",
            "id": 5861
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          422676,
          5819,
          1639119,
          5820,
          418101
        ],
        "children": [
          73239,
          1050261,
          31274,
          283801,
          1323249,
          1050262
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 15
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 6233
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 52
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 39
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 6037
          },
          {
            "type": "COUNT_TYPE_ncRNA",
            "count": 47
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Plasmodium yoelii"
      ]
    },
    {
      "taxonomy": {
        "tax_id": 5811,
        "rank": "SPECIES",
        "current_scientific_name": {
          "name": "Toxoplasma gondii"
        },
        "group_name": "apicomplexans",
        "classification": {
          "superkingdom": {
            "name": "Eukaryota",
            "id": 2759
          },
          "phylum": {
            "name": "Apicomplexa",
            "id": 5794
          },
          "class": {
            "name": "Conoidasida",
            "id": 1280412
          },
          "order": {
            "name": "Eucoccidiorida",
            "id": 75739
          },
          "family": {
            "name": "Sarcocystidae",
            "id": 5809
          },
          "genus": {
            "name": "Toxoplasma",
            "id": 5810
          },
          "species": {
            "name": "Toxoplasma gondii",
            "id": 5811
          }
        },
        "parents": [
          1,
          131567,
          2759,
          2698737,
          33630,
          5794,
          1280412,
          5796,
          75739,
          423054,
          5809,
          5810
        ],
        "children": [
          933077,
          398031
        ],
        "counts": [
          {
            "type": "COUNT_TYPE_ASSEMBLY",
            "count": 29
          },
          {
            "type": "COUNT_TYPE_GENE",
            "count": 8925
          },
          {
            "type": "COUNT_TYPE_tRNA",
            "count": 183
          },
          {
            "type": "COUNT_TYPE_rRNA",
            "count": 424
          },
          {
            "type": "COUNT_TYPE_PROTEIN_CODING",
            "count": 8318
          }
        ],
        "genomic_moltype": "dsDNA",
        "current_scientific_name_is_formal": true
      },
      "query": [
        "Toxoplasma gondii"
      ]
    }
  ],
  "total_count": 10
}

From this response we would like to render the following fields on a page (only two rows are shown):

| [ ] | Taxon | TaxId | # Assemblies | Tags |
| --- | --- | --- | --- | --- |
| [ ] | Anopheles gambiae | 7165 | 7 | Vector |
| [ ] | Coccidioides immitis | 5501 | 5 | Fungi |

These are populated from:
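A minimal sketch of this mapping, assuming the response shape shown above: the taxon name is read from `taxonomy.current_scientific_name.name`, the TaxId from `taxonomy.tax_id`, and the assembly count from the `counts` entry whose `type` is `COUNT_TYPE_ASSEMBLY`. The tag values ("Vector", "Fungi") do not appear verbatim in the response, so the `tags_by_tax_id` lookup below is a hypothetical stand-in for wherever the tags actually come from.

```python
def assembly_count(taxonomy):
    """Return the COUNT_TYPE_ASSEMBLY count, or 0 if the entry is absent."""
    for entry in taxonomy.get("counts", []):
        if entry.get("type") == "COUNT_TYPE_ASSEMBLY":
            return entry.get("count", 0)
    return 0


def list_view_rows(response, tags_by_tax_id=None):
    """Flatten a taxonomy dataset_report response into list-view rows."""
    # tags_by_tax_id is a hypothetical lookup; the tag source is an assumption.
    tags_by_tax_id = tags_by_tax_id or {}
    rows = []
    for report in response.get("reports", []):
        taxonomy = report["taxonomy"]
        tax_id = taxonomy["tax_id"]
        rows.append(
            {
                "taxon": taxonomy["current_scientific_name"]["name"],
                "tax_id": tax_id,
                "assemblies": assembly_count(taxonomy),
                "tags": tags_by_tax_id.get(tax_id, ""),
            }
        )
    return rows


# Trimmed copy of the first report from the response above.
sample = {
    "reports": [
        {
            "taxonomy": {
                "tax_id": 7165,
                "current_scientific_name": {"name": "Anopheles gambiae"},
                "counts": [
                    {"type": "COUNT_TYPE_ASSEMBLY", "count": 7},
                    {"type": "COUNT_TYPE_GENE", "count": 15164},
                ],
            }
        }
    ]
}

rows = list_view_rows(sample, tags_by_tax_id={7165: "Vector"})
```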

Genomes page

image

Now suppose that on the previous page a user checked both the Anopheles gambiae and Coccidioides immitis checkboxes and clicked the "Go to Genomes" button.

This is equivalent to issuing the following GET request:

https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/7165%2C5501/dataset_report?filters.assembly_source=refseq&filters.has_annotation=true&filters.exclude_paired_reports=true&filters.exclude_atypical=true&filters.assembly_level=scaffold&filters.assembly_level=chromosome&filters.assembly_level=complete_genome
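The URL above could be assembled from the selected TaxIds roughly as follows. This is a sketch using the standard library only; `genome_report_url` is an illustrative name, and the filter values are copied from the request above rather than taken from any configuration.

```python
from urllib.parse import quote, urlencode

GENOME_API_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon"

# Filters copied from the GET request above; repeated keys are intentional.
GENOME_FILTERS = [
    ("filters.assembly_source", "refseq"),
    ("filters.has_annotation", "true"),
    ("filters.exclude_paired_reports", "true"),
    ("filters.exclude_atypical", "true"),
    ("filters.assembly_level", "scaffold"),
    ("filters.assembly_level", "chromosome"),
    ("filters.assembly_level", "complete_genome"),
]


def genome_report_url(tax_ids):
    """Build the genome dataset_report URL for the selected TaxIds."""
    # The TaxIds are joined with commas, and the comma is percent-encoded (%2C).
    taxon_path = quote(",".join(str(tax_id) for tax_id in tax_ids), safe="")
    return f"{GENOME_API_BASE}/{taxon_path}/dataset_report?{urlencode(GENOME_FILTERS)}"


url = genome_report_url([7165, 5501])
```

Passing the filters as a list of pairs rather than a dictionary is what allows `filters.assembly_level` to appear three times.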

The response will be rendered as the following genomes page:

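| [ ] | Taxon | TaxId | Accession | IsRef | Level | # Chr | Len | # Scaffolds | Scaffold N50 | Scaffold L50 | Coverage | GC% | Ann Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [ ] | Anopheles gambiae | 7165 | GCF_943734735.2 | Yes | Chromosome | 3 | 264451381 | 190 | 99149756 | 2 | 54.0x | 44.5 | Full annotation |

NoopDog commented 2 weeks ago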

Ok thx @nekrut we will start on this and collect the tables from NCBI...

NoopDog commented 2 weeks ago

Also link to UCSC genome browser in the genome file.

d-callan commented 1 week ago

Sry, not sure this is the right place for this comment.. but were it me I'd seriously consider adding some kinetoplastids to that list of initial taxa.

d-callan commented 1 week ago

T. Cruzi, T. Brucei, Leish major, Leish donovoni, Leish brazilensis

Those are the ones coming to me off the top of my head, though I feel like that's maybe missing a big leish species or two. I might not have the spelling quite right either.. it'd give you Chagas, African sleeping sickness and iirc all three forms of leish though I need to double check that. Considering the popularity of tritrypdb and the impact of these diseases, these species would be a very notable omission.

Also, pretty sure we now have a few locally acquired cases of mucosal leish in Texas, as the sandfly habitat expands, so there's 'local' relevance.. thanks global warming

nekrut commented 6 days ago

Here is the initial set of species: https://docs.google.com/spreadsheets/d/1Gg9sw2Qw765tOx2To53XkTAn-RAMiBtqYrfItlLXXrc/edit?usp=sharing

(replaces #153)

hunterckx commented 6 days ago

@nekrut Question -- how can we map the genomes returned by NCBI to the UCSC Browser URLs specified in assemblyList.json? Previously we matched Genome Version/Assembly ID from this genomes spreadsheet with either genBank or refSeq from the assembly list, but I'm not familiar enough with what the fields mean to determine which ID(s) from the NCBI API would be necessary to match with the ones in the assembly list.

Thanks!

nekrut commented 6 days ago

> @nekrut Question -- how can we map the genomes returned by NCBI to the UCSC Browser URLs specified in assemblyList.json? Previously we matched Genome Version/Assembly ID from this genomes spreadsheet with either genBank or refSeq from the assembly list, but I'm not familiar enough with what the fields mean to determine which ID(s) from the NCBI API would be necessary to match with the ones in the assembly list.
>
> Thanks!

Good point. They need to be built first. I will initiate the process over the weekend. This can happen very quickly, but for now let's not link them to UCSC.