AllenInstitute / datacube

human MTG transcriptomics service #99

Closed dyf closed 5 years ago

dyf commented 6 years ago

Data is here:

http://celltypes.brain-map.org/api/v2/well_known_file_download/694416044

This is a .zip file with a sample matrix and separate CSVs for row and column metadata.
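For anyone poking at the download locally, here is a minimal sketch of reading the CSV metadata out of the archive with the Python standard library. The file names inside the zip are not listed in this thread, so the sketch discovers them rather than hard-coding any; the local path in the usage comment is hypothetical.

```python
import csv
import io
import zipfile


def csv_members(archive):
    """List the names of the CSV files contained in an open ZipFile."""
    return [n for n in archive.namelist() if n.lower().endswith(".csv")]


def read_csv_member(archive, member_name):
    """Return the rows of one CSV file inside an open ZipFile as dicts."""
    with archive.open(member_name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


# Usage against a local copy of the download (hypothetical local path):
# with zipfile.ZipFile("human_mtg_download.zip") as archive:
#     for name in csv_members(archive):
#         rows = read_csv_member(archive, name)
#         print(name, len(rows))
```

The expression matrix itself is large, so loading it row by row like this is probably only sensible for the row and column metadata.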

The idea is that we can use this data set to prototype transcriptomics UI components, and ultimately add features to datacube related to aggregation.

Data is visualized here: http://celltypes.brain-map.org/rnaseq/human

chrisbarber commented 5 years ago

https://github.com/AllenInstitute/datacube/commit/55dc5922e05b21b4568979985a31026ae1531df0

chrisbarber commented 5 years ago

196345ba230131f41c9e57f2a367c0e682b4676e

I've finished adding foreign keys to this dataset wherever the associated objects are available. They are not available for things like facs/rseq/cell_prep objects, which exist only in lims.

Here is a listing of how things map to warehouse and/or lims for reference.

<xarray.Dataset>
Dimensions:                        (age_id: 15928, gene: 50281, nucleus: 15928)
Coordinates:
  * nucleus                        (nucleus) int64 556012415 556012410 ... -> specimens.id
  * gene                           (gene) int64 353007 353008 353009 353010 ... -> genes.entrez_id where organism_id = 1 and reference_genome_id = 486794203
Data variables:
    age_days                       (nucleus) object '19710' '19710' '19710' ... -> ages.days where organism_id = 1
    brain_hemisphere               (nucleus) object 'L' 'L' 'L' 'L' 'L' 'L' ... -> hemispheres.symbol
    brain_region                   (nucleus) object 'MTG' 'MTG' 'MTG' 'MTG' ... -> structures.acronym
    brain_subregion                (nucleus) object 'L5' 'L5' 'L5' 'L5' 'L5' ... -> structures.acronym where ontology_id = 573949207
    chromosome                     (gene) object '6' '6' '6' '6' '3' '19' ... -> chromosomes.name where organism_id = 1
    class                          (nucleus) object 'GABAergic' ...
    cluster                        (nucleus) object 'Inh L4-6 SST B3GAT2' ...
    complexity_cg                  (nucleus) float64 0.3166 0.288 0.2809 ...
    donor                          (nucleus) object 'H200.1030' 'H200.1030' ... -> donors.external_donor_name
    exon_expression                (gene, nucleus) int32 0 0 0 0 0 0 0 0 0 0 ...
    facs_container                 (nucleus) object 'F1S4_160106_001' ... -> facs_plate_templates.name
    facs_date                      (nucleus) object '1/6/2016' '1/6/2016' ...
    facs_sort_criteria             (nucleus) object 'NeuN-positive' ... -> facs_population_plans.name
    gene_name                      (gene) object 'HLA complex group 26 (non-protein coding) pseudogene' ...
    gene_symbol                    (gene) object '3.8-1.2' '3.8-1.3' ...
    genes_detected_cpm_criterion   (nucleus) int64 8635 11697 12138 12191 ...
    genes_detected_fpkm_criterion  (nucleus) int64 5253 8246 8467 8287 6922 ...
    intron_expression              (gene, nucleus) int32 0 0 0 0 0 0 0 0 0 0 ...
    library_prep_avg_size_bp       (nucleus) int64 429 448 402 399 421 454 ...
    library_prep_set               (nucleus) object 'L8S4_160406_02' ... -> rseq_library_prep_sets.name
    mouse_homologenes              (gene) object '' '' '' '' '' 'A1bg' '' ... -> gene.gene_symbol where organism_id = 2 and reference_genome_id = 486545752
    organism                       (nucleus) object 'Homo Sapiens' ... -> organisms.name
    percent_aligned_reads_total    (nucleus) float64 92.42 94.0 92.87 90.2 ...
    percent_ecoli_reads            (nucleus) float64 0.01854 0.007265 ...
    percent_exon_reads             (nucleus) float64 45.08 34.92 29.98 30.23 ...
    percent_intergenic_reads       (nucleus) float64 11.86 14.16 15.11 14.77 ...
    percent_intron_reads           (nucleus) float64 43.06 50.92 54.91 55.01 ...
    percent_mt_exon_reads          (nucleus) float64 0.09005 0.05247 0.04295 ...
    percent_reads_unique           (nucleus) float64 85.57 87.62 86.45 84.5 ...
    percent_rrna_reads             (nucleus) float64 0.0 0.0 3.7e-05 ...
    percent_synth_reads            (nucleus) float64 0.007535 0.00279 ...
    rna_amplification_set          (nucleus) object 'A8S4_160401_02' ... -> rna_amplification_sets.name
    sample_name                    (nucleus) object 'F1S4_160106_001_B01' ... -> facs_well_templates.name
    sample_type                    (nucleus) object 'Nuclei' 'Nuclei' ... -> cell_prep_sample_type.name
    seq_batch                      (nucleus) object 'R8S4-160411-H' ... -> rseq_tube_sets.name
    seq_name                       (nucleus) object 'LS-15051_S02_E1-50' ... -> rseq_experiment_component
    seq_tube                       (nucleus) object 'LS-15051' 'LS-15051' ... -> rseq_tubes.name
    sex                            (nucleus) object 'M' 'M' 'M' 'M' 'M' 'M' ... -> genders.name
    total_reads                    (nucleus) int64 2572946 2755839 2701064 ...

cc: @gautacharya

chrisbarber commented 5 years ago

These three calls replicate the information found in the plots on the landing page of http://celltypes.brain-map.org/rnaseq/human under "Navigator Overview" > "Dataset Overview". I spot-checked, and the numbers match up.

curl -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster", "donor"], "agg_func":"size", "sort":["cluster", "donor"], "ascending":[true,true]}}' http://devdatacube:8080/call

curl -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster", "brain_subregion"], "agg_func":"size", "sort":["cluster", "brain_subregion"], "ascending":[true,true]}}' http://devdatacube:8080/call

curl -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster"], "agg_func":"size", "sort":["cluster"], "ascending":[true]}}' http://devdatacube:8080/call
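The three calls above differ only in the groupby keys, so a client could build them from one helper. Below is a rough Python sketch of that, using only the standard library. The procedure name, kwargs, and dev URL are copied from the curl commands; the helper names `groupby_size_payload` and `call_datacube` are made up for illustration, and the assumption that the response body is JSON is mine, not something shown in this thread.

```python
import json
import urllib.request

PROCEDURE = "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics"
CALL_URL = "http://devdatacube:8080/call"  # dev endpoint quoted in this thread


def groupby_size_payload(groupby, filters=None):
    """Build the JSON body for a count-per-group call, sorted by the group keys."""
    kwargs = {
        "field": "nucleus",
        "groupby": list(groupby),
        "agg_func": "size",
        "sort": list(groupby),
        "ascending": [True] * len(groupby),
    }
    if filters:
        kwargs["filters"] = list(filters)
    return {"procedure": PROCEDURE, "args": [], "kwargs": kwargs}


def call_datacube(payload, url=CALL_URL):
    """POST the payload as JSON; assumes the service answers with a JSON body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# The same three requests as the curl commands, built but not sent.
payloads = [
    groupby_size_payload(["cluster", "donor"]),
    groupby_size_payload(["cluster", "brain_subregion"]),
    groupby_size_payload(["cluster"]),
]
```

Calling `call_datacube(payloads[0])` needs network access to the dev tier, so it is left out of the snippet itself.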

Not sure where the client is going to get the information for drawing the cluster dendrogram.

chrisbarber commented 5 years ago

Last commit makes the above calls much faster; just deployed to devdatacube. fyi @gautacharya

chrisbarber commented 5 years ago

Here is an example of applying filters on top of a groupby:

curl -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster", "brain_subregion"], "agg_func":"size", "sort":["cluster", "brain_subregion"], "ascending":[true,true], "filters":[{"field": "brain_hemisphere", "op": "!=", "value": "R"}]}}' http://devdatacube/call

This is currently slow on dev and test, but this is addressed by https://github.com/AllenInstitute/datacube/commit/6d534da5e8414031a999e0b2c15b4c70b00208fc
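To make the intent of the filtered groupby concrete, here is a small local sketch of what I'd expect it to compute: drop the records that fail the filter, then count the remaining ones per group and sort by the group keys. This is only an illustration of the semantics on invented toy records; it is not the datacube implementation, and the real response format is not shown in this thread. Only the `!=` operator appears above, so that is the only one the sketch handles.

```python
import operator
from collections import Counter

# Only "!=" is used in this thread; other operators are not covered here.
OPS = {"!=": operator.ne}


def groupby_size(records, groupby, filters=()):
    """Count records per group after applying all filters, sorted by group key."""
    kept = [
        r for r in records
        if all(OPS[f["op"]](r[f["field"]], f["value"]) for f in filters)
    ]
    counts = Counter(tuple(r[key] for key in groupby) for r in kept)
    return sorted(counts.items())


# Toy records invented for illustration; not rows from the real dataset.
nuclei = [
    {"cluster": "X", "brain_subregion": "L5", "brain_hemisphere": "L"},
    {"cluster": "X", "brain_subregion": "L5", "brain_hemisphere": "R"},
    {"cluster": "X", "brain_subregion": "L4", "brain_hemisphere": "L"},
    {"cluster": "Y", "brain_subregion": "L5", "brain_hemisphere": "L"},
    {"cluster": "Y", "brain_subregion": "L5", "brain_hemisphere": "L"},
]

result = groupby_size(
    nuclei,
    groupby=["cluster", "brain_subregion"],
    filters=[{"field": "brain_hemisphere", "op": "!=", "value": "R"}],
)
# result == [(("X", "L4"), 1), (("X", "L5"), 1), (("Y", "L5"), 2)]
```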

chrisbarber commented 5 years ago

time curl -s -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster", "brain_subregion"], "agg_func":"size", "sort":["cluster", "brain_subregion"], "ascending":[true,true], "filters":[{"field": "brain_hemisphere", "op": "!=", "value": "R"}]}}' http://devdatacube/call > /dev/null

real    0m0.407s
user    0m0.004s
sys     0m0.005s

chrisbarber commented 5 years ago

This is feature-complete to my knowledge. Available on dev and test tiers and ready to deploy to higher stages if needed.