Closed by dyf 5 years ago
196345ba230131f41c9e57f2a367c0e682b4676e
I've finished adding foreign keys to this dataset wherever the associated objects are available. That is not the case for things like facs/rseq/cell_prep objects, which exist only in lims.
For reference, here is a listing of how each field maps to warehouse and/or lims.
<xarray.Dataset>
Dimensions: (age_id: 15928, gene: 50281, nucleus: 15928)
Coordinates:
* nucleus (nucleus) int64 556012415 556012410 ... -> specimens.id
* gene (gene) int64 353007 353008 353009 353010 ... -> genes.entrez_id where organism_id = 1 and reference_genome_id = 486794203
Data variables:
age_days (nucleus) object '19710' '19710' '19710' ... -> ages.days where organism_id = 1
brain_hemisphere (nucleus) object 'L' 'L' 'L' 'L' 'L' 'L' ... -> hemispheres.symbol
brain_region (nucleus) object 'MTG' 'MTG' 'MTG' 'MTG' ... -> structures.acronym
brain_subregion (nucleus) object 'L5' 'L5' 'L5' 'L5' 'L5' ... -> structures.acronym where ontology_id = 573949207
chromosome (gene) object '6' '6' '6' '6' '3' '19' ... -> chromosomes.name where organism_id = 1
class (nucleus) object 'GABAergic' ...
cluster (nucleus) object 'Inh L4-6 SST B3GAT2' ...
complexity_cg (nucleus) float64 0.3166 0.288 0.2809 ...
donor (nucleus) object 'H200.1030' 'H200.1030' ... -> donors.external_donor_name
exon_expression (gene, nucleus) int32 0 0 0 0 0 0 0 0 0 0 ...
facs_container (nucleus) object 'F1S4_160106_001' ... -> facs_plate_templates.name
facs_date (nucleus) object '1/6/2016' '1/6/2016' ...
facs_sort_criteria (nucleus) object 'NeuN-positive' ... -> facs_population_plans.name
gene_name (gene) object 'HLA complex group 26 (non-protein coding) pseudogene' ...
gene_symbol (gene) object '3.8-1.2' '3.8-1.3' ...
genes_detected_cpm_criterion (nucleus) int64 8635 11697 12138 12191 ...
genes_detected_fpkm_criterion (nucleus) int64 5253 8246 8467 8287 6922 ...
intron_expression (gene, nucleus) int32 0 0 0 0 0 0 0 0 0 0 ...
library_prep_avg_size_bp (nucleus) int64 429 448 402 399 421 454 ...
library_prep_set (nucleus) object 'L8S4_160406_02' ... -> rseq_library_prep_sets.name
mouse_homologenes (gene) object '' '' '' '' '' 'A1bg' '' ... -> gene.gene_symbol where organism_id = 2 and reference_genome_id = 486545752
organism (nucleus) object 'Homo Sapiens' ... -> organisms.name
percent_aligned_reads_total (nucleus) float64 92.42 94.0 92.87 90.2 ...
percent_ecoli_reads (nucleus) float64 0.01854 0.007265 ...
percent_exon_reads (nucleus) float64 45.08 34.92 29.98 30.23 ...
percent_intergenic_reads (nucleus) float64 11.86 14.16 15.11 14.77 ...
percent_intron_reads (nucleus) float64 43.06 50.92 54.91 55.01 ...
percent_mt_exon_reads (nucleus) float64 0.09005 0.05247 0.04295 ...
percent_reads_unique (nucleus) float64 85.57 87.62 86.45 84.5 ...
percent_rrna_reads (nucleus) float64 0.0 0.0 3.7e-05 ...
percent_synth_reads (nucleus) float64 0.007535 0.00279 ...
rna_amplification_set (nucleus) object 'A8S4_160401_02' ... -> rna_amplification_sets.name
sample_name (nucleus) object 'F1S4_160106_001_B01' ... -> facs_well_templates.name
sample_type (nucleus) object 'Nuclei' 'Nuclei' ... -> cell_prep_sample_type.name
seq_batch (nucleus) object 'R8S4-160411-H' ... -> rseq_tube_sets.name
seq_name (nucleus) object 'LS-15051_S02_E1-50' ... -> rseq_experiment_component
seq_tube (nucleus) object 'LS-15051' 'LS-15051' ... -> rseq_tubes.name
sex (nucleus) object 'M' 'M' 'M' 'M' 'M' 'M' ... -> genders.name
total_reads (nucleus) int64 2572946 2755839 2701064 ...
cc: @gautacharya
The three calls below replicate the information found in the plots on the landing page of http://celltypes.brain-map.org/rnaseq/human (under "Navigator Overview" > "Dataset Overview"). I spot-checked the results and the numbers match up.
curl -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster", "donor"], "agg_func":"size", "sort":["cluster", "donor"], "ascending":[true,true]}}' http://devdatacube:8080/call
curl -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster", "brain_subregion"], "agg_func":"size", "sort":["cluster", "brain_subregion"], "ascending":[true,true]}}' http://devdatacube:8080/call
curl -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster"], "agg_func":"size", "sort":["cluster"], "ascending":[true]}}' http://devdatacube:8080/call
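The three calls above differ only in their groupby fields, so here is a rough Python sketch of building the same request bodies programmatically. The helper names `build_groupby_request` and `post_call` are made up for illustration and are not part of datacube. `post_call` also assumes the endpoint returns JSON, which I have not confirmed here.

```python
import json
import urllib.request

PROCEDURE = "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics"


def build_groupby_request(groupby, field="nucleus", agg_func="size", filters=None):
    """Build the JSON body for a groupby call (hypothetical helper).

    Mirrors the curl payloads: results are sorted ascending by the groupby fields.
    """
    kwargs = {
        "field": field,
        "groupby": list(groupby),
        "agg_func": agg_func,
        "sort": list(groupby),
        "ascending": [True] * len(groupby),
    }
    if filters:
        kwargs["filters"] = list(filters)
    return {"procedure": PROCEDURE, "args": [], "kwargs": kwargs}


def post_call(payload, url="http://devdatacube:8080/call"):
    """POST the payload as JSON (needs network access to the dev tier).

    Assumption: the response body is JSON.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# The same three request bodies as the curl commands.
payloads = [
    build_groupby_request(["cluster", "donor"]),
    build_groupby_request(["cluster", "brain_subregion"]),
    build_groupby_request(["cluster"]),
]
```

Calling `post_call(payloads[0])` from a machine that can reach devdatacube should be equivalent to the first curl command.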
Not sure where the client is going to get the information for drawing the cluster dendrogram.
Last commit makes the above calls much faster; just deployed to devdatacube. fyi @gautacharya
Here is an example of applying filters on top of a groupby:
curl -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster", "brain_subregion"], "agg_func":"size", "sort":["cluster", "brain_subregion"], "ascending":[true,true], "filters":[{"field": "brain_hemisphere", "op": "!=", "value": "R"}]}}' http://devdatacube/call
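To spell out how I read the filtered call above: drop the nuclei whose brain_hemisphere is "R", then count the remaining nuclei per (cluster, brain_subregion) pair. Below is a small pure-Python sketch of that reading on toy records. The records, the `apply_filters` and `groupby_size` helpers, and the cluster names are all made up for illustration; this is not how the datacube server implements it.

```python
from collections import Counter
import operator

# Only "!=" appears in the call above; other operators are not assumed here.
OPS = {"!=": operator.ne}

# Toy records with made-up values, standing in for per-nucleus metadata.
nuclei = [
    {"cluster": "cluster_a", "brain_subregion": "L5", "brain_hemisphere": "L"},
    {"cluster": "cluster_a", "brain_subregion": "L5", "brain_hemisphere": "R"},
    {"cluster": "cluster_a", "brain_subregion": "L5", "brain_hemisphere": "L"},
    {"cluster": "cluster_a", "brain_subregion": "L4", "brain_hemisphere": "L"},
    {"cluster": "cluster_b", "brain_subregion": "L5", "brain_hemisphere": "R"},
    {"cluster": "cluster_b", "brain_subregion": "L5", "brain_hemisphere": "L"},
]


def apply_filters(records, filters):
    """Keep only the records that satisfy every filter."""
    return [
        record
        for record in records
        if all(OPS[f["op"]](record[f["field"]], f["value"]) for f in filters)
    ]


def groupby_size(records, groupby):
    """Count records per combination of groupby values, sorted ascending by key."""
    counts = Counter(tuple(record[key] for key in groupby) for record in records)
    return sorted(counts.items())


filters = [{"field": "brain_hemisphere", "op": "!=", "value": "R"}]
result = groupby_size(apply_filters(nuclei, filters), ["cluster", "brain_subregion"])
# result == [(("cluster_a", "L4"), 1), (("cluster_a", "L5"), 2), (("cluster_b", "L5"), 1)]
```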
This is currently slow on dev and test, but that is addressed by https://github.com/AllenInstitute/datacube/commit/6d534da5e8414031a999e0b2c15b4c70b00208fc
time curl -s -H "Content-Type:application/json" -d '{"procedure": "org.brain-map.api.datacube.groupby.human_mtg_transcriptomics", "args": [], "kwargs": {"field":"nucleus", "groupby":["cluster", "brain_subregion"], "agg_func":"size", "sort":["cluster", "brain_subregion"], "ascending":[true,true], "filters":[{"field": "brain_hemisphere", "op": "!=", "value": "R"}]}}' http://devdatacube/call > /dev/null
real 0m0.407s
user 0m0.004s
sys 0m0.005s
To my knowledge this is feature-complete. It is available on the dev and test tiers and ready to deploy to higher stages if needed.
Data is here:
http://celltypes.brain-map.org/api/v2/well_known_file_download/694416044
This is a .zip file with a sample matrix and separate CSVs for row and column metadata.
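For anyone who wants to poke at the metadata quickly, here is a minimal sketch that reads every CSV member of a zip archive into a list of row dicts using only the standard library. The member names inside the real download are not listed here, so the sketch discovers them rather than assuming them. The in-memory archive and its file names below are made up for the demo, and `read_csv_members` is a hypothetical helper.

```python
import csv
import io
import zipfile


def read_csv_members(zip_source):
    """Return {member_name: list of row dicts} for every .csv file in the archive."""
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                with archive.open(name) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    tables[name] = list(csv.DictReader(text))
    return tables


# Demo on an in-memory archive; file names and contents are illustrative only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("columns_example.csv", "sample_name,sex\nF1S4_160106_001_B01,M\n")
    archive.writestr("readme_example.txt", "not a csv")
buffer.seek(0)
tables = read_csv_members(buffer)
```

If the expression matrix is itself one of the CSVs, loading it this way will likely be slow and memory-hungry; the sketch is aimed at the row and column metadata.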
The idea is that we can use this data set to prototype transcriptomics UI components, and ultimately add features to datacube related to aggregation.
Data is visualized here: http://celltypes.brain-map.org/rnaseq/human