dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

Biological data integration: private + GEO/SRA/ArrayExpress data #45

Open joehand opened 8 years ago

joehand commented 8 years ago

From @olgabot on June 18, 2014 5:58

Hello there, This is Olga Botvinnik and Mike Lovci (@mlovci), PhD students in Bioinformatics-type fields at UC-San Diego working on a biological data analysis package for "large" (for biology) datasets. I've looked through the repo a bit and have been wondering if this would help with our data woes.

We've been having trouble coming up with a way to store the data for our projects, which would hopefully hold both unpublished data and publicly available data through databases such as the USA's Gene Expression Omnibus and Europe's ArrayExpress. R's Bioconductor has a pretty nice schema for a single experiment, where each experiment has not only the assay data but also:

- pData: per-sample phenotype metadata
- fData: per-feature (gene/probe) metadata

However, this is only for one datatype at a time (plus it's in R and we prefer Python), and ultimately, we'd like to ask the killer biological questions which integrate all these data types at once. For example, we'd want to mix together gene expression and DNA mutation data and see what mutations lead to changes in gene expression, and right now the only way to do that is with hella data munging and lookups and crazy queries across data types.

What we'd like to have: For a single "biological study" which addresses some biological question (e.g. "how does mutation affect gene expression"?), be able to pull down the following, reproducibly:

An example of how wild and wacky these experiments can get is a similar package written in R (it's made only for the outputs of specific bioinformatics programs, and is not open-source) which has this data schema: http://compbio.mit.edu/cummeRbund/images/plots/cuffData_schema.pdf

Ideally, this would also include hooks for auto-downloading and generating compatible datasets from gene expression data deposited into GEO and ArrayExpress (mentioned above), and across different species, so we could compare our data to published data in human, and also look at studies from mice (in vivo, live mice) vs cells in humans (in vitro, usually samples taken from cutting off a tiny piece of skin from a person).

Additionally, I got some funding to do this kind of thing so email me (obotvinn@ucsd.edu) if you think our use case is applicable to dat.

Copied from original issue: maxogden/dat#129

joehand commented 8 years ago

From @jbenet on June 18, 2014 22:35

Hello @olgabot!

It's great to hear from you both! You are precisely the sort of users who motivate our work on these tools. We want to make working with scientific data significantly easier and more powerful than the current tools allow. Also, dat is designed to be built upon, so if the dat core tool doesn't solve your particular use case exactly, it's likely that a more specialized tool can be built on top of dat to provide the extra features.

Inlined below:

We've been having trouble coming up with a way to store the data for our projects, which would hopefully hold both unpublished data and publicly available data

Dat can be used both publicly and privately. While we're super focused on Open Data and want to make publishing data to the world hyper easy, we also recognize that scientists in some fields today are compelled to keep their data private before publication. Dat doesn't force either direction; it's more of a protocol that governs how data is stored, versioned, and transmitted. Like Git, you can always use dat with SSL (we haven't built this into the tool yet, but at some point we will).

Also, I should note that dat's API and UI are likely to have authentication built in at some point too. So you'll be able to log in to an admin console on the web and specify whether your dat instance is read/write or read-only, or even set more granular permissions. All of this is TBD, but hearing your precise use cases now (around data sharing, public and private) will help us design better workflows that address your concerns.

where each experiment has not only data but also: ... pData ... fData ...

This sounds like a perfect use case for dat. I imagine that each data, pData, and fData have different schemas. For now, the easiest way to do this is to either (a) have three separate dat instances, or (b) have record rows only fill in the columns for their subset. We're looking to address structured data in the future as well. You can follow along that discussion here: https://github.com/maxogden/dat/issues/101
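For concreteness, here's a minimal sketch of option (a), using only dat commands that appear elsewhere in this thread (the directory and file names are hypothetical, and each table is assumed to have been exported as JSON):

# one dat instance per table; file names are hypothetical
mkdir assay-data && cd assay-data
dat init
cat ../exprs.json | dat import --json
cd ..

mkdir pdata && cd pdata
dat init
cat ../pData.json | dat import --json
cd ..

# ...and the same again for fData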

ultimately, we'd like to ask the killer biological questions which integrate all these data types at once. For example, we'd want to mix together gene expression and DNA mutation data and see what mutations lead to changes in gene expression, and right now the only way to do that is with hella data munging and lookups and crazy queries across data types.

Yeah, ultimately either queries (on-demand record construction) or views (pre-built results of a query) are the way to do this. Dat will likely make all of this a lot easier once the right workflows emerge. For now, I can imagine either:

What we'd like to have: For a single "biological study" which addresses some biological question (e.g. "how does mutation affect gene expression"?), be able to pull down the following, reproducibly:

Yes! This is what dat will excel at early on. You'll be able to host a dat instance, say at:

http://dathub.org/olgabot/biostudy-mutation-affect-gene-expression

If you went to that site, you'd see an interface to your data. A version of: https://github.com/maxogden/dat-editor

Other people can download directly from there, or by simply running:

dat clone http://dathub.org/olgabot/biostudy-mutation-affect-gene-expression

They'll then have a complete working copy locally, and will be able to run queries themselves, fork the dataset, etc.

An example of how wild and wacky these experiments can get is a similar package written in R (made only for outputs of specific bioinformatics programs, and is not open-source) which has this data schema: http://compbio.mit.edu/cummeRbund/images/plots/cuffData_schema.pdf

... I'm so sorry. But yes, you'll be able to add all that data to dat instances (one or multiple, depending on your workflow, performance requirements, etc.).

Ideally, this would also include hooks for auto-downloading and generating compatible datasets from gene expression data deposited into GEO and ArrayExpress (mentioned above), and across different species, so we could compare our data to published data in human, and also look at studies from mice (in vivo, live mice) vs cells in humans (in vitro, usually samples taken from cutting off a tiny piece of skin from a person).

Yes! Our goal is to make all transformations, filters, queries, etc simply "data pipelines" that you can add as hooks to your dat instances. So, potentially, an interface like:

# (given mice-studies is a pipeline specified in the source dat)

# filter to mice studies only
dat clone http://dathub.org/olgabot/biostudy-mutation-affect-gene-expression --pipeline mice-studies

# or
dat transform mice-studies

Additionally, I got some funding to do this kind of thing so email me (obotvinn@ucsd.edu) if you think our use case is applicable to dat.

Your use case is definitely applicable. It would probably be good to set up a hangout with @maxogden, discuss what minimal workflow you'd benefit from right away, and see how far off that is. It may already be here! As our target users, your feedback will be super helpful in defining the interfaces, workflows, and semantics of the tools we're building. Do you use IRC? You can find us in #dat on freenode. Otherwise, we can email.

(Btw, I think I went to High School with @mlovci ?)

joehand commented 8 years ago

From @jbenet on June 18, 2014 22:37

Btw, a general note is that dat workflows are inspired by git's. So think distributed, local repo, etc. But tuned to handle all the operations you want to run on data (fetching only filtered subsets, applying transforms, running queries, etc.). Plus both CLI and web UIs :)
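To make the analogy concrete, here's a rough end-to-end sketch assembled from the commands mentioned in this thread (the study URL is the hypothetical one from above, and the JSON file name is made up):

# publisher: create a dat, load records, serve it over HTTP
dat init
cat study-records.json | dat import --json
dat serve

# collaborator: grab a complete working copy, git-clone style
dat clone http://dathub.org/olgabot/biostudy-mutation-affect-gene-expression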

joehand commented 8 years ago

From @mlovci on June 18, 2014 23:19

Is this the Juan batiz-benet that went to high school with me?


joehand commented 8 years ago

From @jbenet on June 18, 2014 23:25

(Btw, I think I went to High School with @mlovci ?)

@mlovci

Is this the Juan batiz-benet that went to high school with me?

yep! hi! :)

joehand commented 8 years ago

From @mlovci on June 18, 2014 23:26

99% sure it is. Hi Juan! Long time no see.


joehand commented 8 years ago

From @mlovci on June 18, 2014 23:27

oh, yes, now i see it...

small world. Github is so cool!


joehand commented 8 years ago

From @bmpvieira on June 20, 2014 0:32

Hi @olgabot and @mlovci,

I've been fetching metadata from NCBI in JSON and storing it in Dat. I've just released the module I wrote to do that. I've used it mostly on SRA, biosample, bioproject, assembly, taxonomy and pubmed, but it seems to also work with GEO DataSets (gds) and others.

It should be as easy as:

npm install dat bionode-ncbi -g
dat init
bionode-ncbi search gds Solenopsis | dat import --json
dat serve

Hope it will be useful, but please beware that it probably still has many bugs to be found, and if you find one please report it! :)
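The same pattern should work for the other databases mentioned above, for example (the search terms here are just placeholders):

bionode-ncbi search sra Solenopsis | dat import --json
bionode-ncbi search pubmed "RNA-seq" | dat import --json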

joehand commented 8 years ago

From @olgabot on June 20, 2014 1:03

Wow, this is REALLY awesome! Thank you so much! I'm not familiar with node or javascript. Could you show me the command for how to access the GEO dataset GSE48968?

joehand commented 8 years ago

From @bmpvieira on June 20, 2014 15:06

You can just do this in your terminal:

bionode-ncbi search gds GSE48968 > results

If you are on a Mac and have brew, you can install node with it and then run the commands from the previous comment to install bionode-ncbi and dat with npm.
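In other words, a minimal setup on a Mac might look like this (assuming Homebrew is already installed; these were the package names as of 2014):

brew install node                 # installs node and npm
npm install dat bionode-ncbi -g   # then run the commands from the earlier comment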

joehand commented 8 years ago

From @olgabot on June 20, 2014 16:34

Yes, I've installed node on my Mac via brew, and the npm install commands (from 2 messages ago) worked fine, but the bionode-ncbi command (from the previous message) got me this error:

Olgas-MacBook-Pro-1378:flotilla olga$ bionode-ncbi search gds GSE48968 > results

stream.js:94
      throw er; // Unhandled stream error in pipe.
            ^
Error: Unexpected "!" at position 1 in state START
    at Parser.proto.charError (/usr/local/lib/node_modules/bionode-ncbi/node_modules/JSONStream/node_modules/jsonparse/jsonparse.js:84:16)
    at Parser.proto.write (/usr/local/lib/node_modules/bionode-ncbi/node_modules/JSONStream/node_modules/jsonparse/jsonparse.js:112:23)
    at Stream.<anonymous> (/usr/local/lib/node_modules/bionode-ncbi/node_modules/JSONStream/index.js:22:12)
    at Stream.stream.write (/usr/local/lib/node_modules/bionode-ncbi/node_modules/JSONStream/node_modules/through/index.js:26:11)
    at Request.ondata (stream.js:51:26)
    at Request.EventEmitter.emit (events.js:95:17)
    at IncomingMessage.<anonymous> (/usr/local/lib/node_modules/bionode-ncbi/node_modules/request/request.js:932:12)
    at IncomingMessage.EventEmitter.emit (events.js:95:17)
    at IncomingMessage.<anonymous> (_stream_readable.js:746:14)
    at IncomingMessage.EventEmitter.emit (events.js:92:17)

joehand commented 8 years ago

From @bmpvieira on June 20, 2014 19:57

Yes, sorry, this seems to occur mostly with gds and to be related to the max number of items fetched from NCBI per request. I had already reduced it from 1000 to 500 and that seemed to have solved it, but now I've reduced it to 250. It shouldn't affect speed too much (need to test); it only means that we make more requests to NCBI and ask for fewer items each time.
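For context, the item count here corresponds to the standard retmax/retstart paging parameters of NCBI's E-utilities, which the module queries under the hood. Fetching 500 results with a page size of 250 simply means two requests instead of one, roughly like this (illustrative only; these are not the exact URLs the module builds):

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=GSE48968&retmode=json&retmax=250&retstart=0'
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=GSE48968&retmode=json&retmax=250&retstart=250'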

Please update bionode-ncbi and try again:

npm update bionode-ncbi -g