Bioconductor / GenomicDataCommons

Provide R access to the NCI Genomic Data Commons portal.
http://bioconductor.github.io/GenomicDataCommons/
Higher-level functionality than just data access using GenomicDataCommons package #40

Closed igordot closed 6 years ago

igordot commented 7 years ago

This isn't really an issue (or maybe a documentation issue). This package allows querying and downloading the GDC data. Does it stop there?

For example, TCGAbiolinks performs a similar function and can create a SummarizedExperiment or a data.frame from the downloaded data. Does GenomicDataCommons do something like that? Can it transform the downloaded files into some sort of a matrix structure?

seandavi commented 7 years ago

At this time (though this is likely to change), the GenomicDataCommons package is meant as infrastructure on which to build higher-level packages like TCGAbiolinks. Note that TCGAbiolinks is somewhat TCGA-centric. While the GDC contains mainly TCGA data right now, there are new datasets coming online that will differ significantly in annotation and metadata details.

All that said, I am more than happy to consider pull requests for such functionality. If you do something useful with the package, definitely let me know and I'll try to incorporate it.

igordot commented 7 years ago

Thank you for clarifying. It would be really nice if this were added, or even if you just added some suggestions on how best to deal with the downloaded files.

seandavi commented 7 years ago

What would you want the SummarizedExperiment to include? What types of metadata, in particular, would be the minimum in your mind?

igordot commented 7 years ago

I mostly care about the actual data (a matrix of samples and measurements). Every experiment type comes in a different format and it's not trivial to combine them, especially since some of the files aren't really read.table-friendly. For metadata, a GRanges of locations is probably most useful.
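As a rough illustration of the combining step described above, the sketch below merges per-sample, two-column tables into a single features-by-samples matrix using base R only. The file names, the column layout, and the helper read_one_sample() are made up for the example; they are not a GDC file format or part of GenomicDataCommons. Real GDC files differ by experiment type and usually need format-specific parsing.

```r
# Sketch only: synthetic per-sample files stand in for downloaded data.
# Assumed (hypothetical) layout: tab-separated, no header, feature id + value.
tmp_dir <- tempfile("fake_gdc_")
dir.create(tmp_dir)

samples <- list(
    sampleA = c(gene1 = 10, gene2 = 0, gene3 = 5),
    sampleB = c(gene1 = 7, gene3 = 2, gene4 = 1)
)
for (s in names(samples)) {
    write.table(
        data.frame(feature = names(samples[[s]]),
                   value = unname(samples[[s]])),
        file = file.path(tmp_dir, paste0(s, ".tsv")),
        sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE
    )
}

# Hypothetical helper: read one file into a named numeric vector.
read_one_sample <- function(path) {
    tab <- read.delim(path, header = FALSE,
                      col.names = c("feature", "value"),
                      stringsAsFactors = FALSE)
    setNames(as.numeric(tab$value), tab$feature)
}

paths <- list.files(tmp_dir, pattern = "\\.tsv$", full.names = TRUE)
per_sample <- lapply(paths, read_one_sample)
names(per_sample) <- sub("\\.tsv$", "", basename(paths))

# Use the union of features so that samples measured on different feature
# sets still line up; features missing from a sample become NA.
features <- sort(unique(unlist(lapply(per_sample, names))))
mat <- vapply(per_sample, function(v) unname(v[features]),
              numeric(length(features)))
rownames(mat) <- features
```

A matrix like this could then be wrapped in a container, for example with SummarizedExperiment(assays = list(values = mat)), with sample and feature annotation added separately.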

seandavi commented 7 years ago

Thanks, @igordot. I'm working on a solution for us. It will likely take a few days, but the idea will be to have a single or small number of functions that return meaningful data containers. I'm planning on minimal sample information going into these containers, as code for sample annotation may end up being fragile. Stay tuned and thanks for clarifying your interest.

igordot commented 7 years ago

That would be great.

Sample annotation comes in many different forms. It's hard to generalize. I wouldn't expect much at the beginning. I don't even know if it's ever possible to get that part right.

lawremi commented 7 years ago

Any progress on this? It would be nice to be able to download e.g. a TCGA dataset as a MultiAssayExperiment. Could happen in a different package, of course.

vjcitn commented 7 years ago

https://docs.google.com/spreadsheets/d/1Ih64DDS5mqDlYFzDyCY9HAUnxvI1b6hapKP_akFuNPY/edit#gid=0

That's a listing of serialized MAEs for TCGA sets. Is that what you have in mind?

lawremi commented 7 years ago

Yes, that's essentially what I want, although I'm looking for a way to retrieve a dataset as part of an automatic testing procedure (caching using BiocFileCache, although it would be nice if the data access layer took care of that for me).

vjcitn commented 7 years ago

Would it help to add a function to MultiAssayExperiment that would do this?

I don't know if these RDS files are or could be part of ExperimentHub, or could be accessed through extensions to ExperimentHub. This "pattern" of remote RDS, where the data modeling has been taken care of by Bioconductor but the entity is not distributed in a package, may be worthy of more discussion. Specifically, can we have the help and testing systems deal with these entities as if they are package constituents? Should we have a way of wrapping them in package-like interfaces to allow reuse of testing and help facilities?

There are lots of integrity checks that go on with Bioconductor operations (optional validObject upon subsetting for example) that are easy and implicit when we are in-memory, in-package, but involve more work for alternatives that are getting more attractive with large volumes.

lawremi commented 7 years ago

The package defining the data model should probably not implement specific ways of retrieving / loading the data. There are many of those. Rather, the data retrieval layer should provide data in standard forms. Perhaps this package is too low-level for that (although for a low-level package it provides a little too much UI in the form of shiny and magrittr), but there needs to be some package or set of packages that maps GDC to Bioc data structures. Like what rtracklayer does for UCSC and files.

So it depends on what this package aims to be. The vignette lists some "interesting" design decisions (e.g., S3-based and pipe-based) that suggest this package aims to be part of the general tidyverse, ROpenSci, etc, world, with the expectation that "proper" Bioconductor packages will extend it to make it useful. Maybe the original authors could comment on their intent.

lawremi commented 7 years ago

@vjcitn do you have code that generates MAE's using this API? Otherwise, I could write something. Either way, let's try to get something into a separate package that wraps this one, if that's OK with everyone.

vjcitn commented 7 years ago

No, I don't, and I have no indication that this has been attempted. The MAEs were generated in Levi Waldron's group, whom I've cc'd.

schifferl commented 7 years ago

Hello, just a note on how the MAE objects were generated – we used RTCGAToolbox as the starting point for our pipeline. I wrote it in the beginning but @LiNk-NY has since taken over. We intend to publish our work as an ExperimentHub package via Bioconductor. That said, I think it would be very cool to wrap the GDC package and use it to generate MAE objects. Happy to say more if that is indeed the goal.

seandavi commented 7 years ago

I have been meaning to work on this, but haven't gotten around to it. The TCGAbiolinks package is getting pretty good at this type of thing if you want to give that a try in the short run. I had in mind modular, pluggable functionality for pulling datasets in. BiocFileCache might be useful for local caching, but note that BiocFileCache is built around the concept of a remote resource and has baked in the assumption that the remote resource is available via a URL. This is mostly true, but the GDC API also allows non-URL-based access to data (via their data transfer tool). I haven't figured out how to deal with that situation.

I'll try to sketch out something this week for input from everyone.

schifferl commented 7 years ago

https://github.com/waldronlab/curatedTCGAData

mtmorgan commented 7 years ago

FWIW BiocFileCache can be used in a two-step process. First create a named resource: path = bfcnew(BiocFileCache(), "my-rsrc"). Then use the path as you would in any other file-based operation: x = arbitrary_stuff(); saveRDS(x, path), followed by readRDS(path) or, at a later date, readRDS(bfcrpath(BiocFileCache(), "my-rsrc")).
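Spelled out as a script, the two-step process might look like the sketch below. It assumes the BiocFileCache package is installed, uses a throwaway cache in a temporary directory rather than the default one, and substitutes a small placeholder data frame for the result of arbitrary_stuff(), which is not a real function.

```r
library(BiocFileCache)

# A throwaway cache in a temp directory keeps the sketch self-contained;
# BiocFileCache() with no arguments would use the default user-level cache.
bfc <- BiocFileCache(tempfile("bfc_"), ask = FALSE)

# Step 1: register a named resource. The return value is the file path
# reserved for it inside the cache.
path <- bfcnew(bfc, "my-rsrc")

# Step 2: treat the path like any other file. The data frame below is a
# placeholder for whatever object was retrieved by other means.
x <- data.frame(sample = c("s1", "s2"), value = c(1.5, 2.5))
saveRDS(x, path)

# Later: look the resource up by name and read it back.
y <- readRDS(bfcrpath(bfc, rnames = "my-rsrc"))
```

Because the object is written with saveRDS() rather than fetched by the cache, this route does not depend on the resource having a URL.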

lwaldron commented 7 years ago

We adopted RTCGAToolbox before this package or TCGAbiolinks was around, and it works just well enough that we haven't taken the time to replace it. Since TCGA is static now, I figured that once core Bioconductor objects were created for the unrestricted datasets, and these were accessible through ExperimentHub, it wouldn't matter which route we took to get them. But of course a GDC wrapper creating MAE objects would provide more, and I'd love to see it happen.

LiNk-NY commented 6 years ago

We've agreed that this should be in a different package, possibly MultiAssayExperimentData. In the meantime, it is not a GenomicDataCommons issue.