broadinstitute / gdctools

Python and UNIX CLI utilities to simplify interaction with the NIH/NCI Genomics Data Commons

How to retrieve data from GDC using Python API #29

Open fbrundu opened 7 years ago

fbrundu commented 7 years ago

Dear all, I have recently been working on a Python package to retrieve data from GDC, and then I found gdctools.

I read the README and the package overview, but it is not clear how to retrieve data using the Python API. In particular, it would be awesome if gdctools could provide data as a pandas DataFrame.

Is this ready, planned, or out of scope for this project?

Thanks, Francesco

noblem commented 7 years ago

Hello Francesco,

Thank you for writing, and for considering using GDCtools! At present the toolset is geared towards mirroring and processing the data to disk, rather than loading it into Python objects (or pandas DataFrames). That kind of Python and pandas functionality is very much the direction in which we hope the toolset will evolve. At the moment we can't promise that we'll be able to add it ourselves in the near term, but if you were to consider extending the toolset we'd be happy to accept your contributions!

Best, Mike

fbrundu commented 7 years ago

Dear Mike,

GDCtools is very interesting, and I look forward to seeing new features. If I find the time to understand how the toolset works, I'd be very happy to contribute.

I think pandas support would be straightforward. For instance, this is from a script I wrote to retrieve GDC gene expression datasets from the GDC API and return them as pandas DataFrames:

import pandas as pd

df = pd.read_table(f'https://gdc-api.nci.nih.gov/data/{fid}', compression='gzip',
                   index_col=0, header=None)

where the variable fid holds the ID of the file to retrieve. read_table() accepts URLs as well as local paths, so I think this is a very straightforward way to turn API calls into pandas DataFrames.
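A minimal sketch of how the one-liner above could be wrapped for reuse. The helper names `gdc_data_url` and `read_expression` are hypothetical and not part of gdctools. The endpoint is the one quoted in this thread; it is kept as a parameter because newer GDC documentation uses `https://api.gdc.cancer.gov`. The sketch also assumes the file is a gzipped, headerless, tab-separated table, as the `read_table()` arguments above imply.

```python
import pandas as pd

# Host taken from the comment above; a parameter because it may have changed.
GDC_DATA_ENDPOINT = "https://gdc-api.nci.nih.gov/data"


def gdc_data_url(fid, endpoint=GDC_DATA_ENDPOINT):
    """Build the download URL for one GDC file ID (hypothetical helper)."""
    return f"{endpoint.rstrip('/')}/{fid}"


def read_expression(source):
    """Load a gzipped, headerless, tab-separated file into a DataFrame.

    `source` may be a URL or a local path; pandas accepts both.
    The first column (e.g. gene IDs) becomes the index.
    """
    return pd.read_table(source, compression="gzip", index_col=0, header=None)
```

With these in place, the call from the comment becomes `df = read_expression(gdc_data_url(fid))`. Because `read_expression` also accepts a local path, the same function could be pointed at files that gdctools has already mirrored to disk, provided they are in the same format.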

Thanks for your tools, Best, Francesco

noblem commented 7 years ago

Neat, thank you so much for sharing this with us, Francesco. I'm connecting this thread to Sam Meier, who will experiment a little and get back to you (here in this thread) with some initial results.