GMOD / jbrowse-plugin-gdc

JBrowse 2 plugin for integrating with GDC resources
6 stars 3 forks

Process GDC data that is downloaded as a .gz #32

Closed carolinebridge closed 3 years ago

carolinebridge commented 3 years ago

Some controlled GDC data is compressed into a .gz; when the user requests these files we should be able to decompress the file and grab the relevant .tsv or .vcf.

Currently the closest I've gotten to solving this is using pako.

cmdcolin commented 3 years ago

Do you have a link to a file in this format?

carolinebridge commented 3 years ago

controlled: https://portal.gdc.cancer.gov/files/31ae8522-dd6a-443e-af5f-2bd0bea9da4e

open: https://portal.gdc.cancer.gov/files/1b466557-76dd-4a26-9df4-49172400fb40 , though this one is a little unusual: it has a .gz nested within the .gz. A large sample of compressed open-data files look like this; files appear to be zipped whenever a MANIFEST is included with the data file.

cmdcolin commented 3 years ago

To get a direct download of the file, I think you can take the UUID of the file and use a URL format like this

"https://api.gdc.cancer.gov/data/"+file_uuid

so you get something like

https://api.gdc.cancer.gov/data/1b466557-76dd-4a26-9df4-49172400fb40

This should be just the 486778af-8f8e-4000-9812-409604e274a5.FPKM.txt.gz, without the manifest and such.
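As a tiny sketch of that URL construction (the helper name is ours, not from the plugin):

```typescript
// Hypothetical helper: build the direct-download URL for a GDC file
// UUID using the /data/ endpoint format described above.
function gdcDownloadUrl(fileUuid: string): string {
  return `https://api.gdc.cancer.gov/data/${fileUuid}`
}

// e.g. gdcDownloadUrl('1b466557-76dd-4a26-9df4-49172400fb40')
```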

I stumbled on this somewhat randomly, but it is how the SegmentCNVAdapter works: it downloads the file directly like that.

carolinebridge commented 3 years ago

Downloading the file directly using the /data/ endpoint isn't the issue; it's that the response to the request is still a compressed file that needs to be decompressed.

Currently I have:

      const location = { uri: `http://localhost:8010/proxy/data/${query}` } as FileLocation
      const lines = (await openLocation(location).readFile({
        headers: { 'X-Auth-Token': `${token}` },
        encoding: 'utf8',
      })) as string

where "query" is the file UUID that the user enters, and "token" is an auth token the user also enters.

This is almost identical to the SegCNV work, where `lines` are the lines of the file that can then be read. If the requested file is uncompressed, this works perfectly and you get each line of data from the file. If the file is compressed, you instead get a Uint8Array (which, since it's being parsed as a string in that snippet, comes out as garbage) that, I believe, needs to be decompressed to get at the actual data file we want.
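One way an adapter could tell the two cases apart, sketched with a hypothetical helper (not from the plugin): gzip data always begins with the magic bytes 0x1f 0x8b, so the raw response bytes can be sniffed before deciding whether to decompress.

```typescript
// Hypothetical check, not from the plugin: gzip streams start with
// the two magic bytes 0x1f 0x8b, so the raw response bytes can be
// inspected before choosing a decompression path.
function isGzip(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b
}
```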

Here are two Postman requests, one to a .gz resource and one to a non-.gz resource, both open data, with their responses:

[Screenshots: the two Postman responses, 2021-06-15]

Forgive me if I'm misunderstanding what you're getting at; please clarify if I'm missing something!

cmdcolin commented 3 years ago

Indeed, it may be necessary to unzip the contents.

In this case, we can try calling pako's inflate on the result, and it seems to work:

https://codesandbox.io/s/damp-monad-0uwso?file=/src/App.js
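A minimal round-trip sketch of the decompression step, using Node's built-in zlib in place of pako (pako's inflate does the equivalent job in the browser); the sample data is made up for illustration:

```typescript
import { gzipSync, gunzipSync } from 'zlib'

// Round-trip for illustration: compress some made-up TSV text, then
// decompress it and decode the resulting Buffer back to a string, as
// the compressed response bytes from the GDC API would be handled.
const original = 'gene_id\tfpkm\nENSG00000000003\t12.5\n' // made-up sample
const compressed = gzipSync(Buffer.from(original, 'utf8'))
const text = gunzipSync(compressed).toString('utf8')
// text === original
```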

cmdcolin commented 3 years ago

(If you are using openLocation instead of fetch, you can omit the "utf8" argument and get a Node.js Buffer, which is similar to an ArrayBuffer, and call buffer.toString instead of using TextDecoder.)
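For illustration, the two decoding paths mentioned here produce the same string:

```typescript
// A Node.js Buffer can be decoded with toString, or handed to
// TextDecoder like any other Uint8Array view; the results match.
const bytes = Buffer.from('486778af.FPKM.txt contents\n', 'utf8')
const viaBuffer = bytes.toString('utf8')
const viaDecoder = new TextDecoder('utf-8').decode(bytes)
// viaBuffer === viaDecoder
```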

carolinebridge commented 3 years ago

Looks like I was using the wrong pako functions when I first encountered the issue! That snippet is great, thanks Colin.

I'll try using openLocation as you've advised as well.

carolinebridge commented 3 years ago

This needs to be done on a per-adapter basis rather than in a single processing stream before passing data to an adapter; closing with the stipulation that bgzipped resources will be unzipped in their respective adapters.
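A rough sketch of what that per-adapter handling could look like; the helper name and shape are hypothetical, and Node's zlib stands in for pako here:

```typescript
import { gunzipSync } from 'zlib'

// Hypothetical adapter-side helper: sniff the gzip magic bytes,
// decompress if needed, then split the decoded text into lines for
// the adapter to parse.
function bytesToLines(raw: Uint8Array): string[] {
  const gzipped = raw.length >= 2 && raw[0] === 0x1f && raw[1] === 0x8b
  const text = gzipped
    ? gunzipSync(Buffer.from(raw)).toString('utf8')
    : Buffer.from(raw).toString('utf8')
  return text.split('\n').filter(line => line !== '')
}
```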