do you have a link for a file that has this format?
controlled: https://portal.gdc.cancer.gov/files/31ae8522-dd6a-443e-af5f-2bd0bea9da4e
open: https://portal.gdc.cancer.gov/files/1b466557-76dd-4a26-9df4-49172400fb40 , except this one is a little nutso and has a .gz within the .gz; a large proportion of compressed files with open data appear like this. files appear to be zipped whenever a MANIFEST is included with the data file
To get a direct download of the file, I think you can take the UUID of the file and use a URL format like this
"https://api.gdc.cancer.gov/data/"+file_uuid
so you get something like
https://api.gdc.cancer.gov/data/1b466557-76dd-4a26-9df4-49172400fb40
this should be just the 486778af-8f8e-4000-9812-409604e274a5.FPKM.txt.gz without the manifest and such
I stumbled on this sort of randomly but it is how the SegmentCNVAdapter works, it directly downloads the file like that
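e.g. a minimal sketch (assuming an async context, with file_uuid as above):

// request the file directly from the GDC /data/ endpoint
const response = await fetch(`https://api.gdc.cancer.gov/data/${file_uuid}`)
if (!response.ok) {
  throw new Error(`HTTP ${response.status} fetching ${file_uuid}`)
}
// raw bytes of the file, which may or may not be gzipped
const data = await response.arrayBuffer()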
downloading the file directly using the /data/ endpoint isn't the issue; it's that the response to the request is still a compressed file that needs to be decompressed
currently I have:
// point openLocation at the local CORS proxy in front of the GDC /data/ endpoint
const location = {uri: `http://localhost:8010/proxy/data/${query}`} as FileLocation
// pass the user's auth token through and read the response as a utf8 string
const lines = (await openLocation(location).readFile({headers: {'X-Auth-Token': `${token}`}, encoding: 'utf8'})) as string
where "query" is the file uuid that the user enters, and "token" is an auth token the user also enters
this is almost identical to the SegCNV work, where lines are the lines of the file that can then be read. if the file being requested is uncompressed, this works perfectly and you get each line of data from the file; if the file is compressed, you get a Uint8Array (which, since it's being parsed as a string in that snippet, comes out as blob garbage) that, i believe, needs to be decompressed to get at the actual data file we want
here are two postman requests, one to a .gz resource and one to a non-.gz resource, both open data, with their responses:
forgive me if i'm misunderstanding what you're getting at; please clarify if i am missing something!
indeed, it may be necessary to do an unzip on the contents.
in this case, we can try calling pako inflate on the result and it seems ok
(if you are using openLocation instead of fetch, you can omit the "utf8" argument to get a node.js Buffer, which is similar to an array buffer, and then call buffer.toString() instead of using TextDecoder)
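something roughly like this (just a sketch; pako autodetects the gzip wrapper when inflate is called with no options):

import { inflate } from 'pako'

// fetch the raw bytes, inflate them, then decode and split into lines
const buf = await (await fetch(`https://api.gdc.cancer.gov/data/${file_uuid}`)).arrayBuffer()
const text = new TextDecoder('utf8').decode(inflate(new Uint8Array(buf)))
const lines = text.split('\n')

// with openLocation, same idea: omit the encoding option, inflate the
// returned Buffer, and call .toString('utf8') on the result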
looks like i was using the wrong pako functions when i first encountered the issue! that snippet is great, thanks colin
i'll try to use openLocation as you've advised as well
this needs to be done on a per-adapter basis rather than in a single processing stream before data is passed to an adapter; closing with the stipulation that bgzipped resources will be unzipped in their respective adapters
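e.g. each adapter could call a small hypothetical helper like this, checking the gzip magic bytes (0x1f 0x8b) before inflating:

import { inflate } from 'pako'

// hypothetical per-adapter helper: inflate only if the data is actually gzipped
function maybeUnzip(data: Uint8Array): string {
  const isGzipped = data.length > 2 && data[0] === 0x1f && data[1] === 0x8b
  return new TextDecoder('utf8').decode(isGzipped ? inflate(data) : data)
}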
some controlled gdc data is compressed into a .gz; when the user requests these files we should be able to uncompress the file and grab the relevant .tsv or .vcf
currently the closest i've gotten to solving this is using pako