dcppc / data-stewards

Questions and answers about TOPMed, GTEx, and AGR resources.

Cloud access to GTEx data with metadata #20

Open carlkesselman opened 6 years ago

carlkesselman commented 6 years ago

KC7 has been working on developing an interoperable representation and instantiation of metadata that can serve as a mechanism for data exchange between the various full stacks. This is based on the DATS metadata model, represented in JSON-LD and serialized using BDBags. We have been developing ETL processes for encoding data from data stewards in this format.

A prerequisite for this work is that we have access to the underlying (raw) data on an accessible cloud storage platform, such as AWS or Google. It would be very helpful if we could be provided the list of Google or S3 endpoints along with expected file lengths and checksums, so we could work to aggregate this data in a form that can be interoperably consumed by the FS teams and KC7.

zflamig commented 6 years ago

Team Calcium would love this for GTEx and TOPMed too! I think for the checksums, we would love crc32, md5, multipart md5, and sha256 btw.
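For reference, the first three checksum types can be computed in one pass with the Python standard library, and the multipart md5 can be built following S3's documented ETag construction. This is a hedged sketch (function names and the part size are my own, not from any manifest spec):

```python
import hashlib
import zlib

def multi_checksum(data: bytes) -> dict:
    """Compute crc32, md5, and sha256 for the same bytes."""
    return {
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }

def multipart_md5(data: bytes, part_size: int) -> str:
    """S3-style multipart md5 (ETag): md5 of the concatenated per-part
    md5 digests, suffixed with the part count."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digest = hashlib.md5(b"".join(hashlib.md5(p).digest() for p in parts)).hexdigest()
    return f"{digest}-{len(parts)}"

print(multi_checksum(b"hello")["md5"])  # → 5d41402abc4b2a76b9719d911017c592
```

Note that the multipart md5 depends on the part size used at upload time, so the manifest would also need to record that value for consumers to reproduce it.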

jnedzel commented 6 years ago

We have already created a manifest file of the GTEx raw data, with URLs and MD5 checksums.


cricketsloan commented 6 years ago

How do we get that manifest?

jnedzel commented 6 years ago

I've committed three manifest files into this repository.

mikedarcy commented 6 years ago

The public data manifest recently added here contains a file_size field with values like "7.9 MiB", "15.12 KiB", and "1.82 GiB". This is inconsistent with how the file_length field is represented (in bytes) in the private data manifests (e.g., here).

It is important to provide an integer representation of the exact file size in bytes, as some tools might need this data to function properly (or optimally). Having the string-based, human-readable file_size is useful, but not a sufficient replacement for the true byte count of the data. The former can always be calculated from the latter by the consumer of the manifest.
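To illustrate that the derivation only goes one way, here is a minimal sketch of computing the human-readable string from the exact byte count (the function name and formatting choices are my own):

```python
def human_readable(num_bytes: int) -> str:
    """Derive a display string like "33.11 KiB" from an exact byte count."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{int(size)} B" if unit == "B" else f"{size:.2f} {unit}"
        size /= 1024.0

print(human_readable(595))    # → 595 B
print(human_readable(33902))  # → 33.11 KiB
```

Going the other direction is lossy: "33.11 KiB" could correspond to many different exact byte counts, which is why tools need the integer value.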

jnedzel commented 6 years ago

Hi Mike, this is now fixed.

mikedarcy commented 6 years ago

There is an issue with the public data manifest where some rows duplicate file_name yet the URL fields refer to different locations in cloud storage. Sometimes the same actual file content is being referenced (based on the md5 provided), and sometimes an altogether different file is referenced.

For example, these two rows refer to the same content, only differing by the URL in storage:

GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx    gs://gtex_analysis_v6p/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx 33902   4deba6c7c24b5cb5ed01df32518cda2d    https://storage.googleapis.com/gtex_analysis_v6p/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx
GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx    gs://gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx  33902   4deba6c7c24b5cb5ed01df32518cda2d    https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx

Whereas these two rows refer to different content, but with the same filename:

description.txt gs://gtex_analysis_v6/reference/description.txt 816 d8dc9a2e3ec3e0f5c6f0c747463009f4    https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt
description.txt gs://gtex_analysis_v6/annotations/description.txt   595 fd6e6d2fedb460d6a99b94c87718dd05    https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt

It is difficult to determine how to handle these kinds of records programmatically. If the files are in fact the same content, then we don't really need the file uploaded to multiple different paths in cloud storage, which results in duplicate records in the manifest and additional storage overhead. Otherwise, if the files referenced are in fact different (by virtue of md5 and cloud path), then we need a unique filename (or a relative path, e.g. gtex_analysis_v6/annotations/description.txt) to correlate with the referenced location.

Here's a list of all of the records containing duplicated file_name fields I encountered while processing the manifest (note that these are only the entries encountered AFTER the first occurrence of each name; the first occurrences are not included in this list):

{"filename": "GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx", "length": "33902", "md5": "4deba6c7c24b5cb5ed01df32518cda2d"}
{"filename": "GTEx_Data_V6_Annotations_SampleAttributesDS.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDS.txt", "length": "6285091", "md5": "6273af715b43ef89c7f9f9af8524031b"}
{"filename": "GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt", "length": "11666", "md5": "5e31c42421f0ff27a4c83872027012d5"}
{"filename": "GTEx_Data_V6_Annotations_SubjectPhenotypes_DD.xlsx", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SubjectPhenotypes_DD.xlsx", "length": "22212", "md5": "ad5b5e461037c7ab14d920941ab0b821"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt", "length": "595", "md5": "fd6e6d2fedb460d6a99b94c87718dd05"}
{"filename": "Homo_sapiens_assembly19.fasta.gz", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/Homo_sapiens_assembly19.fasta.gz", "length": "857841780", "md5": "d8c8e4e848f9a16dd25f741720e668ad"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt", "length": "816", "md5": "d8dc9a2e3ec3e0f5c6f0c747463009f4"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/rna_seq_data/description.txt", "length": "655", "md5": "6b693c6b74d84c506576f3abc1c0e367"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/description.txt", "length": "677", "md5": "b73ab7510ad8d7eaa046606b11bb13cd"}
{"filename": "README.eqtls", "url": "https://storage.googleapis.com/gtex_analysis_v4/single_tissue_eqtl_data/README.eqtls", "length": "860", "md5": "bd4becc696b20737dfeda17dfdd61821"}
{"filename": "GTEx_genot_imputed_variants_info4_maf05_CR95_CHR_POSb37_ID_REF_ALT.txt.zip", "url": "https://storage.googleapis.com/gtex_analysis_pilot_v3/reference/GTEx_genot_imputed_variants_info4_maf05_CR95_CHR_POSb37_ID_REF_ALT.txt.zip", "length": "59886339", "md5": "5cf035742a19634b71605f93274c86c9"}
{"filename": "README.eqtls", "url": "https://storage.googleapis.com/gtex_analysis_pilot_v3/single_tissue_eqtl_data/README.eqtls", "length": "1294", "md5": "d4897fc04591c7aa5c8c5d7c7f2c3013"}

jnedzel commented 6 years ago

Hi Mike:

These are separate logical files. You can see them here: https://gtexportal.org/home/datasets

We release a separate, complete set of files with each new release. Between releases, sometimes those files change, sometimes they don't. But they are separate physical assets in different locations in our GCS buckets.

mikedarcy commented 6 years ago

Duplicate file content for logical files aside, if I am trying to prepare a bdbag for this dataset, I have no way to disambiguate the following unique files unless I "guess" at your intended file system organization by inspecting the logical path in the URL:

{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt", "length": "816", "md5": "d8dc9a2e3ec3e0f5c6f0c747463009f4"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/rna_seq_data/description.txt", "length": "655", "md5": "6b693c6b74d84c506576f3abc1c0e367"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/description.txt", "length": "677", "md5": "b73ab7510ad8d7eaa046606b11bb13cd"}

In a situation where I am trying to materialize all of the resources back into a local filesystem, I cannot resolve where all of these files with the same name are supposed to be placed without guessing your intentions based on the URL path.

If you included the relative path that is already part of the URL field as part of your file_name field, my issue would be solved: the manifest would then be an authoritative statement of how the downloaded data should be organized.

It would also be great to have another field like dataset where you store the dataset name, e.g., "gtex_analysis_v6", which would then give me the additional explicit and authoritative information that I need to understand how the files should logically be grouped together.

For example, something like:

file_name     dataset   object_location file_size   md5_hash    public_url
annotations/description.txt gtex_analysis_v6    gs://gtex_analysis_v6/annotations/description.txt   595 fd6e6d2fedb460d6a99b94c87718dd05    https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt
reference/description.txt   gtex_analysis_v6    gs://gtex_analysis_v6/reference/description.txt 816 d8dc9a2e3ec3e0f5c6f0c747463009f4    https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt

The above is unambiguous and authoritative, and provides the consumer everything needed to logically restructure the files on the filesystem without requiring a priori knowledge of your cloud bucket storage hierarchy. It also has the benefit of allowing you to change those storage paths without affecting how the downstream consumer organizes the data.
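In the meantime, the only option is the guess described above: inferring both fields from the URL. A sketch of that fallback (the function name is my own, and the assumption that the first path component is the dataset is exactly the a priori knowledge the manifest should make explicit):

```python
from urllib.parse import urlparse

def infer_dataset_and_path(url: str) -> tuple:
    """Guess (dataset, relative_path) from a storage.googleapis.com URL by
    treating the first path component as the dataset name."""
    path = urlparse(url).path.lstrip("/")
    dataset, _, rel_path = path.partition("/")
    return dataset, rel_path

url = "https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt"
print(infer_dataset_and_path(url))
# → ('gtex_analysis_v6', 'annotations/description.txt')
```

This works for the current bucket layout, but it breaks silently if the storage hierarchy ever changes, which is the argument for carrying the fields explicitly in the manifest.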

jnedzel commented 6 years ago

We can add a release column next week.

mikedarcy commented 6 years ago

Awesome! What about the relative path information? Including it as part of the file_name or adding another field like path would pretty much solve all of my issues...

jnedzel commented 6 years ago

Sure, we can do that.

jnedzel commented 6 years ago

@mikedarcy I haven't forgotten about this. I will get to it next week.

mikedarcy commented 6 years ago

No problem. I am going to go ahead and make some bag versions from the manifest as-is. I will use the paths from the URLs to map back to the local file system structure. If things change I can easily regenerate the bags from a new version of the manifest.

mikedarcy commented 6 years ago

Team Argon has created bdbags for each release of the public GTEx data listed in this manifest: https://github.com/dcppc/data-stewards/blob/master/gtex/v7/manifests/public_data/gtex_manifest_file.txt

In addition to these bags, we've created a bag for the V6 release that includes file references included in both the V6 and V6p (patch) releases, and an "uber-bag" that includes references to files in all releases (basically a bag of the entire manifest).

We have assigned minid identifiers to each bag, and the bag content itself can be downloaded by visiting the landing page for the corresponding identifier and downloading the zip file of the bag. You can use the bdbag Python program (https://github.com/fair-research/bdbag) to automatically download a bag's constituent files and verify the content checksums. We have independently validated all of the bags posted here by downloading the content and running the bag validation process.
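The fetch-and-validate workflow looks roughly like this (a sketch: the local bag directory name is hypothetical, and the flags are the bdbag CLI's documented options):

```shell
# Install the bdbag client (assumes Python/pip is available).
pip install bdbag

# Resolve (download) all remote file references listed in the bag's fetch.txt.
bdbag --resolve-fetch all GTEx_Analysis_V6_bag/

# Recompute and compare every payload checksum against the bag's manifests.
bdbag --validate full GTEx_Analysis_V6_bag/
```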

"GTEx Analysis Pilot V3 in zipped bag format": http://identifiers.org/minid:b9vm4j
"GTEx Analysis V4 in zipped bag format": http://identifiers.org/minid:b9qt2m
"GTEx Analysis V6 in zipped bag format": http://identifiers.org/minid:b9m401
"GTEx Analysis V6p in zipped bag format": http://identifiers.org/minid:b9g98j
"GTEx Analysis V6 (including V6p patch) in zipped bag format": http://identifiers.org/minid:b9bm4w
"GTEx Analysis V7 in zipped bag format": http://identifiers.org/minid:b96t2z
"GTEx Analysis (all releases) in zipped bag format": http://identifiers.org/minid:b9341r

mikedarcy commented 6 years ago

In the protected data file manifests:

There are md5 checksums and file sizes for the cram file in the cram_file_md5 and cram_file_size fields, but the equivalent fields are missing for the cram_index file. Is this an oversight, or is there some other reason for this data not being present?

francois-a commented 6 years ago

This was an oversight. We'll provide updated files shortly.