TOPMed VCF Paths on Google

zflamig commented 5 years ago

We have a manifest at gs://topmed-irc-share/genomes/manifest.data-commons-pilot.txt, which lists only the cram and crai files. Is there a companion version for the vcf files?

jonathonl commented 5 years ago

I believe the plan is to amend the existing manifest to include the VCF URIs. In the meantime, I have created a listing for the VCF files at gs://topmed-irc-share/genomes/manifest.vcfs.txt.

jonathonl commented 5 years ago

I plan to create a join of these two files in the next day or two.
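
For anyone who needs the combined view before the joined manifest exists, a join like this could be sketched as below. This is only a minimal sketch: it assumes both manifests are tab-delimited with the sample ID in the first column, which the thread does not confirm, and the function names `read_manifest` and `join_manifests` are hypothetical.

```python
import csv
import io


def read_manifest(handle):
    """Map sample ID -> remaining columns.

    Assumption: tab-delimited text with the sample ID in the first column.
    The real manifest layout may differ.
    """
    rows = {}
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment/header lines starting with '#'.
        if not fields or fields[0].startswith("#"):
            continue
        rows[fields[0]] = fields[1:]
    return rows


def join_manifests(reads, vcfs):
    """Inner join on sample ID: [id, *read columns, *vcf columns]."""
    shared = sorted(reads.keys() & vcfs.keys())
    return [[sid, *reads[sid], *vcfs[sid]] for sid in shared]
```

An inner join drops samples that appear in only one manifest, so it is worth checking the row count of the result against both inputs.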

keanderka commented 5 years ago

Hi @jonathonl - thanks very much, this is so helpful! Merging the files would be great, but just having the vcf manifest unblocks what we are trying to get done today. One additional question: can you confirm the number of TOPMed vcfs available in this dataset? It looks like there are 10246 in the Amazon S3 bucket but 10984 in Google. Thanks again!

zflamig commented 5 years ago

Thank you!

keanderka commented 5 years ago

@jonathonl also, upon closer inspection, it looks like the NWD IDs for the vcfs in the S3 bucket don't match many of the ones in the Google bucket. Could you send a manifest of the complete list of vcfs that we should have? Thanks!

jonathonl commented 5 years ago

The 10,984 samples in this manifest are the samples from the TOPMed freeze 5 call set that are a part of tier1a. The samples on AWS make up the entire freeze 5 call set. For both VCF and CRAM files, the AWS bucket includes many samples beyond tier1a. I'm assuming this explains the discrepancies you are seeing. Does this make sense?

keanderka commented 5 years ago

Hi @jonathonl - this excel sheet might help explain what I'm seeing: TOPMed vcfs - GCP and S3 discrepancies.xlsx

I'm only comparing the vcfs so I'm confused about why GCP has more vcf files if AWS is the bucket that has more samples beyond tier1a?
I'm also assuming that the identifiers that start with "NWD..." are sample IDs and that they should match between the buckets? Or rather, if (as you mention) the AWS bucket should have a greater number of samples beyond tier1a, then everything in GCP should match what is in AWS and my comparison should just show additional unique samples in AWS S3.
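
A comparison of this kind could be sketched roughly as follows. It is only an illustration: it assumes each sample ID appears in the file URI as "NWD" followed by digits, and the names `sample_ids` and `compare` are made up for this example.

```python
import re

# Assumption: sample IDs look like "NWD" followed by digits and appear
# somewhere in each URI. Adjust the pattern if the real IDs differ.
NWD_ID = re.compile(r"NWD\d+")


def sample_ids(uris):
    """Collect the set of sample IDs found in a list of URIs."""
    ids = set()
    for uri in uris:
        match = NWD_ID.search(uri)
        if match:
            ids.add(match.group())
    return ids


def compare(gcp_uris, s3_uris):
    """Split sample IDs into shared, GCP-only, and S3-only sets."""
    gcp, s3 = sample_ids(gcp_uris), sample_ids(s3_uris)
    return {"both": gcp & s3, "gcp_only": gcp - s3, "s3_only": s3 - gcp}
```

If GCP held a strict subset of what is on AWS, `gcp_only` would come back empty, so any IDs in that set are the ones worth asking about.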

We need the vcfs for our tertiary analysis planned this month and I just want to be sure that we have the correct data set.

Thanks for your help!

bheavner commented 5 years ago

Are the AWS and GCP storage locations and the data they contain controlled access? I believe that only people who have been approved for tier1a access should have access to tier1a, and only people who have appropriate IRB and approved access should have access to any other data. If there are freeze 5 vcf files in AWS, is the right access control in place?

keanderka commented 5 years ago

@bheavner we can't list the buckets because of the access controls, so we are just going off of the manifests, and it is the manifests that are inconsistent. What I'm really asking for here is a "source of truth" for which vcf files are included in the tier1a TOPMed dataset. Can you send me an Excel sheet with that information?

jonathonl commented 5 years ago

I do not know who is controlling data access for the s3 bucket these days, but access controls for GCP have been restricted to tier1a for the data commons google group. @keanderka, which s3 manifest are you referring to? The only s3 manifest that I know of (/manifest.txt) does not contain VCF URIs. I guess the more important question is who provided this s3 manifest? I'll need to sort out with them why their list doesn't match ours.

keanderka commented 5 years ago

Hi @jonathonl - the list I'm looking at for S3 is here: https://github.com/dcppc/full-stacks/blob/master/topmed-vcf.tsv and it was emailed by Vivien.

jonathonl commented 5 years ago

Thanks. Though I don't have access to that URL, I tracked down the creator and this should be resolved now. I created a new manifest that includes both reads and genotypes at gs://topmed-irc-share/genomes/manifest.data-commons-tier1a.txt. This list includes a few samples that are not on AWS at the moment, but I expect them to be added to AWS soon. I will delete the VCF-only manifest.

Another thing to note is that ACLs on GCP have only been applied to the year 1 samples in this manifest. The rest of the samples were only recently moved to this bucket. The ACLs for the rest of the tier1a samples are being applied now and will likely take a day to complete.

jonathonl commented 5 years ago

Everything should be resolved now. Let me know if you see any inconsistencies.

jonathancrabtree commented 5 years ago

Hi @jonathonl, is there also a separate manifest file for the TOPMed/dbGaP phenotype files on GCP? On S3, these live under s3://nih-nhlbi-datacommons/phenotype/ (as per the manifest files that Vivien e-mailed with the AWS tier 1A announcement) but I'm not sure who put them there.

jonathonl commented 5 years ago

The phenotype file structure on GCP should be identical to AWS and under the prefix gs://topmed-irc-share/phenotypes/. I'm waiting on confirmation that I can apply tier1a access to all of the phenotypes that exist on GCP.

jonathonl commented 5 years ago

Tier1a access has now been applied to the phenotype files on GCP.

jonathancrabtree commented 5 years ago

Great, thank you!

mvucenovic commented 5 years ago

Hi @jonathonl. I've noticed some potential discrepancies when I checked the files from the manifest stored at gs://topmed-irc-share/genomes/manifest.data-commons-tier1a.txt.

If I am right in assuming that the cram/crai files in the bucket gs://topmed-irc-share/genomes/ should be mirrors of the files in the bucket s3://nih-nhlbi-datacommons/, then there are at least 201 cases where these files do not match in size. I've created a tsv file listing the problematic files: topmed-files-mismatch.txt

If I wrongly assumed that the files should be the same, then ignore this, of course.
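
For reference, a size check like this could be sketched as below. It assumes the file sizes have already been collected into two dictionaries keyed by file name, one per bucket; how they are collected is not shown, and `size_mismatches` is a hypothetical helper name.

```python
def size_mismatches(gcp_sizes, s3_sizes):
    """Return (name, gcp_size, s3_size) for files present in both
    buckets whose sizes in bytes differ.

    Assumption: both arguments map file name -> size in bytes.
    """
    rows = []
    for name in sorted(gcp_sizes.keys() & s3_sizes.keys()):
        if gcp_sizes[name] != s3_sizes[name]:
            rows.append((name, gcp_sizes[name], s3_sizes[name]))
    return rows
```

Equal sizes do not prove that two files are identical, so a checksum comparison would be a stronger follow-up for any files that pass this check.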

jonathonl commented 5 years ago

I'm looking into these. Thanks for bringing this to my attention.

jonathonl commented 5 years ago

Alright, I figured out what is going on. The versions on GCP are the correct versions. We had to fix some read group IDs a while back and it seems some of those fixes never got propagated to AWS. We will work on getting AWS updated.

jonathonl commented 5 years ago

@zflamig, can you close this issue?

dcppc / data-stewards

TOPMed VCF Paths on Google #26