dcppc / data-stewards

Questions and answers about TOPMed, GTEx, and AGR resources.

GTEx XML submission for SRA/SDDP #18

Open webermn opened 6 years ago

webermn commented 6 years ago

To facilitate DCPPC access to GTEx data, can the Broad Data Steward team please submit XML to the NCBI Sequence Read Archive (SRA) Sequence Data Delivery Pilot (SDDP) for all GTEx data that is being shared with the Data Commons Consortium?

The submission should describe the data and its location on both the Google and Amazon clouds using the XML schema as described in the attached PDF and in these examples: ftp://ftp.ncbi.nih.gov/sra/examples/cloud_examples/

Adam Stine from NCBI (stineaj@ncbi.nlm.nih.gov) is the point of contact for questions regarding this submission and can assist with linking this submission to existing GTEx records. I'm also happy to set up a meeting with relevant folks from Broad and elsewhere to discuss in more detail.

Would a reasonable target for task completion be, say, sometime next week?

cc/ @francois-a @jnedzel @saulakravitz

Attachment: SRA-XMLCloudFormatGuide-250418-1832-37.pdf

owhite commented 6 years ago

Nick - out of curiosity, could you outline what role this plays in connecting DCPPC data and GTEx? Just curious about the mechanics of what's happening.

clarisca commented 6 years ago

@webermn: Where can we find more information about the "NCBI Sequence Read Archive (SRA) Sequence Data Delivery Pilot (SDDP)"? Is this a pilot developed for the DCPPC? @krobasky

krobasky commented 6 years ago

I actually gave my bioinformatics students a practicum on SRA so they would be familiar with the very important SRA tools.

I had students download NA12878 reads (human, from the "Genome in a Bottle" project), but I guess SDDP allows you to test SRA tools with other real human data (public 1kG data, if I understand correctly): https://www.ncbi.nlm.nih.gov/bioproject/416033

Are there credentials publicly available to test SRA tools against 1kG (SDDP)?

Thanks!


saulakravitz commented 6 years ago

Hi Folks, Steve Sherry (NCBI) and I lead the SDDP effort. Perhaps we could arrange a time for an introduction and live demo on AWS or GCP? Data that is currently available includes some TOPMed studies (controlled access) and 1000 Genomes data (open access, no credentials needed). Inclusion of other data sets is in process. (BTW: via SDDP you get access to the aligned data directly as files, so you can use your familiar tools (samtools, gatk, whatever) directly on the data.)

Regards, Saul


krobasky commented 6 years ago

Yes please to SDDP demo!

webermn commented 6 years ago

@owhite / @clarisca / @krobasky:

Thanks for your interest. I agree that getting more information out on this will be useful.

Hopefully the comment from @saulakravitz provides some additional context about the Sequence Data Delivery Pilot and the associated tools on AWS and Google (including examples with TOPMed and 1000 Genomes data), but perhaps we can also consider some or all of the following as well for the DCPPC:

  1. Determine for which group(s) an SDDP presentation/demo/discussion would be useful. (Perhaps full stacks and KC6 initially? Any others?)

  2. Have those who are interested review available docs and materials (see list below) in advance of a potential meeting and live demo

  3. Figure out a way to more broadly socialize aspects of data access and management; SDDP is one piece, but it may help to consider it alongside other approaches and to determine how to engage more than just those who subscribe to notifications on this issues list. (For one, I think it would be great to learn what others in the Consortium are already doing that could offer alternatives/improvements in this area.)

I hope this helps. I’m glad to find time to discuss further, and welcome thoughts on how to do that efficiently and with the right audience(s).

krobasky commented 6 years ago

Two Questions:

  1. If I understand correctly, these tools are for working with FASTQs, but aren't the Full Stacks intended to work only with the VCFs for the TOPMed data? I ask because it changes the scale considerably - e.g., I see a single study with 90 TB of run data

  2. What does the following error mean?:

Following along from the slides, I downloaded fusera-linux-amd64 and gave it a try from my Data Commons-provisioned AWS VM. I found an NA12878 run that's hosted in an S3 bucket (e.g., DATAStore Location in the Run Selector = gs.US s3.us-east-1). The run is SRR944152, which I put in topmed.txt and ran:

$ ./fusera-linux-amd64 mount --acc-file topmed.txt mnt

Should that work? I got an error:

invalid arguments: gave location of s3.us-east-2, location must match one of these possibilities:

================
gs.[region]
================

regions for gs:
----------------

US

us-east1-b us-east1-c us-east1-d

us-east4-a us-east4-b us-east4-c

us-central1-a us-central1-b us-central1-c us-central1-f

us-west1-a us-west1-b us-west1-c

================
s3.[region]
================

regions for s3:
----------------

us-east-1

================
For accessing files on ncbi, use the location ftp-ncbi
================

starting fusera with given arguments failed, please review the help with -h

saulakravitz commented 6 years ago

Hi Kimberly, Thanks for your questions.

  1. “If I understand correctly, these tools are for working with FASTQs, but aren't the Full Stacks intended to work only with the VCFs for the TOPMed data? I ask because it changes the scale considerably - e.g., I see a single study with 90 TB of run data”

The SDDP supports sequence data types (fastq, bam, cram) and their associated indices, as well as genotype data types (vcf, bcf, and their associated indices). The submitted TOPMed data included cram/crai and vcf/csi files for individual samples. The main constraint of the SDDP is that data must be amenable to the dbGaP consent model, so any data files that derive from an individual sample will fit. When you launch fusera, you have ACCESS to tons of data, but not a single byte has been transferred/copied. The data remains in cloud storage, but is presented to the user as a file system. Only when your program actually operates on the mounted files will bytes be streamed to your VM. A fusera file system with 1000 accessions' worth of TOPMed data can be mounted on AWS within about 20 seconds (GCE performance will soon approach AWS performance).
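To illustrate the "bytes are only streamed when you touch the file" point, here is a minimal Python sketch. It assumes the mounted directory behaves like an ordinary file system, and the path in the usage note is purely hypothetical (the real layout under the mount point may differ). The helper names are made up for this example and are not part of fusera.

```python
def read_byte_range(path, offset, length):
    """Read `length` bytes starting at `offset`, leaving the rest of the file untouched."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


def looks_like_cram(path):
    """Check only the 4-byte magic number ("CRAM") at the start of the file."""
    return read_byte_range(path, 0, 4) == b"CRAM"
```

For example, `looks_like_cram("mnt/SRR944152/sample.cram")` (hypothetical path) asks for just the first four bytes of the file. How many bytes are actually fetched from cloud storage for such a read depends on the FUSE layer and is not something this sketch can guarantee.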

  2. What does the reported fusera error mean?

This is a subtle issue, and our documentation and error messages need to address it better. The API underlying fusera constrains access to data objects to VMs within the same region as the data. In other words, if the data is in us-east-1, you can only access it from within us-east-1. The data objects you are trying to reach are in us-east-1. Fusera detects the location of your VM (us-east-2) and, since the data for this accession is not available in that region, reports an error. The Google copy of this dataset is in a multi-regional bucket, so any of the US regions can access the data.

Try running with a VM in s3.us-east-1 or gs.US.

The reason for this behavior is that moving data between regions isn’t free. The signedURL-based architecture of SDDP puts the burden of any costs on the issuer of the signed URL, in this case the SDDP account. One of the goals of the SDDP is to enable the users to bring their compute to the (freely) provisioned data, with a predictable cost to the SDDP. This constraint is based on a policy decision. It would reduce the complexity of our implementation to eliminate this constraint.
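The region rule can be sketched as a small Python check. This is only an illustration of the policy as described in this thread, not fusera's actual implementation: the function name is hypothetical, the location strings follow the `<cloud>.<region>` form from the error message, and treating `gs.US` as reachable from any `us-` region or zone is an assumption based on the "multi-regional bucket" remark.

```python
def can_access(vm_location, data_locations):
    """Return True if a VM at `vm_location` may read data stored at any of `data_locations`.

    Locations look like "s3.us-east-1" or "gs.US" (the form used in the error message).
    """
    vm_cloud, _, vm_region = vm_location.partition(".")
    for location in data_locations:
        cloud, _, region = location.partition(".")
        if cloud.lower() != vm_cloud.lower():
            # Data on the other cloud is not reachable from this VM.
            continue
        if cloud.lower() == "gs" and region.upper() == "US":
            # Assumption: a multi-regional US bucket serves every US region and zone.
            if vm_region.upper() == "US" or vm_region.lower().startswith("us-"):
                return True
        elif region.lower() == vm_region.lower():
            # Single-region data: the VM must be in exactly the same region.
            return True
    return False
```

With the run from this thread, whose locations are listed as `gs.US` and `s3.us-east-1`, `can_access("s3.us-east-2", ["gs.US", "s3.us-east-1"])` is false, while the same call with `"s3.us-east-1"` or `"gs.us-east1-b"` is true.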

Regards, Saul


krobasky commented 6 years ago

Excellent, thorough answers, thank you!

Regarding 1) Having access to TOPMed FASTQs opens up a lot of possibilities. Meanwhile, I've tried analyzing data on FUSE-mounted S3 buckets and the mount always winds up disappearing; seemingly the I/O can't keep up. Has fusera been designed to overcome those challenges, or should we build accommodations into the analytical tools?

Regarding 2) 👍 I'm not sure how I wound up on us-east-2, but you're right - I've switched over to us-east-1 to try again - thanks!

So now it hangs... I don't mean to hijack this thread, is there a github issue tracker where I should log this? -- either way, thanks for your help!

$ mkdir mnt
$ time ./fusera-linux-amd64 mount --acc-file topmed.txt mnt
^C
real    13m10.035s
user    0m0.100s
sys     0m0.000s

saulakravitz commented 6 years ago

Hi Kimberly, Yes, there is an issue tracker on GitHub (https://github.com/mitre/fusera/issues), or you can contact Matt Bianchi (mbianchi@mitre.org), who developed fusera, or me (saul@mitre.org).

Regarding performance: We have done some initial, limited, benchmarking using fusera and it performs well. I’d be interested in your feedback once you get going.

Regarding hanging: fusera is a FUSE file system, so its process must be running to service file operations against the mounted file system. You can start it running in the background. When the file system is ready, a .initialized file is created in the mounted file system, so you can wait for that. See the wiki for an explanation: https://github.com/mitre/fusera/wiki/Scripting-with-Fusera
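A rough Python sketch of that pattern: start the mount in the background, then poll for the `.initialized` marker. The function names, the 300-second default timeout, and the one-second polling interval are assumptions made for this example. The command line is the one used earlier in this thread, and the wiki page remains the reference for scripting and unmounting.

```python
import os
import subprocess
import time


def mount_in_background(binary, acc_file, mount_dir):
    """Start the mount as a child process and return immediately (it keeps running)."""
    os.makedirs(mount_dir, exist_ok=True)
    return subprocess.Popen([binary, "mount", "--acc-file", acc_file, mount_dir])


def wait_for_mount(mount_dir, timeout=300.0, poll_interval=1.0):
    """Poll for the .initialized marker; return True once it exists, False on timeout."""
    marker = os.path.join(mount_dir, ".initialized")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(marker):
            return True
        time.sleep(poll_interval)
    # One last check in case the marker appeared during the final sleep.
    return os.path.exists(marker)


# Usage (not executed here, since it needs the fusera binary):
#   proc = mount_in_background("./fusera-linux-amd64", "topmed.txt", "mnt")
#   if wait_for_mount("mnt"):
#       ...  # run samtools or other tools against files under mnt/
```

Seen this way, the foreground command in the `time` output above was not hung: it was most likely serving the mount until it was interrupted.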

Regards, Saul
