broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

NIO version of CNV WDLs. #4806

Open samuelklee opened 6 years ago

samuelklee commented 6 years ago

Just realized CNV WDLs are not using NIO in FireCloud. This is as simple as changing File to String for supported files. Not sure if these need to live in our repo (I see we have a M2 NIO WDL), I'd be fine with them just living in FireCloud.

@bshifaw would you be OK making the changes in FireCloud for the next release? If not, I can add an NIO version to the repo.

@LeeTL1220 perhaps something to add to the style guide (if it's not already there)?

LeeTL1220 commented 6 years ago

I assume this relates to #4728

LeeTL1220 commented 6 years ago

Something is discussed in the style guide about NIO.

samuelklee commented 6 years ago

What is the timeline for portable WDL-NIO?

This would make things like performing preliminary analyses on subsets of contigs, etc. go a bit faster. I agree that this is not a common use case, but since it's such a small amount of work, I don't see the harm. Would be nice to be consistent with other Featured WDLs, if they're all using NIO as well (is this true?)

samuelklee commented 6 years ago

For example, I'm running some WGS tests on chr20-22. The BAMs take ~1hr+ to localize but it only takes ~5 minutes to collect read/allelic counts. Getting preempted adds a huge amount of unnecessary overhead.

bshifaw commented 6 years ago

Sure, but i'm not familiar with what the supported files would be. Would this be any input, reference, and resources files in google buckets? Also, just glancing over the mutect2 nio looks like File is changed to String only in the task block, true?

samuelklee commented 6 years ago

Some of the files in the CNV workflows are not yet supported by NIO, so it won’t be as easy as changing File to String everywhere. I think the ref and the BAMs are the most important ones to change, as Lee pointed out.

Let’s not worry too much about getting this in the next release or anything like that—-as soon as is convenient is fine. But let’s decide on things like whether we need a copy in this repo and perhaps have a general strategy going forward.

droazen commented 6 years ago

@samuelklee Let us know if you want a quick consult on converting your tools to NIO -- happy to help!

samuelklee commented 6 years ago

Yup, this is tied in to the discussion of tabular data we touched on in https://github.com/broadinstitute/gatk/issues/4717.

CNV tabular data/files are represented as metadata (which is a SAM header that includes sample name and/or sequence dictionary) and records (which may or may not be locatable). We can probably make do with just NIO support for locatable data, since these tend to be the largest and are currently the ones that are needlessly localized.

Let's discuss more in person!

samuelklee commented 6 years ago

@LeeTL1220 Actually, since the time for coverage collection itself is comparable to the time it takes to localize the BAM, this probably also helps even when you're not just running on snippets.

Building a PoN on chr20-22 took ~2.5 hours without NIO (3/50 BAMs were preempted, some more than once), but it only took ~40 minutes with an NIO WDL I quickly whipped up (and 1/4 of that was because I was unlucky enough to get preempted on one BAM after 10 minutes)!

sooheelee commented 6 years ago

FYI, next release of Cromwell is supposed to automatically interpret File types as Sting when given a gs:// bucket file.

Sent from an iPhone and typed with my thumbs.

On May 24, 2018, at 3:48 PM, samuelklee notifications@github.com wrote:

@LeeTL1220 Actually, since the coverage collection itself is comparable to the time it takes to localize the BAM, this probably also helps even when you're not just running on snippets.

Building a PoN on chr20-22 took ~2.5 hours without NIO (3/50 BAMs were preempted, some more than once), but it only took ~40 minutes with an NIO WDL I quickly whipped up (and 1/4 of that was because I was unlucky enough to get preempted on one BAM after 10 minutes)!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

LeeTL1220 commented 6 years ago

That would be really bad. Are you sure?

On Fri, May 25, 2018, 08:30 sooheelee notifications@github.com wrote:

FYI, next release of Cromwell is supposed to automatically interpret File types as Sting when given a gs:// bucket file.

Sent from an iPhone and typed with my thumbs.

On May 24, 2018, at 3:48 PM, samuelklee notifications@github.com wrote:

@LeeTL1220 Actually, since the coverage collection itself is comparable to the time it takes to localize the BAM, this probably also helps even when you're not just running on snippets.

Building a PoN on chr20-22 took ~2.5 hours without NIO (3/50 BAMs were preempted, some more than once), but it only took ~40 minutes with an NIO WDL I quickly whipped up (and 1/4 of that was because I was unlucky enough to get preempted on one BAM after 10 minutes)!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/4806#issuecomment-392042783, or mute the thread https://github.com/notifications/unsubscribe-auth/ACDXk5K6EpZQft219BT7K8DoONgZqx2hks5t1_lQgaJpZM4UMBDI .

samuelklee commented 6 years ago

Actually, I'm noticing that while using NIO for the BAM for read/allelic-count collection is usually much more efficient, using NIO for the reference in other tasks can be much slower. Perhaps the access patterns for the reference (hitting ~10^5 intervals for WES PreprocessIntervals/AnnotateIntervals and ~10^6 sites for WES/WGS CollectAllelicCounts, respectively) make localization a better strategy? @droazen does that sound right to you?

LeeTL1220 commented 6 years ago

@sooheelee I just confirmed that is not the case. Have not confirmed the exact change yet.

sooheelee commented 6 years ago

There is a video recording of Chris L saying so. I’ll get the link to it in a bit. Better yet, I'd like to hear about updates to these plans in person.

Sent from an iPhone and typed with my thumbs.

On May 25, 2018, at 9:18 AM, Lee Lichtenstein notifications@github.com wrote:

@sooheelee I just confirmed that is not the case. Have not confirmed the exact change yet.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

cjllanwarne commented 6 years ago

There's a change in progress that will let you "opt in" to have Cromwell treat a specific File input as a String for the purposes of localization and as a File for the purposes of call caching. It'll probably look something like (but this isn't necessarily the final syntax final yet) this:

task foo {
  input {
    File nio_me
    File dont_nio_me
  }
  parameter_meta {
    nio_me: {
      leave_in_cloud: true
    }
  }

  command {
    cat ~{dont_nio_me}
    java -jar gatk.jar --with-nio -i ~{nio_me}
  }
}
cjllanwarne commented 6 years ago

But this is still in progress so won't be in Cromwell 32 releasing today

sooheelee commented 6 years ago

Thanks @cjllanwarne for the clarification. @LeeTL1220, are you referring to the same feature or perhaps something else?

LeeTL1220 commented 6 years ago

I have talked with @cjllanwarne about this before and his post captures what I understood. This is not what I gathered from your (@sooheelee) previous post. Happy that plans have not changed.

On Fri, May 25, 2018, 11:17 sooheelee notifications@github.com wrote:

Thanks @cjllanwarne https://github.com/cjllanwarne for the clarification. @LeeTL1220 https://github.com/LeeTL1220, are you referring to the same feature or perhaps something else?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/4806#issuecomment-392090975, or mute the thread https://github.com/notifications/unsubscribe-auth/ACDXk0lhicfoSWcjmbGJq6zSZ5FH4bo-ks5t2CCegaJpZM4UMBDI .

sooheelee commented 6 years ago

@LeeTL1220, Chris is also my source of information. Seems there's been a misunderstanding of what is meant by "automatic" in that we will have to opt-in to the feature.

LeeTL1220 commented 6 years ago

@sooheelee Yeah, but we (WDL authors) need to opt-in per parameter. There are still files that we would want to localize even if a gs url is provided.

sooheelee commented 5 years ago

@cjllanwarne, how do I opt-in per parameter? I am trying to run the v4.0.11.0 gCNV WDL using Cromwell v36 and even after changing File type to String type for the relevant BAM/BAI inputs, I am getting a file-type not recognized error.

sooheelee commented 5 years ago

Nevermind Chris. I've found the documentation for the feature at https://cromwell.readthedocs.io/en/stable/optimizations/FileLocalization/.

sooheelee commented 5 years ago

I spoke too soon. No matter how I define the type, whether String or File, when I run womtools validate I get the error:

ERROR: Value for this attribute is expected to be a string:

        bam: {

I added the following parameter_meta field to the task:

screenshot 2018-11-08 18 02 03

How to correct this @cjllanwarne? Here's the relevant WDL: https://github.com/broadinstitute/gatk/blob/4.0.11.0/scripts/cnv_wdl/cnv_common_tasks.wdl

samuelklee commented 5 years ago

@sooheelee I found that on FC, it was only necessary to make the File -> String change at the task level (i.e., in CollectCounts). Not sure if this works due to the particular version of Cromwell on FC. In any case, I don't think you should stress too much about getting NIO working on your VM. Again, I'd expect the current non-NIO gCNV WDL to only take a few hours on FC, and then less than that once coverage collection is call cached. No use spending more time getting NIO working than it would save you, after all!

sooheelee commented 5 years ago

Thanks @samuelklee. I'm just waiting to be added to the broad-firecloud-dsde billing account to be able to use FireCloud.

sooheelee commented 5 years ago

I'm informed that I have access to the broad-firecloud-dsde billing account and it appears whatever error we encountered the other day was something transient.

mwalker174 commented 3 years ago

NIO streaming is currently used to avoid BAM localization in CollectCounts and CollectAllelicCounts, which is particularly beneficial when running on a limited set of intervals. It is also used in JointSegmentation to stream VCFs, which saves on disk space.

Additional tasks that could benefit from streaming are those that use counts files inputs: FilterIntervals, GermlineCNVCallerCaseMode, GermlineCNVCallerCohortMode, DetermineGermlineContigPloidyCaseMode, and DetermineGermlineContigPloidyCohortMode, CreateReadCountPanelOfNormals, and DenoiseReadCounts. These would require that read counts are in TSV format, which we should move to using exclusively (instead of HDF5).

samuelklee commented 3 years ago

@asmirnov239 note https://github.com/broadinstitute/gatk/pull/6266.