broadinstitute / gatk-dataflow

Development dataflow
BSD 3-Clause "New" or "Revised" License

Implement temporary way to write large BAM output in dataflow #21

Open akiezun opened 9 years ago

akiezun commented 9 years ago

From @droazen on July 30, 2015 19:40

Until https://github.com/broadinstitute/hellbender/issues/621 is implemented, we need a quick hack for this in order to effectively test the ReadsPreprocessingPipeline:

- Need a way to convert a GATKRead into a SAM record line (https://github.com/broadinstitute/hellbender/issues/618).

- Add an option to ReadsPreprocessingPipeline to bypass SmallBamWriter and instead apply a PCollection<GATKRead> -> PCollection<String> transform, write the result with TextIO.Write().to(output).withSuffix() (might need to disable sharding), then prepend a header, sort the SAM, and convert it to BAM.
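A minimal sketch of the text-side steps described above (format each read as a SAM line, then prepend the header to the concatenated output), in plain Java with no Dataflow or hellbender dependencies. The class name `SamTextAssembler`, both method names, and the flat parameter list are hypothetical and stand in for whatever the GATKRead interface actually exposes. Mate fields are hard-coded as unpaired, optional tags are omitted, and sorting and BAM conversion are left to the downstream tools.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of hellbender: illustrates the text handling only.
final class SamTextAssembler {

    private SamTextAssembler() {}

    /**
     * Formats the 11 mandatory SAM fields as one tab-separated line.
     * Assumption: the read is unpaired, so RNEXT is "*" and PNEXT and TLEN are 0.
     */
    static String toSamLine(String name, int flags, String contig, int start,
                            int mapq, String cigar, String bases, String quals) {
        return String.join("\t",
                name,
                Integer.toString(flags),
                contig,
                Integer.toString(start),
                Integer.toString(mapq),
                cigar,
                "*",   // RNEXT: no mate information in this sketch
                "0",   // PNEXT
                "0",   // TLEN
                bases,
                quals);
    }

    /**
     * Returns the header lines followed by every record from every shard,
     * in shard order. The result is unsorted.
     */
    static List<String> assemble(List<String> headerLines, List<List<String>> shards) {
        List<String> out = new ArrayList<>(headerLines);
        for (List<String> shard : shards) {
            out.addAll(shard);
        }
        return out;
    }
}
```

In the pipeline itself, the first method would correspond to the body of the PCollection<GATKRead> -> PCollection<String> transform, and the second to the post-processing applied to whatever files the text write produces.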

Copied from original issue: broadinstitute/hellbender#771

akiezun commented 9 years ago

From @droazen on August 3, 2015 15:1

More details:

We can use hellbender's SortSAM to sort the SAM output after concatenation, and hellbender's PrintReads to convert from SAM to BAM.