This PR contains 2 new classes to create unmapped BAMs from fastq files, and to create fastq files from mapped BAM/CRAM files.
They are effectively clones of picard's classes (FastqToSam and SamToFastq).
What sets them apart from the picard classes is that FastqToSamWithHeaders will attempt to preserve any additional information from the FastqRecord header in the SAMRecord.
When writing back to fastq, SamToFastqWithHeaders returns these additional headers (if present) to the FastqRecord header. This should make it possible to accurately recreate the fastq files that were used to make the BAM/CRAM, so the original fastq files can be deleted to save disk space.
FastqToSamWithHeaders will write any additional header information into two user-defined tags:
ZH - for any extra Header information
ZT - for any bases and qualities that have been Trimmed from the read (a separate process is responsible for trimming the bases and adding them to the FastqRecord header)
When SamToFastqWithHeaders is called, the ZH value is added back to the header and the ZT bases and qualities are added back to the read and quality strings.
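The round trip described above can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR and not the picard/htsjdk API: the `Fastq` and `Sam` records are hypothetical stand-ins, the `TRIM=<bases>/<quals>` header token is an invented encoding for the trimmed bases (the real format is defined by the separate trimming process), and it assumes the trimmed bases came from the end of the read.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the ZH/ZT round trip; none of these types come from picard or htsjdk.
class HeaderRoundTripSketch {
    // Stand-in for a fastq record: header line (without '@'), bases, qualities.
    record Fastq(String header, String bases, String quals) {}

    // Stand-in for an unmapped SAM record with string-valued tags.
    record Sam(String name, String bases, String quals, Map<String, String> tags) {}

    // ASSUMPTION: the trimming step appends a final header token "TRIM=<bases>/<quals>".
    static final String TRIM_PREFIX = "TRIM=";

    // fastq -> SAM: read name is the first token; the rest goes into ZH and ZT.
    static Sam toSam(Fastq fq) {
        Map<String, String> tags = new HashMap<>();
        String[] parts = fq.header().split(" ", 2);
        String extra = parts.length > 1 ? parts[1] : null;
        if (extra != null) {
            int lastSpace = extra.lastIndexOf(' ');
            String lastToken = extra.substring(lastSpace + 1);
            if (lastToken.startsWith(TRIM_PREFIX)) {
                // Trimmed bases and qualities are kept together in ZT.
                tags.put("ZT", lastToken.substring(TRIM_PREFIX.length()));
                extra = lastSpace < 0 ? null : extra.substring(0, lastSpace);
            }
            if (extra != null) {
                tags.put("ZH", extra); // any remaining header text
            }
        }
        return new Sam(parts[0], fq.bases(), fq.quals(), tags);
    }

    // SAM -> fastq: ZH goes back into the header, ZT back onto the read and qualities.
    static Fastq toFastq(Sam sam) {
        String header = sam.name();
        String bases = sam.bases();
        String quals = sam.quals();
        String extra = sam.tags().get("ZH");
        if (extra != null) {
            header = header + " " + extra;
        }
        String trimmed = sam.tags().get("ZT");
        if (trimmed != null) {
            // Bases never contain '/', so the first '/' separates bases from qualities.
            int slash = trimmed.indexOf('/');
            if (slash < 0) {
                throw new IllegalArgumentException("Malformed ZT value: " + trimmed);
            }
            bases = bases + trimmed.substring(0, slash); // ASSUMPTION: trimmed from the read end
            quals = quals + trimmed.substring(slash + 1);
        }
        return new Fastq(header, bases, quals);
    }

    public static void main(String[] args) {
        Fastq trimmedRead = new Fastq("read1 1:N:0:ACGT TRIM=GG/II", "ACGT", "FFFF");
        Sam sam = toSam(trimmedRead);
        System.out.println(sam.tags());
        System.out.println(toFastq(sam));
    }
}
```

In this sketch the regenerated record carries the untrimmed read and the original extra header text, which is the property that would allow the source fastq files to be deleted.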
Type of change
[X] New feature (non-breaking change which adds functionality)
How Has This Been Tested?
New unit test classes have been included as part of this PR.
These classes have also been run as part of a modified FTUB_WGGSS workflow, which produced the same number of hard-filtered VCF records (on the GS NA12878 dataset) as the existing FTUB_WGGSS workflow. The CRAM/BAM can be converted back to a fastq file with the same records as the original.
Are WDL Updates Required?
No WDL updates are required, although the expectation is that once this is in production, the FTUB_WGGSS workflow (or a new one based on it) will be updated to call these new classes.
Checklist:
[X] My code follows the style guidelines of this project
[X] I have performed a self-review of my own code
[X] I have commented my code, particularly in hard-to-understand areas
[X] My changes generate no new warnings
[X] I have added tests that prove my fix is effective or that my feature works
[X] New and existing unit tests pass locally with my changes