Description
qsignature needs to handle positions files that are much larger than the 1.4 million positions it was originally designed for.
A recent positions file created from a GRCh37 gene model file had around 120 million positions.
Traditionally, qsignature would load all of the positions into memory (as VcfRecords) and then walk the input file, keeping tallies of the bases observed.
This approach is not feasible with 120 million positions because of the amount of memory that would be required.
Changes have therefore been introduced to allow the positions file to be streamed.
A positions package has been added, containing a PositionIterator abstract class which implements Iterable<ChrPosition>.
There are currently 3 classes that extend this abstract class:
VcfStreamPositionIterator
VcfInMemoryPositionIterator
GeneModelInMemoryPositionIterator
They all access the underlying positions data in a different way, but provide a uniform interface with which the calling classes can access the positions.
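The arrangement described above can be pictured with a minimal sketch. It is not the code from this PR: ChrPosition is modelled as a hypothetical two-field record, and InMemoryPositions and StreamedPositions are illustrative stand-ins that read a simplified tab-separated format rather than real VCF or gene model files. The point is only that calling code iterates over either variant in the same way.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class PositionIteratorDemo {
    // Works with any implementation because it only uses the Iterable contract.
    static int countPositions(PositionIterator positions) {
        int count = 0;
        for (ChrPosition ignored : positions) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        PositionIterator inMemory = new InMemoryPositions(
                List.of(new ChrPosition("chr1", 100), new ChrPosition("chr2", 50)));
        PositionIterator streamed = new StreamedPositions(
                new BufferedReader(new StringReader("#header\nchr1\t100\nchr2\t50\n")));
        System.out.println(countPositions(inMemory) + " " + countPositions(streamed));
    }
}

// Hypothetical stand-in for the project's ChrPosition type.
record ChrPosition(String contig, int position) {}

// Callers depend only on Iterable<ChrPosition>, not on how positions are obtained.
abstract class PositionIterator implements Iterable<ChrPosition> {}

// Illustrative in-memory variant: every position is held in a list.
class InMemoryPositions extends PositionIterator {
    private final List<ChrPosition> positions;

    InMemoryPositions(List<ChrPosition> positions) {
        this.positions = new ArrayList<>(positions);
    }

    @Override
    public Iterator<ChrPosition> iterator() {
        return positions.iterator();
    }
}

// Illustrative streaming variant: parses one "contig<TAB>position" line at a time,
// so only the current line is held in memory. It can be iterated once only.
class StreamedPositions extends PositionIterator {
    private final BufferedReader reader;

    StreamedPositions(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public Iterator<ChrPosition> iterator() {
        return new Iterator<ChrPosition>() {
            private String nextLine = readDataLine();

            // Skips blank lines and '#' header lines; returns null at end of input.
            private String readDataLine() {
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (!line.isEmpty() && !line.startsWith("#")) {
                            return line;
                        }
                    }
                    return null;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                return nextLine != null;
            }

            @Override
            public ChrPosition next() {
                if (nextLine == null) {
                    throw new NoSuchElementException();
                }
                String[] fields = nextLine.split("\t");
                nextLine = readDataLine();
                return new ChrPosition(fields[0], Integer.parseInt(fields[1]));
            }
        };
    }
}
```

The trade-off in this sketch is that the streaming variant keeps memory use flat but can only be walked once and in file order, whereas the in-memory variant can be re-iterated and reordered.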
There is a sort method in the PositionIterator that currently only the VcfInMemoryPositionIterator class implements. This allows the positions to be sorted in the contig order present in the BAM file.
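As an illustration of sorting by BAM contig order, the sketch below ranks each contig by its index in a list of contig names (such as the one found in a BAM header) and then sorts by position within a contig. The ChrPosition record, the sortByContigOrder helper, and the choice to place contigs missing from the list last are all assumptions made for the example, not the behaviour of the actual sort method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ContigOrderSortDemo {
    // Returns a new list ordered by the contig's index in bamContigOrder, then by position.
    static List<ChrPosition> sortByContigOrder(List<ChrPosition> positions, List<String> bamContigOrder) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < bamContigOrder.size(); i++) {
            rank.putIfAbsent(bamContigOrder.get(i), i);
        }
        List<ChrPosition> sorted = new ArrayList<>(positions);
        sorted.sort(Comparator
                // Assumption: contigs that are absent from the BAM header sort last.
                .comparingInt((ChrPosition p) -> rank.getOrDefault(p.contig(), Integer.MAX_VALUE))
                .thenComparingInt(ChrPosition::position));
        return sorted;
    }

    public static void main(String[] args) {
        List<ChrPosition> positions = List.of(
                new ChrPosition("chr10", 5),
                new ChrPosition("chr2", 7),
                new ChrPosition("chr1", 9),
                new ChrPosition("chr1", 3));
        // A plain string sort would put chr10 before chr2; the header order does not.
        System.out.println(sortByContigOrder(positions, List.of("chr1", "chr2", "chr10")));
    }
}

// Hypothetical stand-in for the project's ChrPosition type.
record ChrPosition(String contig, int position) {}
```

Sorting this way needs every position to be available at once, which is consistent with only the in-memory implementation supporting it.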
It is not currently possible to read the data in sorted order using our in-house file readers, even if the positions file is sorted and tabix-indexed.
We may need to look into htsjdk's VCFFileReader class, which provides a query(final String chrom, final int start, final int end) method returning an iterator over the matching records.
A new option (--stream) has been added that will tell the SignatureGeneratorBespoke class which of the implementations to use.
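The effect of the option can be sketched as a simple switch between two modes. This is a hypothetical illustration only: the PositionsMode enum and modeFor helper are invented for the example, and how SignatureGeneratorBespoke actually parses its options is not shown in this description. The sketch assumes that in-memory loading remains the behaviour when --stream is absent.

```java
import java.util.Arrays;

class StreamOptionDemo {
    // Hypothetical modes: STREAM would map to a streaming implementation such as
    // VcfStreamPositionIterator, IN_MEMORY to one of the in-memory implementations.
    enum PositionsMode { STREAM, IN_MEMORY }

    // Assumption: without --stream, positions are loaded into memory as before.
    static PositionsMode modeFor(String[] args) {
        return Arrays.asList(args).contains("--stream")
                ? PositionsMode.STREAM
                : PositionsMode.IN_MEMORY;
    }

    public static void main(String[] args) {
        System.out.println(modeFor(args));
    }
}
```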
Note that the positions are now being stored as ChrPosition objects rather than VcfRecord objects, purely to save space.
Type of change
Please delete options that are not relevant.
[X] New feature (non-breaking change which adds functionality)
[X] This change requires a documentation update
How Has This Been Tested?
New unit tests have been added, and existing unit tests have been extended and modified.
qsig VCF files generated by the updated code have been compared against existing qsig VCF files and found to be the same.
Running the Compare process against qsig VCFs generated by the updated code has given the same results as those generated by the existing code.
When running the new process against the same positions file, the md5sum generated (and put into the header) is the same as when run using the existing code.
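For reference, a checksum of this kind can be computed without loading the whole file into memory. The sketch below is an assumption about the general approach, not the code in this PR: it takes the MD5 over the raw bytes of an input stream in fixed-size chunks, and the Md5Demo class and md5Hex helper are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Demo {
    // Feeds the stream to the digest in 8 KiB chunks and returns lowercase hex.
    static String md5Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Convenience overload for a file on disk, e.g. a positions file.
    static String md5Hex(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return md5Hex(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length > 0) {
            System.out.println(md5Hex(Path.of(args[0])));
        } else {
            byte[] sample = "abc".getBytes(StandardCharsets.UTF_8);
            System.out.println(md5Hex(new ByteArrayInputStream(sample)));
        }
    }
}
```

Because the digest depends only on the bytes read, the result is the same whether the file is streamed or loaded in full, which is what the comparison above relies on.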
Are WDL Updates Required?
The qsignatureGeneratorBespoke.wdl task will need to be updated to accept an optional boolean that adds the --stream option when set.
That said, this is a non-breaking change: the SDFTM WDL workflow does not require any modifications for the new code to function.
Checklist:
[X] My code follows the style guidelines of this project
[X] I have performed a self-review of my own code
[X] I have commented my code, particularly in hard-to-understand areas
[X] I have made corresponding changes to the documentation
[X] My changes generate no new warnings
[X] I have added tests that prove my fix is effective or that my feature works
[X] New and existing unit tests pass locally with my changes