AdamaJava / adamajava

qsignature - allow positions file to be streamed #266

Closed holmeso closed 3 years ago

holmeso commented 3 years ago

Description

qsignature needs to be able to deal with positions files that are far larger than the 1.4 million positions it was originally designed to operate with. A recent positions file that was created from a GRCh37 gene model file had around 120 million positions.

Traditionally, qsignature would load all of the positions into memory (as VcfRecords) and then walk the input file, keeping tallies of the bases observed. This approach is not possible with 120 million positions, because of the amount of memory needed to hold them all at once.

Changes have therefore been introduced to allow the positions file to be streamed.

A positions package has been added, containing a PositionIterator abstract class which implements Iterable<ChrPosition>. There are currently 3 classes that extend this abstract class.

They all access the underlying positions data in a different way, but provide a uniform interface with which the calling classes can access the positions.
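As an illustration of the "uniform interface over different access strategies" idea, here is a minimal sketch. It is not the code from this change: `SketchPosition`, `PositionSource`, `InMemorySource` and `StreamingSource` are hypothetical stand-ins for ChrPosition, PositionIterator and its subclasses, and the tab-separated `contig<TAB>position` line layout is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for ChrPosition: just a contig name and a 1-based position.
final class SketchPosition {
    final String contig;
    final int position;
    SketchPosition(String contig, int position) { this.contig = contig; this.position = position; }
    @Override public String toString() { return contig + ":" + position; }
}

// Hypothetical stand-in for the abstract base: callers only ever see an Iterable.
abstract class PositionSource implements Iterable<SketchPosition> {
    // Assumed layout: tab-separated, contig in column 1, position in column 2.
    static SketchPosition parse(String line) {
        String[] fields = line.split("\t");
        return new SketchPosition(fields[0], Integer.parseInt(fields[1]));
    }
    // Header lines (starting with '#') and empty lines carry no position.
    static boolean isData(String line) { return !line.isEmpty() && !line.startsWith("#"); }
}

// Reads every position up front; memory grows with the size of the file.
final class InMemorySource extends PositionSource {
    private final List<SketchPosition> positions = new ArrayList<>();
    InMemorySource(Reader in) {
        try (BufferedReader br = new BufferedReader(in)) {
            for (String line; (line = br.readLine()) != null; ) {
                if (isData(line)) positions.add(parse(line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    @Override public Iterator<SketchPosition> iterator() { return positions.iterator(); }
}

// Holds one parsed position at a time; the underlying reader can only be walked once.
final class StreamingSource extends PositionSource {
    private final BufferedReader br;
    StreamingSource(Reader in) { this.br = new BufferedReader(in); }
    @Override public Iterator<SketchPosition> iterator() {
        return new Iterator<SketchPosition>() {
            private SketchPosition pending = advance();
            private SketchPosition advance() {
                try {
                    for (String line; (line = br.readLine()) != null; ) {
                        if (isData(line)) return parse(line);
                    }
                    return null; // end of input
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            @Override public boolean hasNext() { return pending != null; }
            @Override public SketchPosition next() {
                if (pending == null) throw new NoSuchElementException();
                SketchPosition current = pending;
                pending = advance();
                return current;
            }
        };
    }
}
```

Calling code can loop over either source with the same `for (SketchPosition p : source)` statement; the trade-off is that the streaming variant is single-pass and cannot be re-sorted after the fact.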

PositionIterator declares a sort method, which currently only the VcfInMemoryPositionIterator class implements. It allows the positions to be sorted into the contig order present in the BAM file.
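Sorting into BAM contig order can be sketched as follows. This is an assumption about the general technique, not the actual sort method: `ContigPos` and `ContigOrderSorter` are hypothetical names, the contig order is passed in as a plain list of names, and placing contigs that are absent from the BAM header last is a choice made purely for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ChrPosition.
final class ContigPos {
    final String contig;
    final int position;
    ContigPos(String contig, int position) { this.contig = contig; this.position = position; }
    @Override public String toString() { return contig + ":" + position; }
}

final class ContigOrderSorter {
    // Returns a new list ordered by the contig's rank in bamContigs, then by position.
    static List<ContigPos> sortByContigOrder(List<ContigPos> positions, List<String> bamContigs) {
        // Map each contig name to its index in the BAM header order.
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < bamContigs.size(); i++) {
            rank.put(bamContigs.get(i), i);
        }
        List<ContigPos> sorted = new ArrayList<>(positions);
        sorted.sort(Comparator
                // Assumption: contigs missing from the BAM header sort after all known contigs.
                .comparingInt((ContigPos p) -> rank.getOrDefault(p.contig, Integer.MAX_VALUE))
                .thenComparingInt(p -> p.position));
        return sorted;
    }
}
```

Ranking by header index rather than by name matters because BAM headers are often not in lexicographic order (for example chr1, chr2, chr10), and a sort of this kind needs the whole list in memory, which is consistent with only the in-memory implementation supporting it.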

It is not currently possible to read the data in a sorted manner using our in-house file readers, even if the positions file is sorted and tabix-indexed.

We may need to look into htsjdk's VCFFileReader class, which provides a query(final String chrom, final int start, final int end) method returning an iterator over the matching records.

A new option (--stream) has been added that will tell the SignatureGeneratorBespoke class which of the implementations to use.

Note that the positions are now being stored as ChrPosition objects rather than VcfRecord objects, purely to save space.

Type of change

How Has This Been Tested?

New unit tests have been added, and existing unit tests have been extended and modified.

qsig VCF files have been created with the updated code and compared against existing qsig VCF files; they are the same. Running the Compare process against qsig VCFs generated by the updated code gives the same results as running it against those generated by the existing code. When the new process is run against the same positions file, the md5sum generated (and put into the header) is the same as that produced by the existing code.
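For readers who want to reproduce the md5sum check independently, a checksum can be computed over a file without loading it into memory. The sketch below uses only the JDK; `Md5Sketch` and `md5Hex` are hypothetical names, and it makes no claim about how qsignature itself computes the value it writes to the header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Sketch {
    // Streams the input through an MD5 digest in fixed-size chunks and returns lowercase hex.
    static String md5Hex(InputStream in) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Passing `Files.newInputStream(path)` for the positions file gives a value that can be compared with the output of the `md5sum` command line tool.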

Are WDL Updates Required?

The qsignatureGeneratorBespoke.wdl task will need to be updated with an optional boolean input that adds the --stream option to the command line. This is a non-breaking change, however: the SDFTM WDL workflow does not require any modifications for the new code to function.

Checklist: