Why are most of the calculations done twice?

aeeckhou / shallowHRD

This method uses shallow Whole Genome Sequencing (sWGS) and the segmentation of a genomic profile to assess the Homologous Recombination Deficiency of a tumor based on the number of Large-scale Genomic Alterations (LGAs).

30 stars 13 forks source link

Why are most of the calculations done twice? #10

Closed lbeltrame closed 1 year ago

lbeltrame commented 1 year ago

Looking through the code (very messy and hard to read), at least for the QDNAseq steps, the code does the threshold calculation, then all the steps to merge and refine segments and ratios, then calls LGAs, then re-uses a previously optimized segment table to do a second run of threshold finding, before ultimately calculating LGAs again.

In the ~6000 LoC file, I wasn't able to find an explanation of why the steps are (mostly, not all) repeated twice. And of course the application note on Bioinformatics is extremely light on details. What is the exact purpose of this double step?

aeeckhou commented 1 year ago

Hello,

The optimized segmentation table allow for a more robust call in the second run of the algorithm. The first iteration creates an optimized segment table, based on which a better call of the CNA cut-off can be done and used to re-optimize the segmentation profile from the beginning. It might seems trivial but when working with the numerous clinical samples that I analyzed, it helped with low quality cases where the CNA cut-off was sometimes not called correctly.

The application note was based on the ShallowHRD v1.0, that was since optimized a lot. That is why it is not mentionned in the Application Note of Bioinformatics (that limitates a lot the number of pages you can write) as it was implemeted afterwards.

Of note the lab kept on improving the method since the 1.13 version that is in this github. This improved 2.0 version will be published sometimes this year.

Best regards, Alexandre Eeckhoutte

lbeltrame commented 1 year ago

Hello Alexandre,

thanks for the reply. Is the rationale the same for the merging operations done on the segments < 3kbp? I believe the "first iteration" (let's call it like that) merges segments in a certain way, but the second, after all the other rounds of threshold finding, is slightly different, am I right? (I'm talking about the larger while loops that do all the comparisons on where to place the segments and how to merge them).

aeeckhou commented 1 year ago

Hello,

For the 3kbp, I think you mean 3 Mbp! Below segments are considered small segments and above they are considered large segments. Small segments are composed of less reads and are less stable in terms of value than large segments. Small segments are integrated afterwards, to the large segments in an iteration, in a similar but slightly different fashion than large segments together.

The two iterations (first one to find an optimized CNA cut-off and second one to go all the way to the final segmentation and LGAs call) are mostly the same to small details.

Best, Alexandre Eeckhoutte

lbeltrame commented 1 year ago

Right, sorry! I had something else in mind while writing the reply. Thanks for the explanation. I guess this can be closed.