[User Story] Fix issue with filtered regions in CNVkit

mathiasbio commented 1 month ago

Need

As a clinician I want to find CNVs with as high resolution as possible. Currently some target capture bedfiles contain very small regions, usually 1 base, which correspond to CNV backbone SNV probe regions. CNVkit automatically filters these regions out and ignores them in the analysis.

See the issue in the target capture repo: https://github.com/Clinical-Genomics/target_capture_bed/issues/133 and the assessment: https://github.com/Clinical-Genomics/BALSAMIC/issues/1466

We need to increase the size of these regions, ideally in time for release 16, so that we can build the best possible PONs for the release.

Suggested approach

NEEDS REFINEMENT:

I don't know what the best approach is right now.

At the moment there are many bedfiles that are affected with this issue, but we don't need to prioritise all of them right now. The most relevant bedfiles that I could see are:

twistexomecomprehensive_10.2_hg19_design.bed (with around 62k 1bp regions)
gmcksolid_4.1_hg19_design.bed (with around 6k 1bp regions)

Out of these the exome bedfile is probably most important, especially as I have planned to build a PON for this workflow for release 16.
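To check how many regions in a given bedfile are affected, something like the following could be used. This is only a minimal sketch: the function name `count_short_regions` and the `min_size` parameter are made up for illustration, and it assumes a plain tab-separated BED file with 0-based, half-open coordinates.

```python
from typing import Iterable


def count_short_regions(bed_lines: Iterable[str], min_size: int = 2) -> int:
    """Count BED regions shorter than min_size bases.

    Hypothetical helper: BED coordinates are 0-based and half-open,
    so the region size is end - start.
    """
    count = 0
    for line in bed_lines:
        # Skip blank lines and the usual BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[1]), int(fields[2])
        if end - start < min_size:
            count += 1
    return count


# Made-up example data: two 1 bp probe regions and one 120 bp region.
example = [
    "1\t100\t101\tsnp_probe\n",
    "1\t200\t320\texon\n",
    "2\t50\t51\tsnp_probe\n",
]
print(count_short_regions(example))  # 2
```

For a real file the lines would come from `open(path)` instead of the in-memory list.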

The problem is how to implement the extension of the bed regions...

Considered alternatives

1. Change the bedfile in the target repo.

Pros:

The bedfile used to build the PON and used in the CNV analysis will be 1 to 1 with the bedfile in the target repo, which is clear and traceable

Cons:

If we cannot version the bedfile or make it balsamic-specific, we would need to coordinate with the RD side because (I THINK) they use the same bedfile and have already built a PON on it. Going the target_capture repo route would therefore probably delay fixing this for balsamic.
The SNV calling would also be extended into perhaps uninteresting regions. How significant this is depends on whether we use the probe regions (which are about 100 bp) or just adjust the size with a script to something like 20 bp to get around CNVkit filtering out the regions.

2. Change the bedfile in runtime in balsamic only for the CNVkit analysis

Pros:

It would only affect CNV calling, and we wouldn't need to worry about affecting current QC metrics and SNV calling
We wouldn't need to change the bedfile in the target capture repo and it would then be a faster implementation as we wouldn't need to coordinate with RD

Cons:

We have less control over the CNV calling and PON creation. The bedfile will be pre-processed by extending the regions to a minimum size of, let's say, 20 bp, and the PON needs to be built using the same pre-processing step, which we would need to keep track of in some way.
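A rough sketch of what the pre-processing step in option 2 could look like is shown here. It is not the actual BALSAMIC implementation: the function names `pad_region` and `pad_bed_lines` and the default of 20 bp are assumptions for illustration. Short regions are extended roughly symmetrically around their centre, clamped at position 0. The sketch does not clamp to chromosome ends and does not merge regions that overlap after padding, both of which a real implementation would probably need to consider.

```python
from typing import Iterable, Iterator, Tuple


def pad_region(start: int, end: int, min_size: int = 20) -> Tuple[int, int]:
    """Extend a region to at least min_size bases (hypothetical helper).

    Regions that are already large enough are returned unchanged.
    """
    size = end - start
    if size >= min_size:
        return start, end
    missing = min_size - size
    # Put about half of the padding on the left, never going below 0.
    new_start = max(0, start - missing // 2)
    new_end = new_start + min_size
    return new_start, new_end


def pad_bed_lines(lines: Iterable[str], min_size: int = 20) -> Iterator[str]:
    """Yield BED lines with short regions padded; extra columns are kept."""
    for line in lines:
        # Pass blank lines and header/comment lines through untouched.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        start, end = pad_region(int(fields[1]), int(fields[2]), min_size)
        fields[1], fields[2] = str(start), str(end)
        yield "\t".join(fields) + "\n"


print(pad_region(100, 101))  # (91, 111)
```

The same function would have to be applied both when building the PON and when running the CNV analysis, so that the two use identical regions.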

Deviation

No response

System requirements assessed

[ ] Yes, I have reviewed the system requirements

Requirements affected by this story

No response

Risk assessment needed

[ ] Needed
[ ] Not needed

Risk assessment

No response

SOUPs

No response

Can be closed when

No response

Blockers

No response

Anything else?

No response

mathiasbio commented 1 month ago

After refinement 2024-07-26 we still don't know what the best way forward is.

For option 1, updating the bedfile in the repo there are a few options:

We could add another layer of versioning, such as 10.2.1 where the 10.2 corresponds to the version of the bait-set, whereas the .1 to the version of the bed for the bait-set. This would require some changes to the way CG chooses the bedfile and PON for the balsamic config.
We can simply update the bedfile without changing the version, keeping it as 10.2. Since RD is getting their bedfiles from somewhere else this would be fine. However, we would end up with two bedfiles with the same name but different contents, which could get confusing down the line and lead to mistakes. Note by @fevac: we would lose full traceability with this option.
We can change the version to 10.3. But that would mean that we would need to change the version in LIMS for these samples that are going to balsamic, but keep them as 10.2 for RD. It would mean having different bait versions without having changed the bait-set, which could also get confusing. Note by @fevac: Alternatively we can ask RD to upgrade to 10.3 to keep it synced.

For option 2, we could proceed as outlined, by adding a pre-processing step for the bedfile before running CNVkit. It would however mean adding an extra layer of documentation to ensure that we don't update this pre-processing step in the pipeline without also updating the PON. Note by @fevac: this is also not the preferred option if we want these changes and information about these issues to be incorporated in future panels. If we only change it on the balsamic side it would be prone to getting lost.
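One possible way to keep the pre-processing step and the PON in sync, sketched here purely as an idea and not something decided in this thread, is to store a checksum of the pre-processed bedfile next to the PON and compare it at runtime. The names `bed_fingerprint` and `pon_matches_bed` are hypothetical.

```python
import hashlib
from typing import Iterable


def bed_fingerprint(bed_lines: Iterable[str]) -> str:
    """Return a SHA-256 hex digest of the (pre-processed) BED content."""
    digest = hashlib.sha256()
    for line in bed_lines:
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


def pon_matches_bed(recorded_fingerprint: str, bed_lines: Iterable[str]) -> bool:
    """Check that the BED used now is the one recorded when the PON was built."""
    return recorded_fingerprint == bed_fingerprint(bed_lines)


# Made-up example: the fingerprint recorded at PON build time ...
padded_at_build = ["1\t91\t111\tsnp_probe\n"]
recorded = bed_fingerprint(padded_at_build)

# ... no longer matches if the padding step later changes (here 20 bp -> 100 bp).
padded_now = ["1\t51\t151\tsnp_probe\n"]
print(pon_matches_bed(recorded, padded_at_build))  # True
print(pon_matches_bed(recorded, padded_now))       # False
```

A failed check would signal that the PON needs to be rebuilt before the analysis is run.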

fevac commented 1 month ago

Added some notes above.

Also, would it be too complicated to store the panel file (or file path) in the reference folder per release instead of getting it from the repo (similar to what RD does)? Then maybe we could change it per balsamic release without affecting so many systems in the future (and uncoupling it a bit). But I bet this comes with other issues too

mathiasbio commented 1 month ago

Nice 🙏 as a note on the note on option 2 we could definitely add instructions in the create target panel too, to avoid these situations in the future.

We have a couple of different reference folders. There's the one with the databases like production/cancer/reference and there's the one in the balsamic cache production/cancer/balsamic_cache/[version]

At the moment the bedfiles are retrieved from the repo in the production/cancer/reference folder, which I think CG parses for the balsamic config case argument to find the right bedfile. But I guess you're suggesting to add it to the balsamic init argument to download the bedfiles to the balsamic cache together with the other references. I think that sounds like a nice idea if it can be done. It would be nice to have the bedfiles that were used in a certain version of balsamic saved somewhere 🤔 but I don't know if I know all the benefits of it! I think @ivadym understands the pros and cons of this better than me

ivadym commented 1 month ago

Yes, MIP uses a centralized config file in Servers, but it still retrieves the panel bed from LIMS to generate its run config, exactly what we are currently doing in Balsamic.

We have a versioning system for the PONs, so an option could be to extend that versioning to the target capture beds as well. For example, taking the latest target capture bed (twistexomecomprehensive_10.2_hg19_design_v2.bed) and matching it to the corresponding PON (twistexomecomprehensive_10.2_hg19_design_v2.bed_CNVkit_PON_reference_v100.cnn). We would only implement this change to feed the Balsamic CLI, and the rest of the code should remain unaffected. I can look more into this if we discard incrementing the versioning in the target capture bed repo option
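The naming convention in the example above could be sketched as follows. The helper names `pon_name` and `split_pon_name` and the regular expression are assumptions based only on the single filename given in this comment, not on the actual Balsamic or CG code.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the example filename in this thread (an assumption).
PON_PATTERN = re.compile(
    r"^(?P<bed>.+\.bed)_CNVkit_PON_reference_v(?P<pon_version>\d+)\.cnn$"
)


def pon_name(bed_name: str, pon_version: int) -> str:
    """Build the PON filename that belongs to a given (versioned) bedfile."""
    return f"{bed_name}_CNVkit_PON_reference_v{pon_version}.cnn"


def split_pon_name(name: str) -> Optional[Tuple[str, int]]:
    """Return (bed filename, PON version), or None if the name does not match."""
    match = PON_PATTERN.match(name)
    if match is None:
        return None
    return match.group("bed"), int(match.group("pon_version"))


bed = "twistexomecomprehensive_10.2_hg19_design_v2.bed"
print(pon_name(bed, 100))
# twistexomecomprehensive_10.2_hg19_design_v2.bed_CNVkit_PON_reference_v100.cnn
```

With a lookup like this, the bedfile a PON was built on can be read directly from the PON filename.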

mathiasbio commented 1 month ago

Hmm so in this example is twistexomecomprehensive_10.2_hg19_design_v2.bed_CNVkit_PON_reference_v100.cnn the 100th PON that has been built on the bedfile twistexomecomprehensive_10.2_hg19_design_v2.bed? This sounds like the same strategy as the 10.2.1 idea, but instead of adding a new versioning strategy we're re-using the one that we already have for PONs? To me this sounds like a nice solution. I think I prefer this to incrementing the minor version, since we're so used to the minor in the bedfile name corresponding to a new bait-design in the lab. This way we can preserve that way of versioning and add a new layer for the bedfiles.

mathiasbio commented 1 month ago

I wonder what minimum size to extend to would be best:

Comparison (images not included here): the original regions, followed by regions extended to 20 bp, followed by regions extended to 100 bp.

mathiasbio commented 1 month ago

I don't really know how CNVkit works, but instinctively it feels better if the bed region at least roughly corresponds to the region we're sequencing. Honestly it probably doesn't matter much! It might slightly increase the number of variants and decrease the % of off-target reads in the QC :D

mathiasbio commented 1 month ago

If we decide to go with the new versioning strategy for these bedfiles I have prepared this PR: https://github.com/Clinical-Genomics/target_capture_bed/pull/134

mathiasbio commented 1 month ago

Sounds like we're going for the strategy of dynamically padding the bedfile before CNVkit instead: https://github.com/Clinical-Genomics/BALSAMIC/pull/1469
