andersen-lab / Freyja

Depth-weighted De-Mixing
BSD 2-Clause "Simplified" License

depthcutoff Questions #167

Closed: whottel closed this issue 11 months ago

whottel commented 11 months ago

Hello,

I am interested in understanding more about the --depthcutoff parameter added in version 1.4.5, and in choosing a better default value, at least for my data. I noticed that some samples produced the following error in v1.4.5: "demix: Solver error encountered, most likely due to insufficient sequencing depth. Try increasing the --depthcutoff parameter." These same samples generated a demix file as normal with v1.4.4. Additionally, depending on the chosen cutoff, the v1.4.5 top lineages vary and differ from the v1.4.4 result. Please find attached the depths and variants files for an example impacted sample, along with a summary table showing the demix output across versions/parameters:

2312877-SC2WW-IA-VH01284-230809_S82_variants.tsv.xlsx
2312877-SC2WW-IA-VH01284-230809_S82_depths.tsv.xlsx
freyja_demix_comparison.xlsx

I am thinking I should not use the default value of 0 for the cutoff parameter, since this causes previously acceptable samples (81.35% Freyja coverage in this case) to fail.

Thanks, Wes

dylanpilz commented 11 months ago

Hey Wes,

Thanks for raising this and providing example inputs/outputs for the different versions!

--depthcutoff was added to address an issue users were experiencing with the solver failing to converge on a solution for low coverage samples with the given barcodes file. We regularly update this barcodes file with new lineages as they come up, so it's a good idea to occasionally run freyja update to ensure you have the most recent barcodes.

In some instances (#137), the sample coverage is high enough to work with some versions of the barcode file, but fails with others. My guess is that when you updated to v1.4.5, it came with an up-to-date usher_barcodes.csv, which is now producing a solver error despite the sample working fine with the previous barcode version. Could you verify that you're running demix with the same barcodes file between the two versions? To get an earlier barcode file, select a previous "updating barcodes and metadata" commit here, and download the corresponding freyja/data/usher_barcodes.csv. You can then pass this custom barcode file into demix via the --barcodes option. I'll try to reproduce the error as well using a few different barcode versions.

To your point regarding the varying output for when using different values for --depthcutoff, this option finds SNVs in the barcode file where the sequencing depth is below the specified cutoff value. These sites are then removed from the barcodes, which in many cases results in multiple lineages having the same barcode. These lineages are subsequently grouped into higher-order barcodes based on their shared phylogeny. For your sample using --depthcutoff 30, XBB.1.5.24 and XBB.1.5.28 are being grouped into XBB.1.5-like. However, when you use --depthcutoff 10, the two sub-lineages are still distinguishable from one another, resulting in them both being listed in the output.
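To make the collapsing behavior concrete, here is a toy sketch of the logic described above (this is not Freyja's actual implementation; the lineages, sites, and depths are made up to mirror the XBB.1.5.24/XBB.1.5.28 example):

```python
# Toy illustration of --depthcutoff: sites below the cutoff are dropped
# from the barcodes, and lineages whose remaining barcodes are identical
# become indistinguishable (Freyja then groups them phylogenetically,
# e.g. into "XBB.1.5-like"). All values here are hypothetical.

# Barcodes: lineage -> {site: mutation present (1) or absent (0)}
barcodes = {
    "XBB.1.5.24": {100: 1, 200: 1, 300: 0},
    "XBB.1.5.28": {100: 1, 200: 1, 300: 1},
    "BA.2":       {100: 0, 200: 1, 300: 0},
}
depths = {100: 50, 200: 40, 300: 20}  # per-site sequencing depth

def collapse(barcodes, depths, cutoff):
    # Keep only sites whose depth meets the cutoff
    kept = [s for s in sorted(depths) if depths[s] >= cutoff]
    # Group lineages whose truncated barcodes are now identical
    groups = {}
    for lineage, bc in barcodes.items():
        key = tuple(bc[s] for s in kept)
        groups.setdefault(key, []).append(lineage)
    return list(groups.values())

# cutoff=10: site 300 (depth 20) survives, so the two XBB sub-lineages
# remain distinguishable and are reported separately.
print(collapse(barcodes, depths, cutoff=10))
# cutoff=30: site 300 is dropped, so XBB.1.5.24 and XBB.1.5.28 share a
# barcode and end up merged into one group.
print(collapse(barcodes, depths, cutoff=30))
```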

Your chosen --depthcutoff value is essentially a tradeoff between accuracy and specificity in the final lineage classifications, but 10 should be a reasonable place to start.

-Dylan

whottel commented 11 months ago

Hi Dylan,

Thanks for your response. I do have a follow-up question. Would it be the case that in previous versions, before the --depthcutoff option was added, variant sites with only 1x read depth would be included in the lineage abundance calculation, or was there some other minimum read depth?

Thanks, Wes

ybdong919 commented 11 months ago

Generally, how much sequencing depth is enough?

dylanpilz commented 11 months ago

> Would it be the case that in previous versions, before the --depthcutoff option was added, variant sites with only 1x read depth would be included in the lineage abundance calculation, or was there some other minimum read depth?

Yes, that's correct: prior to this feature there wasn't any exclusion threshold based on coverage.

whottel commented 11 months ago

Great, thanks for the clarification.

joshuailevy commented 11 months ago

@ybdong919 There isn't really a set threshold that is "enough". The answer tends to depend on what you're trying to infer: you'll need more coverage if you want to recover lineage-level frequencies, but if you just want VoC frequencies you can work with less coverage. As a heuristic, 60% genome coverage at 10x read depth is usually OK, but results will depend strongly on which specific regions of the genome are covered. The lineage collapse functionality, enabled via the --depthcutoff parameter in demix, can be useful for figuring out which lineages can be differentiated given the available coverage.
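As a quick sanity check against that heuristic, you can compute the fraction of genome positions at or above a given depth directly from a depths file. This is a hedged sketch, not a Freyja command: it assumes a tab-separated file with the read depth in the last column (as in samtools mpileup-style output), and uses the SARS-CoV-2 reference length of 29903 as the default denominator.

```python
# Sketch: fraction of genome positions covered at >= min_depth,
# assuming a TSV depths file whose last column is read depth.
import csv

def coverage_fraction(depths_path, min_depth=10, genome_length=29903):
    covered = 0
    with open(depths_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if row and int(row[-1]) >= min_depth:
                covered += 1
    return covered / genome_length
```

If this returns well under 0.6 at min_depth=10, the 60%-at-10x rule of thumb above suggests the sample may be too shallow for reliable lineage-level de-mixing.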