abyzovlab / CNVpytor

a python extension of CNVnator -- a tool for CNV analysis from depth-of-coverage by mapped reads
MIT License

Strange results for global manhattan plots #205

Closed lucsnip closed 8 months ago

lucsnip commented 8 months ago

I have run an RD analysis on HiFi long-read sequence data for bin sizes of 100 bp, 1 kb, 10 kb, and 100 kb. The 100 kb plot looks reasonable, but the others look strange. The 10 kb plot has a much broader variance than the 100 kb plot, and the 1 kb and 100 bp plots just look bad. I believe I have followed all the instructions properly, so I am not sure what would cause this. I have included the images below.

100kb: BMK_100kb_CNV

10kb: BMK_10kb_CNV

1kb: BMK_manplot_1kb

100bp: BMK_manplot_100bp

arpanda commented 8 months ago

What is the sequencing coverage, and is it low? Please review the results of the stat command with the following syntax:

cnvpytor -root <pytor file> -stat <bin size>

-Arijit

lucsnip commented 8 months ago

Hi Arijit,

The coverage should be 30x. The stat output is quite long. What am I looking for? I notice it is giving these warnings occasionally while the program is running:

cnvpytor.utils - WARNING - Problem with fit: Runtime Error. Using mean and std instead fitting parameters!
cnvpytor.utils - WARNING - Problem with fit: insufficient data points. Using mean and std instead fitting parameters!

arpanda commented 8 months ago

Yes, you are correct. The RD fit did not work properly for some of those bins. Please examine the fitting curves in view mode:

cnvpytor -root <pytor file> -view <bin size>
cnvpytor> rdstat

This could assist in explaining the reason behind the misfit.

lucsnip commented 8 months ago

Here is the output for bin size 100

cnvpytor -conf mm10_ref_conf.py -root B6MaleKidney_mm10_masked_rd.pytor -view 100
2024-01-08 18:30:58,011 - cnvpytor.genome - INFO - Reading configuration file 'mm10_ref_conf.py'.
2024-01-08 18:30:58,011 - cnvpytor.genome - INFO - Importing reference genome data: 'mm10'.
cnvpytor> rdstat
2024-01-08 18:31:13,604 - cnvpytor.viewer - INFO - RD stat for Autosomes: 1.11 +- 0.34
2024-01-08 18:31:13,629 - cnvpytor.viewer - INFO - RD stat for X/Y: 1.06 +- 0.26
2024-01-08 18:31:13,650 - cnvpytor.viewer - INFO - RD stat for Mitochondria: 137.71 +- 29.64
2024-01-08 18:31:13,650 - cnvpytor.viewer - INFO - RD stat for Mitochondria - number of mitochondria per cell: 249.16 +- 91.84

lucsnip commented 8 months ago

Additionally, my data are PacBio long-read DNA sequences. Is that likely to be a source of the problems with the fit?

arpanda commented 8 months ago

Could you double-check if the sequencing coverage is 30x? If so, can you confirm whether this coverage applies to the entire genome or a targeted panel?

-Arijit

lucsnip commented 8 months ago

Yes, the sequencing coverage is 30x, and whole genome, not targeted. Is it possible the default settings are not optimized for long reads?

arpanda commented 8 months ago

I just realized that the bin size needs to be larger than the read length for the read depth analysis to work reasonably well. Because you are using long-read data, the small bin sizes (100 bp, 1 kb, 10 kb) do not work properly. Relying solely on the read depth-based CNVpytor approach means small events and many details may be missed.
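As a rough illustration of why small bins struggle with long reads, here is a minimal back-of-the-envelope sketch. It assumes that each read contributes one count to a single bin and that counts are Poisson-distributed; both are simplifications made for this example, not a description of CNVpytor's internals. The 15 kb read length is a hypothetical value for HiFi data, and the function names are made up for illustration.

```python
import math

# Assumptions for illustration only (not taken from CNVpytor's source):
# each read adds one count to exactly one bin, and counts are Poisson.
COVERAGE = 30          # from the thread: 30x whole genome
READ_LENGTH = 15_000   # hypothetical mean HiFi read length in bp


def expected_reads_per_bin(coverage, read_length, bin_size):
    """Mean number of reads assigned to one bin."""
    return coverage * bin_size / read_length


def relative_noise(mean_count):
    """Poisson standard deviation divided by the mean."""
    return 1.0 / math.sqrt(mean_count)


for bin_size in (100, 1_000, 10_000, 100_000):
    mean_count = expected_reads_per_bin(COVERAGE, READ_LENGTH, bin_size)
    print(f"{bin_size:>7} bp: {mean_count:7.1f} reads/bin, "
          f"relative noise {relative_noise(mean_count):.2f}")
```

Under these assumptions a 100 bp bin receives about 0.2 reads on average, so the noise is more than twice the mean, while a 100 kb bin receives about 200 reads with roughly 7% noise.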

I would recommend incorporating a BAF-based approach by importing variant information. Following this, you can cross-reference the results with the read depth-based data to confirm events.
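To make the cross-referencing idea concrete, the sketch below lists the read depth ratio and B-allele frequency one would expect at a heterozygous site for a few simple copy-number states. It assumes a diploid baseline (autosomes only) and a pure, non-mosaic sample. The function and state names are hypothetical and are not part of CNVpytor.

```python
def expected_signals(copies_a, copies_b):
    """Return (RD ratio relative to 2 copies, B-allele frequency)."""
    total = copies_a + copies_b
    if total == 0:
        raise ValueError("BAF is undefined when no copies remain")
    return total / 2, copies_b / total


# Hypothetical copy-number states at a site that is heterozygous (A/B)
# in the normal diploid genome: (copies of allele A, copies of allele B).
STATES = {
    "normal": (1, 1),
    "heterozygous deletion": (0, 1),
    "duplication": (1, 2),
    "copy-neutral LOH": (0, 2),
}

for name, (copies_a, copies_b) in STATES.items():
    rd_ratio, baf = expected_signals(copies_a, copies_b)
    print(f"{name:<22} RD ratio {rd_ratio:.2f}  BAF {baf:.2f}")
```

An event is better supported when both signals agree, for example an RD ratio near 1.5 together with BAF near 1/3 or 2/3 for a duplication. Copy-neutral loss of heterozygosity changes BAF but leaves read depth unchanged, which is why read depth alone cannot detect it.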

lucsnip commented 8 months ago

That makes sense. However, I did notice that the fit warnings still happen with the 100 kb bin size.