abyzovlab / CNVnator

a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads

normalized_RD is "inf" #170

Closed BaiqiFu closed 3 years ago

BaiqiFu commented 5 years ago

Hi, I'm using CNVnator (version 0.4.1) on human data (reference hg38). My commands are shown below:

1)/data1/software/cnvnator/0.4.1/cnvnator -root root/${sm}.root -tree $line -chrom $(seq -f 'chr%g' 1 22|xargs) chrX chrY >root/${sm}.log 2>&1

2)/data1/software/cnvnator/0.4.1/cnvnator -root $line -his 1000 -fasta hg38/Homo_sapiens_assembly38.fasta -chrom $(seq -f 'chr%g' 1 22|xargs) chrX chrY >root/${sm}.his.log 2>&1

3)/data1/software/cnvnator/0.4.1/cnvnator -root $line -stat 1000 -chrom $(seq -f 'chr%g' 1 22|xargs) chrX chrY >root/${sm}.stat.log 2>&1

4)/data1/software/cnvnator/0.4.1/cnvnator -root $line -partition 1000 -chrom $(seq -f 'chr%g' 1 22|xargs) chrX chrY >root/${sm}.partition.log 2>&1

5)/data1/software/cnvnator/0.4.1/cnvnator -root $line -call 1000 -chrom $(seq -f 'chr%g' 1 22|xargs) chrX chrY > root/${sm}.cnvnator.out 2>root/${sm}.call.log

The root/${sm}.stat.log:

Average RD per bin (1-22) is 1.40171e-12 +- 19.8455 (before GC correction)
Average RD per bin (X,Y) is 6.30303e-12 +- 14.7177 (before GC correction)
Correcting counts by GC-content for 'chr1' ...
Zero value of GC average. Bin 2229 with center 2.2285e+06 is not corrected.
Correcting counts by GC-content for 'chr2' ...
Zero value of GC average. Bin 86915 with center 8.69145e+07 is not corrected.
Correcting counts by GC-content for 'chr3' ...
Correcting counts by GC-content for 'chr4' ...
Zero value of GC average. Bin 32834 with center 3.28335e+07 is not corrected.
...
Average RD per bin (1-22) is 0 +- 12.806 (after GC correction)
Average RD per bin (X,Y) is 1.08074e-14 +- 8.42994 (after GC correction)

The root/${sm}.partition.log is very large:

Partitioning RD signal for 'chr1' with bin size of 1000 ...
Average RD per bin is 0 +- 12.806
Bin band is 2
Abnormal range (0, -1)
Abnormal range (1, 0)
Abnormal range (2, 1)
Abnormal range (3, 2)
Abnormal range (4, 3)
Abnormal range (5, 4)
Abnormal range (6, 5)
Abnormal range (7, 6)
Abnormal range (8, 7)
Abnormal range (9, 8)
Abnormal range (32, 31)
Abnormal range (42, 41)
Abnormal range (52, 51)
...
Partitioning RD signal for 'chrY' with bin size of 1000 ...
Average RD per bin is 1.08074e-14 +- 8.42994
Bin band is 2
Bin band is 3
Bin band is 4
Bin band is 5
Bin band is 6
Bin band is 7
Bin band is 8
Bin band is 10
Bin band is 12
Bin band is 14
Bin band is 16
Bin band is 20
Bin band is 24
Bin band is 28
Bin band is 32
Bin band is 40
Bin band is 48
Bin band is 56
Bin band is 64
Bin band is 80
Bin band is 96
Bin band is 112
Bin band is 128

The output:

duplication chr1:1-410000 410000 inf 0.0141471 1.08576e-114 0.0142255 4.34305e-114 0.606664
duplication chr1:467001-642000 175000 inf 0.0477066 5.99493e-44 0.0599206 2.39797e-43 0.774704
duplication chr1:690001-839000 149000 inf 0.0465148 4.02313e-36 0.0510009 1.60925e-35 0.489465
duplication chr1:844001-2063000 1.219e+06 inf 0 0 0 0 0.249378
duplication chr1:2064001-2421000 357000 inf 3.07817e-08 9.77968e-99 3.22606e-08 3.91187e-98 0.0706259
duplication chr1:2422001-2647000 225000 inf 4.71593e-06 5.32457e-59 4.91936e-06 2.12983e-58 0.118396
duplication chr1:2648001-2703000 55000 inf 1.01687e-05 1.47532e-09 1.95645e-05 6.82253e-09 0.559996
duplication chr1:2799001-2995000 196000 inf 0.00594976 2.53694e-54 0.00607774 1.11613e-53 0.11628
...

The normalized_RD (fourth column) is "inf" for every call. Is this correct? How does it come about?

Could you please look into my issue and tell me what might have gone wrong? Thank you!

abyzov commented 5 years ago

Hi, not sure what coverage you have, but it looks like nothing got saved into the root file: “Average RD per bin (1-22) is 1.40171e-12 +- 19.8455 (before GC correction)”

Please check that you extracted reads for the correct chromosomes (the first step).

Alexej Abyzov, Ph.D. Senior Associate Consultant, Assistant Professor of Biomedical Informatics, Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic

Mayo Clinic, 200 1st street SW, Harwick 3-12 Rochester, MN 55905 www.abyzovlab.org tel: +1-(507)-538-0978 fax: +1-(507)-284-0745
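One way to act on the suggestion to check the chromosomes is to compare the names passed to -chrom with the @SQ names in the BAM header. The following is a minimal sketch, assuming samtools is installed; `missing_chroms` is a hypothetical helper written for illustration and is not part of CNVnator.

```shell
#!/usr/bin/env bash
# Sketch only: missing_chroms is a hypothetical helper, not a CNVnator feature.
# Reads a SAM header on stdin and prints every requested chromosome name
# that does not appear as an SN: value on an @SQ line.
missing_chroms() {
    awk -v want="$*" '
        # Collect reference names from @SQ lines (SN:<name>).
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) seen[substr($i, 4)] = 1
        }
        # Report requested names that were never seen.
        END {
            n = split(want, w, " ")
            for (i = 1; i <= n; i++)
                if (!(w[i] in seen)) print w[i]
        }
    '
}

# Example usage (assumes $line is the BAM file, as in step 1 of the question):
#   samtools view -H "$line" | missing_chroms $(seq -f 'chr%g' 1 22) chrX chrY
```

Empty output means every requested name exists in the header. Any printed name (for example `chr1` when the BAM uses `1`) would point to a naming mismatch between the BAM and the -chrom list.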

BaiqiFu commented 5 years ago

Thank you for your answer! The first step should be right. Its output log is shown below:

Parsing file ../../2.Mapping/bqsr/CWD4_bqsr.bam ...
Allocating memory ...
Done.
Filling and saving tree for 'chr1' ...
Filling and saving tree for 'chr2' ...
Filling and saving tree for 'chr3' ...
Filling and saving tree for 'chr4' ...
Filling and saving tree for 'chr5' ...
Filling and saving tree for 'chr6' ...
Filling and saving tree for 'chr7' ...
Filling and saving tree for 'chr8' ...
Filling and saving tree for 'chr9' ...
Filling and saving tree for 'chr10' ...
Filling and saving tree for 'chr11' ...
Filling and saving tree for 'chr12' ...
Filling and saving tree for 'chr13' ...
Filling and saving tree for 'chr14' ...
Filling and saving tree for 'chr15' ...
Filling and saving tree for 'chr16' ...
Filling and saving tree for 'chr17' ...
Filling and saving tree for 'chr18' ...
Filling and saving tree for 'chr19' ...
Filling and saving tree for 'chr20' ...
Filling and saving tree for 'chr21' ...
Filling and saving tree for 'chr22' ...
Filling and saving tree for 'chrX' ...
Filling and saving tree for 'chrY' ...
Writing histograms ...
Total of 200918007 reads were placed.

These samples include single cells and peripheral blood. The stat log for peripheral blood looks normal:

Average RD per bin (1-22) is 181.463 +- 30.2639 (before GC correction)
Average RD per bin (X,Y) is 177.827 +- 28.3298 (before GC correction)
...
Average RD per bin (1-22) is 180.794 +- 22.2768 (after GC correction)
Average RD per bin (X,Y) is 202.965 +- 25.6514 (after GC correction)

But the single-cell sample is not a complete nucleus: it is half a nucleus or even less, so only 8% ~ 30% of the reference is covered. For peripheral blood, more than 99% is covered, and the output has no "inf":

duplication chr1:92001-120000 28000 1.46816 2.44796e-07 8.5648e-21 4.05174e-07 8.76541e-37 0.607605
deletion chr1:140001-183000 43000 0.73294 4.18446e-09 1.36824e+07 2.39447e-09 1.75451e+07 0.922455
deletion chr1:207001-258000 51000 0.0161502 3.12495e-12 1.08464e-150 3.2525e-12 0 0.799458
duplication chr1:261001-286000 25000 1.49764 2.86321e-05 6.22808e-07 7.15144e-05 1.11545e-05 0.555235
deletion chr1:298001-348000 50000 0 3.18745e-12 0 3.32026e-12 0 0.705128
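The two stat logs in this thread differ sharply: roughly 181 reads per bin for peripheral blood versus about 1e-12 (and 0 after GC correction) for the single-cell sample. A near-zero genome-wide average is a plausible source of the "inf" values, on the assumption (not confirmed here from the CNVnator source) that normalized_RD is a region's read depth divided by that average. The sketch below flags such logs before -partition and -call are run. The function name `check_rd_mean` and the default threshold of 1.0 are illustrative choices, not CNVnator settings.

```shell
#!/usr/bin/env bash
# Sketch only: check_rd_mean is a hypothetical helper, not a CNVnator feature.
# Assumes the -stat log contains entries of the form quoted in this thread:
#   Average RD per bin (1-22) is <mean> +- <sd> (before GC correction)
# The threshold (default 1.0) is an arbitrary illustration.
check_rd_mean() {
    local log="$1" threshold="${2:-1.0}"
    # grep -o pulls out each summary entry even if the log is on one line.
    grep -o 'Average RD per bin ([^)]*) is [^ ]* +- [^ ]* ([a-z]* GC correction)' "$log" |
    awk -v thr="$threshold" '{
        mean = $7 + 0                      # 7th field is the mean RD
        flag = (mean < thr) ? "SUSPICIOUS" : "ok"
        print flag "\t" $0
    }'
}

# Example usage (path as in the question):
#   check_rd_mean root/${sm}.stat.log
```

Each summary entry is printed with a leading `ok` or `SUSPICIOUS` tag, so a sample like the single-cell one can be set aside before calling.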

abyzov commented 5 years ago

What do you mean by half a nucleus or even less? Do you mean that 50% of the genome has no coverage?

BaiqiFu commented 5 years ago

Yes! In fact, only 8% ~ 30% of the reference genome has coverage. This was caused by the experimental procedure, and the coverage can no longer be remedied.

abyzov commented 5 years ago

Hi, is it because the average coverage is too shallow or because only 30% of the genome is amplified?

BaiqiFu commented 5 years ago

My research includes two types of samples: single cell and peripheral blood. Both were re-sequenced (PE150, ~90G data per sample). The peripheral blood is normal, with > 99% of the reference genome covered. The single-cell sample, however, contained only half a nucleus and was contaminated with bacteria, so < 50% of the reference genome is sequenced.

abyzov commented 5 years ago

Hi, CNVnator is not designed to work with such uneven coverage. The results cannot be trusted, and various issues like the one you are experiencing are possible. I would only recommend using the tool for data where the coverage is more or less uniform.
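To judge beforehand whether a sample has the kind of coverage described here, the fraction of reference positions with at least one read can be estimated from per-position depth. The sketch below assumes samtools is installed and reads the three-column output of `samtools depth -a` (chromosome, position, depth). `coverage_breadth` is a hypothetical helper, and no particular cutoff is endorsed by the thread.

```shell
#!/usr/bin/env bash
# Sketch only: coverage_breadth is a hypothetical helper, not a CNVnator feature.
# Reads "chrom<TAB>pos<TAB>depth" lines on stdin and prints the fraction of
# positions with depth > 0, or "NA" if there is no input.
coverage_breadth() {
    awk '
        { total++; if ($3 > 0) covered++ }
        END {
            if (total > 0) printf "%.4f\n", covered / total
            else print "NA"
        }
    '
}

# Example usage (sample.bam is a placeholder path):
#   samtools depth -a sample.bam | coverage_breadth
```

On a whole human genome this streams billions of lines and may take a while. A value well below 1, such as the 0.08 to 0.30 range reported for the single-cell sample, would indicate the uneven coverage the tool is said not to handle.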

BaiqiFu commented 5 years ago

Thanks for your very kind reply to my question!