abyzovlab / CNVpytor

a python extension of CNVnator -- a tool for CNV analysis from depth-of-coverage by mapped reads
MIT License
179 stars 26 forks source link

Problem exporting .vcf file #173

Closed lunalvan closed 1 year ago

lunalvan commented 1 year ago

Hi, I am using CNVpytor to call CNVs from WGS data, and an issue has emerged when exporting the calls in .vcf format.

After the CNV calling step, the .tsv generated file looks good, with calls from all the different chromosomes. Afterwards, I enter the interactive mode by running

cnvpytor -root input.pytor -view 500

After that, I want to generate both an .xlsx and a .vcf files. There is no trouble generating the excel one, it also contains all of the information. However, when doing the same thing for a .vcf file, this error comes up:

cnvpytor> set print_filename calls.vcf
    * print_filename: calls.vcf
cnvpytor> print calls
[W::bcf_update_info] INFO/END=0 is smaller than POS at chrUn_GL000218v1:1
cnvpytor> quit

When looking into the resulting vcf file, only calls from chrUN_GL000218v1 are present, no other calls from any other chromosomes. Again, the .xlsx and .tsv files are okay. Do you maybe know what could be causing this?

arpanda commented 1 year ago

Thank you for the reporting the issue. A bug fix is done for the VCF export. It should export calls for all the chromosomes.

[W::bcf_update_info] INFO/END=0 is smaller than POS at chrUn_GL000218v1:1

It's a warning message and not causing any problems for the VCF export. I compared it with the print calls. We are checking the cause of the warning message.

Please let me know whether the export working for you properly or not.

-Arijit

lunalvan commented 1 year ago

Hi! It is not exporting the VCF correctly, but the other formats (TSV, XSLX) are fine. This is what I did:

Went through all of the steps to call CNVs from a BAM file, then entered view mode (the output of 'print calls' written here is only a part of the full one I get, as it is way too long; anyways, it shows CNVs from all chromosomes)

cnvpytor -root input.pytor -rd file.bam
cnvpytor -root input.pytor -his 500
cnvpytor -root input.pytor -partition 500
cnvpytor -root input.pytor -call 500
cnvpytor -root input.pytor -view 500

cnvpytor> print calls

deletion        chr15:17416501-17429500 13000   0.6160  6.907563e-03    5.306594e+08    9.268198e-03    1.582461e+07    0.8849  0.0000  322500
deletion        chr15:17430501-17432500 2000    0.3927  1.304437e+03    8.317632e-05    1.000000e+00    1.000000e+00    1.0000  0.0000  319500
deletion        chr15:17434001-17435500 1500    0.3999  6.452230e+03    8.242455e-04    1.000000e+00    1.000000e+00    1.0000  0.0000  316500
deletion        chr15:17447501-17458000 10500   0.5473  1.153577e-03    3.370347e+08    1.694639e-01    5.068574e+08    0.8953  0.0000  294000
deletion        chr15:17460501-17463500 3000    0.3499  8.689704e+01    1.415738e-11    1.114358e+05    4.886709e+02    0.8685  0.0000  288500

cnvpytor> set print_filename calls.tsv
    * print_filename: calls.tsv
cnvpytor> print calls
cnvpytor> quit

When opening the TSV file, all chromosomes are present:

FG014001        rd_mean_shift   deletion        chr1    1.0     10000.0 10000.0 0.0     0.0     1.035158820320922e-283  0.0     3.1886894735586106e-225 -1.0    1.0     0.0
FG014001        rd_mean_shift   deletion        chr1    77001.0 93000.0 16000.0 0.5861244203691571      5.17960674351059e-10    2.831344821281673       2.030861193499537e-08   37.82228977146194       0.26245505906522854     0.0     67100.0
FG014001        rd_mean_shift   deletion        chr1    139501.0        155000.0        15500.0 0.5841276440529031      2.4409700067680674e-08  454913.0033028214       5.816907918306432e-06   1406903.9842667007      0.933991683991684       0.0     52800.0

As previously said, when I try to do the same thing for a VCF file, the following message appears:

cnvpytor> set print_filename calls.vcf
    * print_filename: calls.vcf
cnvpytor> print calls
[W::bcf_update_info] INFO/END=0 is smaller than POS at chrUn_GL000218v1:1
cnvpytor> quit

When looking into the generated VCF, there are a lot of calls, but all of them belong to the 'chrUn_GL000218v1:1' chromosome that was mentioned in the warning message.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  FG014001
chrUn_GL000218v1        2       CNVpytor_del0   .       <DEL>   .       PASS    END=10000;IMPRECISE;SVLEN=10000;SVTYPE=DEL;pytorRD=0;pytorP1=0;pytorP2=0;pytorP3=0;pytorP4=0;pytorQ0=-1;pytorPN=1;pytorDG=0;pytorCL=rd_mean_shift       GT:CN   1/1:0
chrUn_GL000218v1        77001   CNVpytor_del1   .       <DEL>   .       PASS    END=93000;IMPRECISE;SVLEN=16000;SVTYPE=DEL;pytorRD=0.586124;pytorP1=5.17961e-10;pytorP2=2.83134;pytorP3=2.03086e-08;pytorP4=37.8223;pytorQ0=0.262455;pytorPN=0;pytorDG=67100;pytorCL=rd_mean_shift      GT:CN   0/1:0
chrUn_GL000218v1        139501  CNVpytor_del2   .       <DEL>   .       PASS    END=155000;IMPRECISE;SVLEN=15500;SVTYPE=DEL;pytorRD=0.584128;pytorP1=2.44097e-08;pytorP2=454913;pytorP3=5.81691e-06;pytorP4=1.4069e+06;pytorQ0=0.933992;pytorPN=0;pytorDG=52800;pytorCL=rd_mean_shift   GT:CN   0/1:0
chrUn_GL000218v1        160001  CNVpytor_del3   .       <DEL>   .       PASS    END=177000;IMPRECISE;SVLEN=17000;SVTYPE=DEL;pytorRD=0.653613;pytorP1=5.67299e-06;pytorP2=1.69651e+08;pytorP3=0.00033837;pytorP4=2.36637e+08;pytorQ0=0.963744;pytorPN=0;pytorDG=30800;pytorCL=rd_mean_shift      GT:CN   0/1:0
chrUn_GL000218v1        207501  CNVpytor_del4   .       <DEL>   .       PASS    END=257500;IMPRECISE;SVLEN=50000;SVTYPE=DEL;pytorRD=0.000890327;pytorP1=0;pytorP2=0;pytorP3=0;pytorP4=0;pytorQ0=0.777778;pytorPN=0;pytorDG=0;pytorCL=rd_mean_shift      GT:CN   1/1:0

It may be worth noting that, when running 'print calls' with the view mode, the last three calls are:

duplication     chrUn_GL000218v1:1-80000        80000   2.1118  0.000000e+00    2.081059e-14    0.000000e+00    7.884573e-14    0.0631  0.0000  100
duplication     chrUn_GL000218v1:81501-98500    17000   1.8095  3.543695e-09    7.375459e-34    3.803691e-09    9.862915e-42    0.1515  0.0000  62800
duplication     chrUn_GL000218v1:99501-161500   62000   1.9478  0.000000e+00    2.358886e+09    0.000000e+00    2.822972e-184   0.1595  0.0057  100

Which are the only calls that belong to the 'chrUn_GL000218v1:1' chromosome in the original calls.

I also tried to run the full thing only with chromosomes 1-22, X, Y and MT. The same error appeared, but this time regarding chromosome 22. The VCF file only showed calls from chr22.

I hope this information was helpful! Thanks!

arpanda commented 1 year ago

Currently the update is available to latest development version only. Did you install the latest development version? if not, command to install

pip install git+https://github.com/abyzovlab/CNVpytor.git

It will be added to our upcoming release.

[W::bcf_update_info] INFO/END=0 is smaller than POS at chrUn_GL000218v1:1

The above warning message will be addressed by the upcoming pysam release and not causing any issue. Ref pysam-developers/pysam/issues/1175

lunalvan commented 1 year ago

Hi, I have installed the latest development version, but it still does not work.

I thought about looking at the VCF and TSV files that are generated from the same run and I spotted something.

These are the first lines in the VCF file that is generated

chrUn_GL000218v1        77001   CNVpytor_del0   .       <DEL>   .       PASS    END=93000;IMPRECISE;SVLEN=16000;SVTYPE=DEL;pytorRD=0.586124;pytorP1=5.17961e-10;pytorP2=2.83134;pytorP3=2.03086e-08;pytorP4=37.8223;pytorQ0=0.262455;pytorPN=0;pytorDG=67100;pytorCL=rd_mean_shift      GT:CN   0/1:0
chrUn_GL000218v1        276001  CNVpytor_dup1   .       <DUP>   .       PASS    END=289000;IMPRECISE;SVLEN=13000;SVTYPE=DUP;pytorRD=1.77527;pytorP1=1.6296e-06;pytorP2=1.74525e-25;pytorP3=0.000132433;pytorP4=3.20541e-20;pytorQ0=0.335371;pytorPN=0;pytorDG=9100;pytorCL=rd_mean_shift        GT:CN   1/1:2
chrUn_GL000218v1        297501  CNVpytor_del2   .       <DEL>   .       PASS    END=348000;IMPRECISE;SVLEN=50500;SVTYPE=DEL;pytorRD=0.00887088;pytorP1=0;pytorP2=0;pytorP3=0;pytorP4=0;pytorQ0=0.131868;pytorPN=0;pytorDG=0;pytorCL=rd_mean_shift       GT:CN   1/1:0
chrUn_GL000218v1        13224501        CNVpytor_dup3   .       <DUP>   .       PASS    END=13247500;IMPRECISE;SVLEN=23000;SVTYPE=DUP;pytorRD=1.97146;pytorP1=0;pytorP2=5.68356e-06;pytorP3=0;pytorP4=0.000107939;pytorQ0=0.00876018;pytorPN=0;pytorDG=220300;pytorCL=rd_mean_shift     GT:CN   1/1:2

And these are the ones from the TSV file

FG014001        rd_mean_shift   deletion        chr1    77001.0 93000.0 16000.0 0.5861244203691571      5.17960674351059e-10    2.831344821281673       2.030861193499537e-08   37.82228977146194       0.26245505906522854     0.0     67100.0
FG014001        rd_mean_shift   duplication     chr1    276001.0        289000.0        13000.0 1.775269007501444       1.629596227190269e-06   1.7452471084512207e-25  0.00013243259089357196  3.205413376007881e-20   0.33537099494097805     0.0     9100.0
FG014001        rd_mean_shift   deletion        chr1    297501.0        348000.0        50500.0 0.008870882819780327    0.0     3.08855205995407e-60    0.0     0.0     0.13186813186813187     0.9900990099009901      0.0
FG014001        rd_mean_shift   duplication     chr1    13224501.0      13247500.0      23000.0 1.9714585906709374      0.0     5.683558121712347e-06   0.0     0.00010793889006778407  0.008760177264763476    0.0     220300.0

They are the same calls, but for some reason, when the file is exported to VCF format, it changes the names of the chromosomes so that all of the CNVs belong to the mentioned chromosome in the warning message. I have no idea why this happens, is there any way I could fix this?

Thanks,

Luna

arpanda commented 1 year ago

I am not having the issue and created a colab link for you to show. Colab link: https://colab.research.google.com/drive/1jTbnk0tKcALZKj2_dRcyoVaL0IQrPGlE?usp=sharing Here is the example calls from multiple chromosome exported in VCF format.

chr5    46400001    CNVpytor_del16  .   <DEL>   .   PASS    END=49400000;IMPRECISE;SVLEN=3000000;SVTYPE=DEL;pytorRD=0.00323983;pytorP1=0;pytorP2=0;pytorP3=0;pytorP4=0;pytorQ0=0.265431;pytorPN=0;pytorDG=0;pytorCL=rd_mean_shift   GT:CN   1/1:0
chr5    58700001    CNVpytor_del17  .   <DEL>   .   PASS    END=59400000;IMPRECISE;SVLEN=700000;SVTYPE=DEL;pytorRD=0.375306;pytorP1=0.386142;pytorP2=0;pytorP3=0.386142;pytorP4=0;pytorQ0=0.00611391;pytorPN=0;pytorDG=9294500;pytorCL=rd_mean_shift    GT:CN   0/1:1
chr6    2   CNVpytor_del18  .   <DEL>   .   PASS    END=200000;IMPRECISE;SVLEN=200000;SVTYPE=DEL;pytorRD=0.561811;pytorP1=1205.83;pytorP2=0.00313723;pytorP3=1205.83;pytorP4=0.00313723;pytorQ0=0.827092;pytorPN=0;pytorDG=0;pytorCL=rd_mean_shift  GT:CN   0/1:1
chr6    200001  CNVpytor_dup19  .   <DUP>   .   PASS    END=28500000;IMPRECISE;SVLEN=28300000;SVTYPE=DUP;pytorRD=1.44527;pytorP1=0;pytorP2=0;pytorP3=0;pytorP4=0;pytorQ0=0.00767461;pytorPN=0;pytorDG=140100;pytorCL=rd_mean_shift  GT:CN   0/1:3
chr6    28500001    CNVpytor_del20  .   <DEL>   .   PASS    END=33400000;IMPRECISE;SVLEN=4900000;SVTYPE=DEL;pytorRD=0.269621;pytorP1=0;pytorP2=0;pytorP3=0;pytorP4=0;pytorQ0=0.840541;pytorPN=0;pytorDG=24687800;pytorCL=rd_mean_shift  GT:CN   0/1:1