etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
545 stars 165 forks source link

Errors in fix.py #740

Open wqrao opened 2 years ago

wqrao commented 2 years ago

Hello, I was doing germline analysis following the user guide using cnvkit 0.9.9. However, I have some questions about outputting cnr. I use the code cnvkit.py batch B1_bqsr.bam --segment-method hmm-germline -p 8 --output-reference result/ -r FlatRef.cnn Then cnvkit.py fix B1_bqsr.targetcoverage.cnn B1_bqsr.antitargetcoverage.cnn FlatRef.cnn -o B1.cnr. In this step, there are errors as follows:

Traceback (most recent call last):
  File "/home/miniconda3/bin/cnvkit.py", line 9, in <module>
    args.func(args)
  File "/home/miniconda3/lib/python3.7/site-packages/cnvlib/commands.py", line 608, in _cmd_fix
    args.cluster)
  File "/home/miniconda3/lib/python3.7/site-packages/cnvlib/fix.py", line 16, in do_fix
    True, do_gc, do_edge, False)
  File "/home/miniconda3/lib/python3.7/site-packages/cnvlib/fix.py", line 69, in load_adjust_coverages
    ref_matched = match_ref_to_sample(ref_cnarr, cnarr)
  File "/home/miniconda3/lib/python3.7/site-packages/cnvlib/fix.py", line 139, in match_ref_to_sample
    "{}.").format(name, len(dset.index[dupes]), "\n".join(map(str, dset.index[dupes][:10]))))
ValueError: Duplicated genomic coordinates in sample set. Total duplicated regions: 420, starting with:
('chr1', 10430517, 10430567)
('chr1', 10433841, 10433964)
('chr1', 10434656, 10434689)
('chr1', 10440346, 10440412)
('chr1', 10451376, 10451594)
('chr1', 111503004, 111503353)
('chr2', 86505803, 86505948)
('chr2', 86507478, 86507592)
('chr2', 86510357, 86510478)
('chr2', 86529217, 86529396).

How to deal with it? Thank you for your help.

tetedange13 commented 2 years ago

Hi @wqrao ,

In the user guide it is said that to build a reference for CNVkit (your "flatRef.cnn"), you are supposed to use "baited genomic regions for your target capture kit, as provided by your vendor" => Which are not supposed to contain duplicated regions (the reason why you get an error)

Hope this helps ! Kind regards, Felix.

wqrao commented 2 years ago

Hello @tetedange13 , Many thanks for your help. I regenerated the "flatRef.cnn" and built .cnr file successfully. However, the next step failed again. I wrote cnvkit.py segment ./res/B1_bqsr.cnr -o ./res/B1_bqsr.cns -m hmm-germline -p 8 for generating .cns There are errors as follows: Segmenting with method 'hmm-germline' in 8 processes Smoothing overshot at 6 / 15715 indices: (-27.561884666418475, 2.423114113233022) vs. original (-26.4023, 5.41255) Smoothing overshot at 11 / 15661 indices: (-32.590967831385235, 1.1132613787142873) vs. original (-26.4023, 4.01252) Smoothing overshot at 4 / 109 indices: (-8.495912514212478, 5.0282835362849) vs. original (-26.1403, 3.58033) Building model from observations Traceback (most recent call last): File "/home/yaoxq/miniconda3/bin/cnvkit.py", line 9, in <module> args.func(args) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/commands.py", line 660, in _cmd_segment smooth_cbs=args.smooth_cbs) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/segmentation/__init__.py", line 54, in do_segmentation rscript_path) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/segmentation/__init__.py", line 139, in _do_segmentation segarr = hmm.segment_hmm(filtered_cn, method, threshold, variants) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/segmentation/hmm.py", line 45, in segment_hmm model = hmm_get_model(cnarr, method, processes) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/segmentation/hmm.py", line 139, in hmm_get_model verbose=False) File "pomegranate/hmm.pyx", line 2564, in pomegranate.hmm.HiddenMarkovModel.fit File "pomegranate/hmm.pyx", line 2565, in pomegranate.hmm.HiddenMarkovModel.fit TypeError: delayed() got an unexpected keyword argument 'check_pickle' Could you help me once more? Thanks very much!

tetedange13 commented 2 years ago

This error looks more like an issue related to required Python packages for hmm-germline specific segment-method => How did you install CNVkit exactly ? I see you are running it inside a conda environnement => But did you install it inside its own conda env ? => If not, you may have a "Python package collision" = one of the package required were already installed in your base env, but not in the correct version for CNVkit

Try following command: conda create -n cnvkit -c conda-forge -c bioconda cnvkit => This should create you a dedicated conda-env named cnvkit, with everything properly installed inside => You can also do it using set up conda channels as explained in CNVkit's README

Also please note, as explained in the documentation, that hmm-germline method is experimental => Default segment-method (called CBS) performs well in most cases => Plus you may not know that there is a dedicated section explaining how you can proceed with germline samples

Hope you succeed ! Best wishes, Felix.

wqrao commented 2 years ago

Thanks a lot! I have run the process successfully. I export VCF format for call section. However, I am confused about the value of FOLD_CHANGE and FOLD_CHANGE_LOG. Are they the same as the log2 ratio? Could you explain something about it? I appreciate your help.

#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | SampleID
chr1 | 12080 | . | N | <DEL> | . | . | IMPRECISE;SVTYPE=DEL;END=297507;SVLEN=-285427;FOLD_CHANGE=0.787774;FOLD_CHANGE_LOG=-0.344146;PROBES=40 | GT:GQ | 0/1:40
tetedange13 commented 2 years ago

I think these field names were thought self-explicit enough, but adding their formula in exported VCF header might help ? => Looking into export vcf code, we see following formulas: https://github.com/etal/cnvkit/blob/e29ec7de7bdd03892820a7a712c284c46b580dda/cnvlib/export.py#L327-L328

Meaning that:

Found another explaination inside call code : https://github.com/etal/cnvkit/blob/e29ec7de7bdd03892820a7a712c284c46b580dda/cnvlib/call.py#L269-L271

Hope this helped ! Kind regards, Felix.