Errors in fix.py - Githubissues

wqrao commented 2 years ago

Hello, I was doing germline analysis following the user guide using cnvkit 0.9.9. However, I have some questions about outputting cnr. I use the code cnvkit.py batch B1_bqsr.bam --segment-method hmm-germline -p 8 --output-reference result/ -r FlatRef.cnn Then cnvkit.py fix B1_bqsr.targetcoverage.cnn B1_bqsr.antitargetcoverage.cnn FlatRef.cnn -o B1.cnr. In this step, there are errors as follows:

Traceback (most recent call last):
  File "/home/miniconda3/bin/cnvkit.py", line 9, in <module>
    args.func(args)
  File "/home/miniconda3/lib/python3.7/site-packages/cnvlib/commands.py", line 608, in _cmd_fix
    args.cluster)
  File "/home/miniconda3/lib/python3.7/site-packages/cnvlib/fix.py", line 16, in do_fix
    True, do_gc, do_edge, False)
  File "/home/miniconda3/lib/python3.7/site-packages/cnvlib/fix.py", line 69, in load_adjust_coverages
    ref_matched = match_ref_to_sample(ref_cnarr, cnarr)
  File "/home/miniconda3/lib/python3.7/site-packages/cnvlib/fix.py", line 139, in match_ref_to_sample
    "{}.").format(name, len(dset.index[dupes]), "\n".join(map(str, dset.index[dupes][:10]))))
ValueError: Duplicated genomic coordinates in sample set. Total duplicated regions: 420, starting with:
('chr1', 10430517, 10430567)
('chr1', 10433841, 10433964)
('chr1', 10434656, 10434689)
('chr1', 10440346, 10440412)
('chr1', 10451376, 10451594)
('chr1', 111503004, 111503353)
('chr2', 86505803, 86505948)
('chr2', 86507478, 86507592)
('chr2', 86510357, 86510478)
('chr2', 86529217, 86529396).

How to deal with it? Thank you for your help.

tetedange13 commented 2 years ago

Hi @wqrao ,

In the user guide it is said that to build a reference for CNVkit (your "flatRef.cnn"), you are supposed to use "baited genomic regions for your target capture kit, as provided by your vendor" => Which are not supposed to contain duplicated regions (the reason why you get an error)

Hope this helps ! Kind regards, Felix.

wqrao commented 2 years ago

Hello @tetedange13 , Many thanks for your help. I regenerated the "flatRef.cnn" and built .cnr file successfully. However, the next step failed again. I wrote cnvkit.py segment ./res/B1_bqsr.cnr -o ./res/B1_bqsr.cns -m hmm-germline -p 8 for generating .cns There are errors as follows: Segmenting with method 'hmm-germline' in 8 processes Smoothing overshot at 6 / 15715 indices: (-27.561884666418475, 2.423114113233022) vs. original (-26.4023, 5.41255) Smoothing overshot at 11 / 15661 indices: (-32.590967831385235, 1.1132613787142873) vs. original (-26.4023, 4.01252) Smoothing overshot at 4 / 109 indices: (-8.495912514212478, 5.0282835362849) vs. original (-26.1403, 3.58033) Building model from observations Traceback (most recent call last): File "/home/yaoxq/miniconda3/bin/cnvkit.py", line 9, in <module> args.func(args) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/commands.py", line 660, in _cmd_segment smooth_cbs=args.smooth_cbs) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/segmentation/__init__.py", line 54, in do_segmentation rscript_path) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/segmentation/__init__.py", line 139, in _do_segmentation segarr = hmm.segment_hmm(filtered_cn, method, threshold, variants) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/segmentation/hmm.py", line 45, in segment_hmm model = hmm_get_model(cnarr, method, processes) File "/home/yaoxq/miniconda3/lib/python3.7/site-packages/cnvlib/segmentation/hmm.py", line 139, in hmm_get_model verbose=False) File "pomegranate/hmm.pyx", line 2564, in pomegranate.hmm.HiddenMarkovModel.fit File "pomegranate/hmm.pyx", line 2565, in pomegranate.hmm.HiddenMarkovModel.fit TypeError: delayed() got an unexpected keyword argument 'check_pickle' Could you help me once more? Thanks very much!

tetedange13 commented 2 years ago

This error looks more like an issue related to required Python packages for hmm-germline specific segment-method => How did you install CNVkit exactly ? I see you are running it inside a conda environnement => But did you install it inside its own conda env ? => If not, you may have a "Python package collision" = one of the package required were already installed in your base env, but not in the correct version for CNVkit

Try following command: conda create -n cnvkit -c conda-forge -c bioconda cnvkit => This should create you a dedicated conda-env named cnvkit, with everything properly installed inside => You can also do it using set up conda channels as explained in CNVkit's README

Also please note, as explained in the documentation, that hmm-germline method is experimental => Default segment-method (called CBS) performs well in most cases => Plus you may not know that there is a dedicated section explaining how you can proceed with germline samples

Hope you succeed ! Best wishes, Felix.

wqrao commented 2 years ago

Thanks a lot! I have run the process successfully. I export VCF format for call section. However, I am confused about the value of FOLD_CHANGE and FOLD_CHANGE_LOG. Are they the same as the log2 ratio? Could you explain something about it? I appreciate your help.

#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | SampleID
chr1 | 12080 | . | N | <DEL> | . | . | IMPRECISE;SVTYPE=DEL;END=297507;SVLEN=-285427;FOLD_CHANGE=0.787774;FOLD_CHANGE_LOG=-0.344146;PROBES=40 | GT:GQ | 0/1:40

tetedange13 commented 2 years ago

I think these field names were thought self-explicit enough, but adding their formula in exported VCF header might help ? => Looking into export vcf code, we see following formulas: https://github.com/etal/cnvkit/blob/e29ec7de7bdd03892820a7a712c284c46b580dda/cnvlib/export.py#L327-L328

Meaning that:

FOLD_CHANGE_LOG is completely the same as "log2 ratio" (which can be seen as a "log fold change" as it is the log2-transformation of ratio (fold change) tumor_coverage / reference_coverage)
FOLD_CHANGE = 2 ** (power of or ^) log2_ratio = You revert log2-transformation, so you just have a fold change (ratio) in copy number left (relative to your reference ploidy so multiply this value by "2" for Human and you get an "absolute copy number")

Found another explaination inside call code : https://github.com/etal/cnvkit/blob/e29ec7de7bdd03892820a7a712c284c46b580dda/cnvlib/call.py#L269-L271

Hope this helped ! Kind regards, Felix.

etal / cnvkit

Errors in fix.py #740