XiaoTaoWang / NeoLoopFinder

A computation framework for genome-wide detection of enhancer-hijacking events from chromatin interaction data in re-arranged genomes
Other
57 stars 16 forks source link

CNV segmentation crash #3

Open cmdoret opened 3 years ago

cmdoret commented 3 years ago

Hi @XiaoTaoWang

When running the segment-cnv module, I keep running into the error below. When running at 10kb, the CNV profile was full of 0's, and I got this error on the first chromosome. With the same cool file, but rebinned at 100kb, the calculate-cnv module ran fine and the CNV profile looked alright, but I got the following traceback at chromosome 19. Any idea why this happens ?

Thanks

NOTE: I truncated the log at [...], but all chromosomes in between have normal INFO logs

root                      INFO    @ 06/09/21 22:02:15: 
# ARGUMENT LIST:
# CNV Profile = ../neoloopfinder/detected_cnv.txt
# Ploidy = 2
# Bin Size = 100000
# Output Path = ../neoloopfinder/segmented_cnv.txt
# Number of Processes = 12
# Log file name = cnv-seg.log
root                      INFO    @ 06/09/21 22:02:17: Loading CNV profile ...
root                      INFO    @ 06/09/21 22:02:17: Perform segmentation ...
neoloop.cnv.segcnv        INFO    @ 06/09/21 22:02:17: Segmenting Chromosome 1 ...
neoloop.cnv.segcnv        INFO    @ 06/09/21 22:02:22: Estimated HMM state number: 5 (log scale)
...
neoloop.cnv.segcnv        INFO    @ 06/09/21 22:07:57: Segmenting Chromosome 19 ...
neoloop.cnv.segcnv        INFO    @ 06/09/21 22:08:00: Estimated HMM state number: 4 (log scale)
Traceback (most recent call last):
  File "/home/cmatthey/anaconda3/bin/segment-cnv", line 87, in run
    work.segment()
  File "/home/cmatthey/anaconda3/lib/python3.8/site-packages/neoloop/cnv/segcnv.py", line 139, in segment
    sig, hmm_seg, scale = self._segment(sig) # pure HMM-based segmentation
  File "/home/cmatthey/anaconda3/lib/python3.8/site-packages/neoloop/cnv/segcnv.py", line 301, in _segment
    seq = np.r_[[s.name for i, s in model.viterbi(np.log2(queue))[1][1:]]]
TypeError: 'NoneType' object is not subscriptable
andyjslee commented 3 years ago

Hello @XiaoTaoWang, I am having the same problem as @cmdoret.

XiaoTaoWang commented 3 years ago

Hi @cmdoret and @andyjslee, I couldn't reproduce this error with any of the samples I have. Just curious, are you running calculate-cnv and segment-cnv on some extremely sparse matrix (since @cmdoret mentioned there are many 0s at 10kb)?

cmdoret commented 3 years ago

The matrix is a bit sparse indeed, it has 258M contacts and below are pictures of the region chr3:160Mb-195Mb at 10 and 100 kb resolution respectively. The 100kb matrix looks like alright to me.

I encountered the errors after running the following commands:

calculate-cnv -g hg19 -H PM25_100kb.cool --output ../neoloopfinder/detected_cnv.txt -e Arima --cachefolder ../neoloopfinder
segment-cnv --cnv-file ../neoloopfinder/detected_cnv.txt --binsize 100000 --output ../neoloopfinder/segmented_cnv.txt --nproc 12

image image

XiaoTaoWang commented 3 years ago

@cmdoret Interesting ... I think 258M contacts are more than enough for calculate-cnv and segment-cnv to be run at 10kb (and your 10kb map looks good to me), so the matrix sparsity shouldn't be the issue.

If you don't mind, can you share the CNV profiles (of a chromosome with CNV segmentation failed, maybe chr1 at 10kb and chr19 at 100kb) to me (wangxiaotao686@gmail.com), just for debugging. Thanks!

andyjslee commented 3 years ago

Hello @XiaoTaoWang, thank you for your prompt response. My sample has around 670M contacts.

cmdoret commented 3 years ago

Thanks for your help @XiaoTaoWang I sent you the files along with the matrix :)

XiaoTaoWang commented 3 years ago

Thanks @cmdoret, I have downloaded your files, however, the funny thing is that I still couldn't reproduce your error when I ran segment-cnv with the file "detected_cnv_100kb.txt" as input. I suspect the only reason might be minor differences between package versions, can you double check the versions of all the dependent packages (especially pomegranate) to make sure they are the same as the ones listed in https://github.com/XiaoTaoWang/NeoLoopFinder#installation?

Below are details of my test:

command I used:

$ segment-cnv --cnv-file detected_cnv_100kb.txt --ploidy 2 --binsize 100000  --output test.cnv-seg.100k.txt --nproc 4

CNV segments of chr19:

$ cat test.cnv-seg.100k.txt | grep chr19
chr19   0   700000  1
chr19   700000  3600000 2
chr19   3600000 4000000 1
chr19   4000000 14000000    2
chr19   14000000    16300000    3
chr19   16300000    22900000    2
chr19   22900000    23100000    3
chr19   23100000    23700000    2
chr19   23700000    24000000    6
chr19   24000000    24400000    3
chr19   24400000    27800000    1
chr19   27800000    50400000    2
chr19   50400000    54000000    1
chr19   54000000    54700000    2
chr19   54700000    55400000    1
chr19   55400000    59200000    2
cmdoret commented 3 years ago

Thanks for trying ! I have included a dump of my conda environment below. I did not notice an obvious problem with my package versions, but I agree it does look like a version compatibility bug. Just throwing some ideas here, sorry if I'm wrong (not familiar with pomegranate):

channels:
  - conda-forge
  - bioconda
  - defaults
  - r
dependencies:
  - cooler=0.8.6
  - scipy=1.3.1
  - networkx=1.11
  - pybigwig=0.3.17
  - scikit-learn=0.20.2
  - joblib=0.13.2
  - matplotlib=3.1.1
  - python=3.7.1
  - cython=0.29.13
  - pyensembl=1.8.0
  - numpy=1.17.2
  - rpy=2-2.9.4
  - r=3.5.1
  - r-mgcv
  - ipython
  - pomegranate=0.10.0
XiaoTaoWang commented 2 years ago

Hi @cmdoret I just released a new NeoLoopFinder version (https://github.com/XiaoTaoWang/NeoLoopFinder#release-notes). According to my test, this version is more stable and is compatible with the latest versions of dependent packages. Hope this update solves this issue!

AndrewC160 commented 2 years ago

Hello @XiaoTaoWang , I just wanted to note that I'm also seeing this error during segmentation after a fresh install of the environment. I will be troubleshooting as well!

Edit: I can confirm that I have at least one calculate-cnv output file that does run to completion with segment-cnv, and several others resolutions that do not: the only difference is bin size. In case you would like to see them, I've attached 5 files: the CNV bedGraphs, the segment-cnv log files , and the resulting segment file for the successful run. segment-cnv succeeds with the 1MB bin size CNV bedGraph, and fails with the 100k bin size file. I have also tried 10k, 50k, and 5MB, and 10MB bin sizes which appear to fail as well.

1MB_cnv.bedGraph.txt 1MB_segment.log 1MB_segments.bedGraph.txt 100k_cnv.bedGraph.txt 100k_segment.log

XiaoTaoWang commented 2 years ago

Hi @AndrewC160 , sorry for the late response. Thanks for sharing these files with me, they were very helpful for me to troubleshoot the instability of the original CNV segmentation algorithm. I finally solved this problem by changing the training dataset of HMM from chromosome-wide CNV signals to genome-wide CNV signals.

As a result, I released a new version 0.4.2 (https://pypi.org/project/neoloop/). Please upgrade your NeoLoopFinder to this version and try again. For your information, here is the command I used to perform the segmentation on the file "100k_cnv.bedGraph.txt":

$ segment-cnv --cnv-file 100k_cnv.bedGraph.txt --binsize 100000 --output 100k_cnv_seg.bedGraph.txt --num-of-states 3 --min-segment-size 3 --cbs-pvalue 5e-2 --max-dist 1

Note that I've also made key parameters of the segmentation algorithm tunable. The results will look like this. 100k_CNV bychrom

Let me know if you have any further questions! Thanks!

Xiaotao