Tian-Dechao / diffDomain

DiffDomain is a statistically sound method for detecting differential TADs between conditions
MIT License
13 stars 4 forks source link

Error when using a mcool file as input #13

Closed FlavianR closed 10 months ago

FlavianR commented 10 months ago

Hello again!

I have an issue while I was testing dvsd one.

I use as follow :

python ../diffDomain/diffdomain-py3/diffdomains.py dvsd one chr2 10000000 15000000 --reso 40000 data1.mcool data2.mcool

And I obtain the error :

Traceback (most recent call last):
  File "../diffDomain/diffdomain-py3/diffdomains.py", line 45, in <module>
    result = comp2domins_by_twtest(chrn=opts['<chr>'], start=int(opts['<start>']), end=int(opts['<end>']), reso=int(opts['--reso']), hicnorm=opts['--hicnorm'], fhic0=opts['<hic0>'], fhic1=opts['<hic1>'], min_nbin=int(opts['--min_nbin']), f=opts['--f'])
  File "../diffdomain-py3/utils.py", line 398, in comp2domins_by_twtest
    Diffmatnorm = normDiffbyMeanSD(D=Diffmat)
  File "../diffDomain/diffdomain-py3/utils.py", line 291, in normDiffbyMeanSD
    b[k] = np.max(val1)
  File "<__array_function__ internals>", line 180, in amax
  File "/env/products/python/3.8.11/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2791, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/env/products/python/3.8.11/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity

It's seem that this issue isn't visible when you use a hic file.

Thank you in advance for your answers.

FlavianR commented 10 months ago

To add, I tried to used dvsd multiple, like bellow:

python ../diffDomain/diffdomain-py3/diffdomains.p dvsd multiple data1.mcool data2.mcool TAD_list_data1.bed --reso 40000

In the TAD_list_data1.bed, I got all the domains find by the tool SpectralTADs on all the chromosoms. But I get an error :

Traceback (most recent call last):
  File "/env/products/python/3.8.11/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/env/ig/soft/eco3d/diffdomain/2023.11.09/bin/../lib/python3.8/site-packages/diffdomain_py3/diffdomains.py", line 57, in comp2domins_by_twtest_parallel
    tmp_res = comp2domins_by_twtest(chrn=tadb.iloc[i, 0], start=tadb.iloc[i, 1],
  File "/env/export/v_info/q_soft/eco3d/diffdomain/2023.11.09/lib/python3.8/site-packages/diffdomain_py3/utils.py", line 357, in comp2domins_by_twtest
    mat0 = contact_matrix_from_hic(chrn, start, end, reso, fhic0, hicnorm)
  File "/env/export/v_info/q_soft/eco3d/diffdomain/2023.11.09/lib/python3.8/site-packages/diffdomain_py3/utils.py", line 213, in contact_matrix_from_hic
    mat = c.matrix(balance=True).fetch(f'{chrn}:{start}-{end}')
  File "/env/ig/soft/eco3d/hicexplorer/3.7.2/lib/python3.8/site-packages/cooler/core/_selectors.py", line 149, in fetch
    i0, i1, j0, j1 = self._fetch(*args, **kwargs)
  File "/env/ig/soft/eco3d/hicexplorer/3.7.2/lib/python3.8/site-packages/cooler/api.py", line 392, in _fetch
    region1 = parse_region(region, self._chromsizes)
  File "/env/ig/soft/eco3d/hicexplorer/3.7.2/lib/python3.8/site-packages/cooler/util.py", line 178, in parse_region
    raise ValueError(f"Genomic region out of bounds: [{start}, {end})")
ValueError: Genomic region out of bounds: [132160000, 133800000)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/env/ig/soft/eco3d/diffdomain/2023.11.09/bin/../lib/python3.8/site-packages/diffdomain_py3/diffdomains.py", line 76, in <module>
    result.append(i.get())
  File "/env/products/python/3.8.11/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
ValueError: Genomic region out of bounds: [132160000, 133800000)

After some research, I understood that this error is because when I gave a domain you took on a region around to calculte the differential score. But when we are like this at the end of the chromsome it's doesn't work.

Moreover the error "The matrix is too spase or too small at this resolution to be calculated !" keep come back but I don't a mention about this kind of limitation.

If you can help me, I thank you in advance.

Tian-Dechao commented 10 months ago

A response to the error "The matrix is too spase or too small at this resolution to be calculated !" keep come back but I don't a mention about this kind of limitation.

It is a warning message. It is uncommon that Hi-C data at a given resolution, particularly high-resolution say 10 kb, are sparse in some TADs. We use two parameters to deal with this situation: --min_nbin and --f. Briefly, after preprocessing and data imputation, the two Hi-C contact matrices for a particularly TAD has fewer number of columns/rows than user predefined cutoff (--min_nbin), DiffDomain will assign a missing value as the P value and print the warning message "The matrix is too spase or too small at this resolution to be calculated !" Please see our description of the two parameters --min_nbin and --f here .

For a given resolution, if multiple warning messages prompt out, equivalently, many P values are missing values, it is a signal that a lower-resolution should be used instead. One extreme example is using 1 kb resolution Hi-C data to compare K562 TADs between K562 and KBM7 cell lines in Rao et al 2014 paper.

Tian-Dechao commented 10 months ago

A response to the ValueError: Genomic region out of bounds: [132160000, 133800000).

Based on the description, my educated guess is that the reference genome for the Hi-C data data1.mcool data2.mcool is different from the reference genome of the TAD list TAD_list_data1.bed. Please kindly let us know if this is the case.

FlavianR commented 10 months ago

Thanks for your reponses!!

For the TAD_list_data1.bed, the reference genome is the same as both data in mcool file. It has been generated by using SpectralTADs and use with some other tools.

Tian-Dechao commented 10 months ago

Co-author Ming Gu has another guess. It could be due to the rounding-up of the TAD region if the TAD covers the end of the chromosome. Please compare the chromosome size with 133800000, the end of the TAD region. If the chromosome size is smaller than 133800000, please modify the TAD region.

FlavianR commented 10 months ago

I will check if it's this. I will come back to you for an update on this.

And if you have a clue on the first issue i got, it would help :)

FlavianR commented 10 months ago

The check was pretty fast. My chromosome size is tell by hg38. And at the end of my TAD_List.bed for the chr10 the end of the domain is 133 720 000, which is less than 133 797 422.

Tian-Dechao commented 10 months ago

I will check if it's this. I will come back to you for an update on this.

And if you have a clue on the first issue i got, it would help :)

Apologize! Forgot to answer this question. The reason behind the issue ValueError: zero-size array to reduction operation maximum which has no identity is because certain k-off diagonal entries of a Hi-C contact matrix of a particular TAD are all missing values or Infinity values.

Based on the description that .hic file works fine, this happens only for .mcool files is because the user specified normalized method, say KR, is not correctly called in extracting the contact matrices from.mcool files. We will provide a quick fix. Please stay tuned.

mingbao96 commented 10 months ago

Hi @FlavianR ,

Thank you for reaching out with your questions.

Regarding the first issue, the ValueError: zero-size array to reduction operation maximum which has no identity, we have implemented a normalization process for .mcool and .cool files to correctly call user-specified correction methods such as KR.

As for the second question, ValueError: Genomic region out of bounds: [132160000, 133800000), we have updated the method for adjusting the upper bound of the TAD region according to a user-specific resolution.

You can access the latest version by updating with pip install diffDomain-py3 --upgrade. Please let us know if this resolves the issues or if you have any further questions.

Best regards.

FlavianR commented 10 months ago

Thanks for this update!