YaqiangCao / cLoops

Accurate and flexible loops calling tool for 3D genomic data.
https://yaqiangcao.github.io/cLoops/
MIT License

Issue with file preprocessing. #31

Closed BlackPianoCat closed 2 years ago

BlackPianoCat commented 2 years ago

I have a .bedpe file whose head looks like the excerpt below. I wrote code to add orientation so that the last three columns are present, but it still does not work for me. The columns are separated with \t, as required.

chr1    869398  870595  chr1    904618  906401  5   .   +   -
chr1    869398  870595  chr1    937699  942959  13  .   +   -
chr1    869398  870595  chr1    979636  987730  2   .   +   +
chr1    869398  870595  chr1    1001366 1003470 5   .   +   -
chr1    869398  870595  chr1    1058440 1061403 2   .   +   +
chr1    869398  870595  chr1    1118816 1123474 2   .   +   +
chr1    869398  870595  chr1    1250309 1252884 2   .   +   -
chr1    869398  870595  chr1    1290219 1292623 2   .   +   -
chr1    904618  906401  chr1    914193  915144  5   .   +   +

and the command I am trying to run is something like:

cLoops -f GM12878WT_ChIAPET_SMC1A_B1S4B2S2B3S2_2.bedpe.gz -o cLoops_out -minPts 20,30 -eps 2500,5000,7500,10000 -hic -s -j -c chr21

as proposed in the documentation. My goal is to run cLoops so that I can find stripes afterwards. The error I get is:

2022-02-09 18:05:06,608 INFO Command line: cLoops -f GM12878WT_ChIAPET_SMC1A_B1S4B2S2B3S2_2.bedpe.gz -o cLoops_out -m 0 -eps 2500,5000,7500,10000 -minPts 20,30 -p 1 -w False -j True -s True -c chr21 -hic True -cut 0 -plot False -max_cut False
2022-02-09 18:05:06,632 INFO mode:0  eps:[2500, 5000, 7500, 10000]   minPts:[30, 20]     hic:True   
2022-02-09 18:05:06,632 INFO Parsing PETs from GM12878WT_ChIAPET_SMC1A_B1S4B2S2B3S2_2.bedpe.gz, requiring initial distance cutoff > 0
300000 PETs processed from GM12878WT_ChIAPET_SMC1A_B1S4B2S2B3S2_2.bedpe.gz()
2022-02-09 18:05:07,933 INFO Totaly 333808 PETs from GM12878WT_ChIAPET_SMC1A_B1S4B2S2B3S2_2.bedpe.gz, in which 3535 cis PETs
Clustering chr21 and chr21 using eps as 2500, minPts as 30,pre-set distance cutoff as > 0
Clustering chr21 and chr21 finished. Estimated 0 self-ligation reads and 0 inter-ligation reads
2022-02-09 18:05:07,960 INFO ERROR: no inter-ligation PETs detected for eps 2500 minPts 30,can't model the distance cutoff,continue anyway
Clustering chr21 and chr21 using eps as 2500, minPts as 20,pre-set distance cutoff as > 0
Clustering chr21 and chr21 finished. Estimated 0 self-ligation reads and 0 inter-ligation reads
2022-02-09 18:05:07,978 INFO ERROR: no inter-ligation PETs detected for eps 2500 minPts 20,can't model the distance cutoff,continue anyway
Clustering chr21 and chr21 using eps as 5000, minPts as 30,pre-set distance cutoff as > 0
Clustering chr21 and chr21 finished. Estimated 0 self-ligation reads and 0 inter-ligation reads
2022-02-09 18:05:07,996 INFO ERROR: no inter-ligation PETs detected for eps 5000 minPts 30,can't model the distance cutoff,continue anyway
Clustering chr21 and chr21 using eps as 5000, minPts as 20,pre-set distance cutoff as > 0
Clustering chr21 and chr21 finished. Estimated 0 self-ligation reads and 0 inter-ligation reads
2022-02-09 18:05:08,015 INFO ERROR: no inter-ligation PETs detected for eps 5000 minPts 20,can't model the distance cutoff,continue anyway
Clustering chr21 and chr21 using eps as 7500, minPts as 30,pre-set distance cutoff as > 0
Clustering chr21 and chr21 finished. Estimated 0 self-ligation reads and 0 inter-ligation reads
2022-02-09 18:05:08,034 INFO ERROR: no inter-ligation PETs detected for eps 7500 minPts 30,can't model the distance cutoff,continue anyway
Clustering chr21 and chr21 using eps as 7500, minPts as 20,pre-set distance cutoff as > 0
Clustering chr21 and chr21 finished. Estimated 0 self-ligation reads and 0 inter-ligation reads
2022-02-09 18:05:08,052 INFO ERROR: no inter-ligation PETs detected for eps 7500 minPts 20,can't model the distance cutoff,continue anyway
Clustering chr21 and chr21 using eps as 10000, minPts as 30,pre-set distance cutoff as > 0
Clustering chr21 and chr21 finished. Estimated 0 self-ligation reads and 0 inter-ligation reads
2022-02-09 18:05:08,070 INFO ERROR: no inter-ligation PETs detected for eps 10000 minPts 30,can't model the distance cutoff,continue anyway
Clustering chr21 and chr21 using eps as 10000, minPts as 20,pre-set distance cutoff as > 0
Clustering chr21 and chr21 finished. Estimated 0 self-ligation reads and 0 inter-ligation reads
2022-02-09 18:05:08,089 INFO ERROR: no inter-ligation PETs detected for eps 10000 minPts 20,can't model the distance cutoff,continue anyway
Traceback (most recent call last):
  File "/home/blackpianocat/anaconda3/envs/cLoops/bin/cLoops", line 11, in <module>
    load_entry_point('cLoops==0.93', 'console_scripts', 'cLoops')()
  File "build/bdist.linux-x86_64/egg/cLoops/pipe.py", line 349, in main
  File "build/bdist.linux-x86_64/egg/cLoops/pipe.py", line 280, in pipe
  File "/home/blackpianocat/anaconda3/envs/cLoops/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2618, in amin
    initial=initial)
  File "/home/blackpianocat/anaconda3/envs/cLoops/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity

I also tried your new tool, cLoops2, but I still have problems: after preprocessing it gives me empty files.
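The final ValueError is what NumPy raises when a minimum is taken over an empty array, which is consistent with the log lines reporting 0 self-ligation and 0 inter-ligation reads for every eps/minPts combination. The following is a minimal sketch of that failure mode and a guard against it. It makes no claim about what pipe.py line 280 actually does, and `safe_min` is a hypothetical helper, not part of cLoops.

```python
import numpy as np


def safe_min(values, default=None):
    """Return the minimum of `values`, or `default` when there is nothing to reduce.

    Hypothetical helper for illustration only; not part of cLoops.
    """
    arr = np.asarray(values)
    if arr.size == 0:
        # np.min on an empty array raises ValueError, so return early.
        return default
    return arr.min()


# Reproduce the same class of error seen in the traceback.
try:
    np.min(np.array([]))
except ValueError as err:
    print("NumPy refuses to reduce an empty array:", err)
```

In practice the more useful fix is upstream: the clustering step found no candidate PETs at all, so the input or the parameters need attention before any distance cutoff can be estimated.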

YaqiangCao commented 2 years ago

Hi, could you please share the data for a small chromosome such as chr21 so that I can take a closer look? Please send it to caoyaqiang0410@gmail.com. Your data appears to be ChIA-PET data, so I suggest running with -eps 1000 -minPts 10 for an initial trial. Best, Yaqiang
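One way to produce such a small test file is to keep only the records whose two ends both fall on the chosen chromosome. The sketch below assumes a tab-separated BEDPE file (optionally gzipped) with the chromosome names in columns 1 and 4, as in the excerpt from the original post. `extract_chromosome` is a hypothetical helper, not a cLoops command.

```python
import gzip


def _open(path, mode):
    """Open plain or gzipped text files based on the file extension."""
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def extract_chromosome(src, dst, chrom="chr21"):
    """Copy records with both ends on `chrom` from `src` to `dst`.

    Returns the number of records written. Lines with fewer than six
    tab-separated fields are skipped, since they cannot be valid BEDPE.
    """
    kept = 0
    with _open(src, "rt") as fin, _open(dst, "wt") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 6 and fields[0] == chrom and fields[3] == chrom:
                fout.write(line)
                kept += 1
    return kept
```

The returned count is also a quick sanity check: if it is very small, there is little for the clustering step to work with on that chromosome, whatever eps and minPts are used.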

BlackPianoCat commented 2 years ago

Good morning,

Thank you for your fast answer. I have emailed you the data, and I also tried running it with the parameters you suggested, but I still get the same error.

GM12878WT_ChIAPET_SMC1A_B1S4B2S2B3S2_2.bedpe.gz

YaqiangCao commented 2 years ago

Hi, the file can be processed by cLoops2 pre. I converted it with cLoops2 dump -washU, and it is hard to see loops in the genome browser because there are too few PETs. To my knowledge, ChIA-PET data should ideally contain more than 20 million PETs. I am not sure how many you have or whether the library passed quality control. Best, Yaqiang

BlackPianoCat commented 2 years ago

To create this file I did some filtering to find the orientation of the CTCF motifs. The script keeps only the lines in which it can find these motifs and discards all the others. So I probably need to change some parameters of my script to get a more complete file. Thank you!

BlackPianoCat commented 2 years ago

Good morning. Unfortunately, I have not managed to resolve my issue. I used a script to find the CTCF motifs and fill in the columns with + and -, but I am not sure whether it works or whether it is a correct procedure (I am still new to bioinformatics).

The other thing I tried was your hicpropairs2bedpe.py script, which I understood to convert a .hic file to .bedpe. Starting from a .hic file, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 40: invalid start byte

Therefore, if it is easy for you, I would like to ask one simple question: what input data does your algorithm work with? What kind of data should I start with, and what kind of preprocessing should I do?

YaqiangCao commented 2 years ago

hicpropairs2bedpe.py is used to convert an .allValidPairs file to a BEDPE file. .hic files are not supported.
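The UnicodeDecodeError above is the typical symptom of handing a binary file to a parser that expects text. As a rough pre-check, the sketch below reads the first few kilobytes and reports whether they look like UTF-8 text. `looks_like_text` is a hypothetical helper and not part of cLoops. A gzipped text file will also be reported as binary, since the check looks at the raw bytes.

```python
import codecs


def looks_like_text(path, sample_size=4096):
    """Heuristically decide whether `path` starts with UTF-8 text.

    Hypothetical helper for illustration only. Returns False if the sample
    contains NUL bytes or bytes that are not valid UTF-8.
    """
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character cut off
    # at the end of the sample, which a plain .decode() would reject.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

If this returns False for an uncompressed input, the file is probably not one of the line-based text formats the conversion script expects.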

YaqiangCao commented 2 years ago

For Hi-C data, use HiC-Pro for preprocessing, then hicpropairs2bedpe.py to produce a BEDPE file as input for cLoops2. For ChIA-PET data, process the PETs into BEDPE with the preprocessing tools of Mango or ChIA-PET Tools. Test data are also provided at https://github.com/YaqiangCao/cLoops/tree/master/examples.
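To make the Hi-C path more concrete, here is a minimal sketch of the kind of reshaping involved. It is not hicpropairs2bedpe.py and may differ from what that script does. It assumes the HiC-Pro .allValidPairs layout of tab-separated read name, chr1, pos1, strand1, chr2, pos2, strand2, followed by further fields. It also assumes 1-based positions and turns each into a 1-bp interval, which is a choice made here for illustration. `valid_pair_to_bedpe` is a hypothetical name.

```python
def valid_pair_to_bedpe(line):
    """Convert one .allValidPairs line to a 10-column BEDPE line.

    Illustrative only; not the conversion script shipped with cLoops.
    Assumes: name, chr1, pos1, strand1, chr2, pos2, strand2 come first,
    and positions are 1-based.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError("expected at least 7 tab-separated fields")
    name, chrom1, pos1, strand1, chrom2, pos2, strand2 = fields[:7]
    pos1, pos2 = int(pos1), int(pos2)
    # Assumption: represent each read end as a 1-bp, 0-based half-open interval.
    start1, start2 = max(0, pos1 - 1), max(0, pos2 - 1)
    out = [
        chrom1, str(start1), str(start1 + 1),
        chrom2, str(start2), str(start2 + 1),
        name, ".", strand1, strand2,
    ]
    return "\t".join(out)
```

For example, a line beginning with read name r1, then chr1, 100, +, chr1, 500, - would become a BEDPE record with intervals 99-100 and 499-500, the read name in column 7 and the two strands in columns 9 and 10.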

BlackPianoCat commented 2 years ago

Yes, I have checked the test file; thank you for the information. In the end, I proceeded by converting the ChIA-PET data to BEDPE with straw. It works, but I still see many false-positive loops and I cannot detect stripes. I believe this is related to parameter tuning. I will also check the software you suggested for preprocessing. Thank you again!