Tian-Dechao / diffDomain

DiffDomain is a statistically sound method for detecting differential TADs between conditions
MIT License
13 stars 4 forks source link

1 not found in the file using 'dvsd multiple' #5

Closed tolender closed 1 year ago

tolender commented 1 year ago

diffDomain is working for "dvsd one", however I am unable to get "dvsd multiple" to work with 2 hic files and a bed file.

Using the code below, I get "1 not found in the file." repeated for the total number of TADs in the bed file, then the program hangs indefinitely. If the chr column is populated with "chr1", I get the message below. If the chr column is populated with "1", I get the message below. If I explicitly add --chrn "1" to the command, I get the message below. If I explicitly add --chrn "chr1" to the command, the program immediately finishes and the output file only has the # commented rows.

_python /home/########/diffDomainENV/diffDomain/diffdomain-py3/diffdomains.py dvsd multiple Control.hic Treatment.hic Control_TADs_chr1.txt --ofile Ctrl.vs.Treatment.at.Ctrl_chr1.txt --hicnorm KR --ncore 1 --reso 10000 --minnbin 5 diffdomains.py 1 not found in the file. 1 not found in the file. 1 not found in the file. 1 not found in the file. etc etc etc.

mingbao96 commented 1 year ago

Have you tested the two hic files in 'dvsd one' on chr1, just like below? python diffdomains.py dvsd one 1 1000000 1700000 Control.hic Treatment.hic --reso 50000

If you can run them and get results from the two files on chr1 using 'dvsd one' respectively, there may be some neglected errors in our scripts. If not, that means hic-straw can not find the 1st chromosome in the hic files.

Actually, in our scripts, if we find 'chr1' in your tadlist, we will set it as '1', and hic-straw can match chromosome name automatically.

In addition, maybe we should confirm the format of tadlist. We expect a tadlist like below, whose first column is chromosome name, second column is the locus where the TAD starts, and third column is the locus where the TAD ends. We will only use the first 3 columns in a tadlist (framed in red). And whatever its column names are, it should have a header (framed in green).

image

If you'd like to stay in touch, we will be grateful. Thanks for your attention.

tolender commented 1 year ago

Your suggestion produces an error as inputting a single HIC outputs the Usage.

Here are two regions that I know are the same and different between control and treatment.

This is a TAD that does not change at all when viewed on Juicebox

$ python /project/###########/PythonEnv/diffDomainENV/diffDomain/diffdomain-py3/diffdomains.py dvsd one chr1 69220001 69830000 Control.hic Treatment.hic --hicnorm SCALE --ofile Test1.txt --ncore 1 --reso 10000 diffdomains.py 0 0 chr1 1 69220001 2 69830000 3 chrchr1:69220001-69830000 4 3.566469 5 0.00055 6 61

Here is a TAD that clearly changes when viewed on Juicebox.

$ python /project/###########/PythonEnv/diffDomainENV/diffDomain/diffdomain-py3/diffdomains.py dvsd one chr14 24950000 25580000 Control.hic Treatment.hic --hicnorm SCALE --ofile Test2.txt --ncore 1 --reso 10000 diffdomains.py 0 0 chr14 1 24950000 2 25580000 3 chrchr14:24950000-25580000 4 6.293985 5 0.000001 6 63

What is notable is 'chrchr' being written in row 3 in both cases, and the region that definitely does not change still has a p-value of 0.00055. The output txt file also writes out the files vertically and not as a tab separate row.

Also, if it is helpful is identifying the problem, if I replace "chr14" with "14", I get a similar output to the initial issue. $ python /project/############/PythonEnv/diffDomainENV/diffDomain/diffdomain-py3/diffdomains.py dvsd one 14 24950000 25580000 Control.hic Treatment.hic --hicnorm SCALE --ofile Test2.txt --ncore 1 --reso 10000 diffdomains.py 14 not found in the file.

Additionally, Adding the header "chr1 x1 x2" to the top of the bed file makes no difference to the output of the initial issue.

Which .hic version should be used with for diffDomain? I believe the latest .hic is v9 and my .hic files may be a few versions behind. https://github.com/aidenlab/hictools

Many thanks!

mingbao96 commented 1 year ago

About the initial problem, based on the information you provided, I guess the issue is hic-straw can not match the chromosome names correctly. When you use 'dvsd one', hic-straw will directly find the chromosome matched with 'chr14'. But when you use 'dvsd multiple', after loading tadlist, hic-straw will try to find the chromosome matched with '14'.

I think we should adjust the function for loading tadlist, so I have updated the package.You can update it by pip. After setting the chromosome names like 'chr1' , please try 'dvsd multiple' again.

We are grateful.

mingbao96 commented 1 year ago

Regarding the picked TAD chr1:69220001-69830000, it is likely a false positive based on your description. In general, DiffDomain controls FPR when simultaneously comparing multiple TADs. We recommend trying the "dvsd multiple" to test TADs from all chromosomes.

mingbao96 commented 1 year ago

About the output format of 'dvsd one', the result of 'dvsd one' indeed was written as one column. I have adjusted it to keep the same with 'dvsd multiple'.

mingbao96 commented 1 year ago

About adding a header to tadlist, yes, it can only avoid omitting the first TAD we want to test.

In addition, to our best knowledge, hic version is not an issue.

tolender commented 1 year ago

I appreciate your rapid fixes to the program! I updated to diffDomain_py3_0.1.3

The output for "dvsd one" still works correctly and indeed outputs in row format now, thanks for that.

I am getting a new error regarding syntax.

_python /home/######/software/anaconda3/lib/python3.9/site-packages/diffdomain_py3/diffdomains.py dvsd multiple CONTROL.hic TREATMENT.hic CONTROL_chr1.txt --hicnorm SCALE --ofile Test3.txt --ncore 1 --reso 10000
File "/home/######/software/anaconda3/lib/python3.9/site-packages/diffdomainpy3/diffdomains.py", line 67 P = Pool(int(opts['--ncore'])) ^ SyntaxError: invalid syntax

mingbao96 commented 1 year ago

Thanks, I met this error in my local test and fixed it. The newest version is 0.1.7, you can reinstall and try it.

tolender commented 1 year ago

SUCCESS!!!

I have run 'dvsd multiple' for all chromosomes in a single bed file, so it is working! I also ran 'dvsd one', adjustment, and classification.py with the output and it is all working without errors.

There is one question I had before closing (to see if it is a expected or a bug). I noticed that at 10kb resolution, about 9% of all my TADs were outputting empty score and pvalue columns, but only 0.6% of peaks were having the same problem at 25kb (example below).

Is there a recommended lower limit to resolution? These TADs look normal when viewed in Juicebox, and they range anywhere from 200kb to 3MB. I am just curious if this algorithm is sensitive to higher resolutions or low-mappability regions or some other HiC artefacts .

Example output analysis performed at 10kb chr1 | 6290000 | 6390000 | chr1:6290000-6390000 | -1.71082 | 0.639974 | 10 chr1 | 6390000 | 7230000 | chr1:6390000-7230000 | 0.154243 | 0.141746 | 84 chr1 | 6390000 | 6510000 | chr1:6390000-6510000 | -1.09294 | 0.444759 | 12 chr1 | 7060000 | 7230000 | chr1:7060000-7230000 | 0.004967 | 0.167188 | 17 chr1 | 7480000 | 9240000 | chr1:7480000-9240000 |   |   | 176 chr1 | 7480000 | 9120000 | chr1:7480000-9120000 |   |   | 164 chr1 | 7480000 | 7860000 | chr1:7480000-7860000 | -0.54865 | 0.288654 | 38 chr1 | 7860000 | 9120000 | chr1:7860000-9120000 |   |   | 126 chr1 | 7860000 | 7960000 | chr1:7860000-7960000 | -0.6523 | 0.315968 | 10 chr1 | 7960000 | 8210000 | chr1:7960000-8210000 | -0.16891 | 0.200731 | 25 chr1 | 9120000 | 9240000 | chr1:9120000-9240000 | -0.58315 | 0.297597 | 12 chr1 | 9240000 | 10730000 | chr1:9240000-10730000 | -6.84651 | 1 | 148 chr1 | 9240000 | 9540000 | chr1:9240000-9540000 |   |   | 30 chr1 | 9540000 | 10730000 | chr1:9540000-10730000 |   |   | 119 chr1 | 9540000 | 10260000 | chr1:9540000-10260000 | -1.78188 | 0.661726 | 72 chr1 | 9540000 | 9700000 | chr1:9540000-9700000 | 0.137286 | 0.144483 | 16 chr1 | 9700000 | 10260000 | chr1:9700000-10260000 | 0.08217 | 0.153649 | 56 chr1 | 9760000 | 10260000 | chr1:9760000-10260000 | -1.65472 | 0.622564 | 50 chr1 | 9760000 | 10030000 | chr1:9760000-10030000 | -1.45612 | 0.559782 | 27 chr1 | 9760000 | 9960000 | chr1:9760000-9960000 | -1.61445 | 0.609955 | 20 chr1 | 10260000 | 10730000 | chr1:10260000-10730000 | -1.9394 | 0.708391 | 47 chr1 | 10730000 | 10910000 | chr1:10730000-10910000 | -1.66329 | 0.625234 | 18

Example output analysis performed at 25kb resolution chr1 | 6290000 | 7230000 | chr1:6290000-7230000 | -2.45335 | 0.83879 | 39 chr1 | 6390000 | 6510000 | chr1:6390000-6510000 | -2.06501 | 0.74369 | 6 chr1 | 6390000 | 7230000 | chr1:6390000-7230000 | -1.14184 | 0.46000 | 35 chr1 | 7060000 | 7230000 | chr1:7060000-7230000 | -1.80411 | 0.66845 | 8 chr1 | 7410000 | 7480000 | chr1:7410000-7480000 | -1.42191 | 0.54887 | 4 chr1 | 7480000 | 7860000 | chr1:7480000-7860000 | -0.13597 | 0.19405 | 16 chr1 | 7480000 | 9120000 | chr1:7480000-9120000 | 0.95000 | 0.05207 | 66 chr1 | 7480000 | 9240000 | chr1:7480000-9240000 | -0.05849 | 0.17894 | 71 chr1 | 7860000 | 7960000 | chr1:7860000-7960000 | -1.52820 | 0.58271 | 5 chr1 | 7960000 | 8210000 | chr1:7960000-8210000 | -1.98174 | 0.72050 | 11 chr1 | 9120000 | 9240000 | chr1:9120000-9240000 | -1.39040 | 0.53881 | 6 chr1 | 9240000 | 9540000 | chr1:9240000-9540000 | -1.99368 | 0.72388 | 13 chr1 | 9240000 | 10730000 | chr1:9240000-10730000 | 1.60547 | 0.01976 | 61 chr1 | 9540000 | 9700000 | chr1:9540000-9700000 | -1.48665 | 0.56951 | 7 chr1 | 9540000 | 10260000 | chr1:9540000-10260000 | 0.21241 | 0.13265 | 30 chr1 | 9540000 | 10730000 | chr1:9540000-10730000 | -0.45794 | 0.26587 | 49 chr1 | 9700000 | 9760000 | chr1:9700000-9760000 | -1.61932 | 0.61148 | 3 chr1 | 9700000 | 10260000 | chr1:9700000-10260000 | -1.47112 | 0.56456 | 23 chr1 | 9760000 | 9960000 | chr1:9760000-9960000 | -2.09243 | 0.75113 | 9 chr1 | 9760000 | 10030000 | chr1:9760000-10030000 | -0.05813 | 0.17887 | 12 chr1 | 9760000 | 10260000 | chr1:9760000-10260000 | -0.98431 | 0.41144 | 21 chr1 | 9960000 | 10030000 | chr1:9960000-10030000 | -1.67134 | 0.62774 | 4 chr1 | 10260000 | 10730000 | chr1:10260000-10730000 | -1.26693 | 0.49946 | 20 chr1 | 10730000 | 10910000 | chr1:10730000-10910000 | -1.88362 | 0.69214 | 8

mingbao96 commented 1 year ago

The factor is the sequencing coverage. When sequencing coverage is low but a high resolution, say 1 kb, is chosen, the Hi-C contact matrix of a TAD has a high percentage of missing values. So we set a filtering parameter using ‘--f’ to remove Hi-C contact matrices with high percentages of missing values.

For example, when setting‘--f 0.6’, in the contact matrix of a TAD, if the number of the columns, whose proportions of missing values is lower than 40%, is smaller than min_nbin, DiffDomain will skip comparing this TAD anymore and set its result (statistics, the 5th column ; P value, the 6th column) as NAN.

tolender commented 1 year ago

I see, I will close this issue as we are getting off topic, but is it recommended to stay with the default '--f=0.5' only at the default resolution of 100kb? Should it be changed if the resolution is decreased or is that not recommended? ex. can the --f be reduced to 0.3 or 0.2 at 5/10/25kb, or will that start introducing uncertainty in the matrix algorithm due to sparsity?

Thank you again for being so fast in the migration to py3 and updating the code so quickly! It is greatly appreciated!