deeptools / HiCExplorer

HiCExplorer is a powerful and easy-to-use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

hicDifferentialTAD skips regions #725

Closed dawe closed 3 years ago

dawe commented 3 years ago

Hi all, I'm running hicDifferentialTAD on TADs identified by hicFindTADs on my data. I've noticed that some domains appear in neither the accepted nor the rejected list:

$ wc -l 164_vs_168_5000_*
   3583 164_vs_168_5000_accepted.diff_tad
   7845 164_vs_168_5000_rejected.diff_tad

$ wc -l 164_normalized_5000_KR_domains.bed 
11474 164_normalized_5000_KR_domains.bed

$ cat 164_vs_168_5000_accepted.diff_tad 164_vs_168_5000_rejected.diff_tad | bedtools sort -i stdin | bedtools intersect -a 164_normalized_5000_KR_domains.bed -b stdin -v | wc -l
54
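The same check can be sketched in Python without bedtools. This is a simplification: it compares exact (chrom, start, end) records, whereas bedtools intersect matches by overlap, so treat it as a rough cross-check rather than an equivalent command.

```python
# Minimal sketch: find domains that appear in neither output list.
# Compares exact (chrom, start, end) records; bedtools intersect -v
# would instead drop any domain with a partial overlap.
def bed_records(lines):
    return {tuple(line.split()[:3]) for line in lines if line.strip()}

def untested_domains(domain_lines, accepted_lines, rejected_lines):
    tested = bed_records(accepted_lines) | bed_records(rejected_lines)
    return sorted(bed_records(domain_lines) - tested)
```

Each argument can be any iterable of BED lines, e.g. an open file handle: untested_domains(open(domains), open(accepted), open(rejected)).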

TADs were computed like this:

MATRIX=${SAMPLE}_normalized_${BINSIZE}_KR.cool
hicFindTADs --matrix $MATRIX --outPrefix ${MATRIX%%.cool} --correctForMultipleTesting fdr -p 4

and the differential TADs like this:

TMATRIX=${TARGET}_normalized_${BINSIZE}_KR.cool
CMATRIX=${CONTROL}_normalized_${BINSIZE}_KR.cool
DOMAINS=${TARGET}_normalized_${BINSIZE}_KR_domains.bed
hicDifferentialTAD -tm $TMATRIX -cm $CMATRIX -td $DOMAINS -o ${TARGET}_vs_${CONTROL}_${BINSIZE} -t 4

Looking at the positions of the skipped TADs in the domains file, I realized they are the first and last TAD of each chromosome. That may make sense when testing for left or right inter-TAD differences, but not for intra-TAD differences.

Of note, I've also tried running hicDifferentialTAD with the -m intra-TAD option, but I get weird results: only two domains are tested (the first two on chromosome 1), and that test is repeated hundreds of times.

joachimwolff commented 3 years ago

Hi,

thanks for testing this and giving me feedback. I will have a look at this and come back to you.

Best,

Joachim

joachimwolff commented 3 years ago

As a quick test: can you run it with one thread instead of multiple? If that solves it, it could be a parallelization issue.

dawe commented 3 years ago

I've tried a single thread and it does change things, but first a step back: looking at the TAD IDs, the multithreaded run skips these TADs:

1       246700000       247700000       ID_0.01_168
2       1800000 3800000 ID_0.01_169
2       241100000       241800000       ID_0.01_339
3       3300000 4300000 ID_0.01_340
3       194100000       195700000       ID_0.01_481
4       1800000 3000000 ID_0.01_482
4       188500000       189400000       ID_0.01_618
5       1700000 3700000 ID_0.01_619
5       3700000 4900000 ID_0.01_620
5       177600000       179000000       ID_0.01_747
6       1500000 2700000 ID_0.01_748
6       2700000 4000000 ID_0.01_749
6       168500000       169300000       ID_0.01_871
7       1600000 2500000 ID_0.01_872
7       2500000 4600000 ID_0.01_873
7       155600000       157200000       ID_0.01_990
X       3800000 7900000 ID_0.01_991
X       7900000 8800000 ID_0.01_992
X       153000000       153800000       ID_0.01_1093
8       2000000 3200000 ID_0.01_1094
8       3200000 4900000 ID_0.01_1095
8       143700000       144400000       ID_0.01_1206
9       1000000 2100000 ID_0.01_1207
9       2100000 2900000 ID_0.01_1208
9       138500000       139600000       ID_0.01_1282
10      1400000 3200000 ID_0.01_1283
10      3200000 4800000 ID_0.01_1284
10      132300000       133700000       ID_0.01_1380
11      2200000 4200000 ID_0.01_1381
11      4200000 5100000 ID_0.01_1382
11      132900000       133800000       ID_0.01_1482
12      2100000 3200000 ID_0.01_1483
12      3200000 4000000 ID_0.01_1484
13      20700000        22200000        ID_0.01_1593
13      22200000        23300000        ID_0.01_1594
14      20200000        22100000        ID_0.01_1665
14      22100000        22700000        ID_0.01_1666
15      22600000        23900000        ID_0.01_1730
15      23900000        25800000        ID_0.01_1731
16      2000000 3100000 ID_0.01_1795
16      3100000 4400000 ID_0.01_1796
17      1000000 2300000 ID_0.01_1855
17      2300000 3500000 ID_0.01_1856
18      2600000 4000000 ID_0.01_1915
18      4000000 5000000 ID_0.01_1916
20      1500000 2700000 ID_0.01_1969
20      2700000 3700000 ID_0.01_1970
19      2700000 4700000 ID_0.01_2010
19      4700000 5700000 ID_0.01_2011
22      17400000        18800000        ID_0.01_2048
22      18800000        20200000        ID_0.01_2049
21      15600000        17000000        ID_0.01_2070
21      17000000        18700000        ID_0.01_2071

As you can see, the last TAD of each chromosome and the next one (the first TAD of the following chromosome) are skipped. The first TAD of chromosome 1, however, is included. Switching to a single thread (-t 1) fixes the "last TAD" problem but not the "first TAD" one:

2       1800000 3800000 ID_0.01_169
3       3300000 4300000 ID_0.01_340
4       1800000 3000000 ID_0.01_482
5       1700000 3700000 ID_0.01_619
6       1500000 2700000 ID_0.01_748
7       1600000 2500000 ID_0.01_872
X       3800000 7900000 ID_0.01_991
8       2000000 3200000 ID_0.01_1094
9       1000000 2100000 ID_0.01_1207
10      1400000 3200000 ID_0.01_1283
11      2200000 4200000 ID_0.01_1381
12      2100000 3200000 ID_0.01_1483
13      20700000        22200000        ID_0.01_1593
14      20200000        22100000        ID_0.01_1665
15      22600000        23900000        ID_0.01_1730
16      2000000 3100000 ID_0.01_1795
17      1000000 2300000 ID_0.01_1855
18      2600000 4000000 ID_0.01_1915
20      1500000 2700000 ID_0.01_1969
19      2700000 4700000 ID_0.01_2010
22      17400000        18800000        ID_0.01_2048
21      15600000        17000000        ID_0.01_2070
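That pattern is what you would get if each worker's chunk silently dropped its last element. A toy illustration of this failure mode (my own sketch, not the actual hicDifferentialTAD code):

```python
# Toy illustration only: a chunker with an off-by-one in the slice end
# loses the last element of every chunk, so each worker silently skips
# one TAD at its chunk border.
def chunk_off_by_one(tads, n_threads):
    size = len(tads) // n_threads
    # buggy: "- 1" excludes each chunk's last element
    return [tads[i * size:(i + 1) * size - 1] for i in range(n_threads)]

tads = list(range(8))
processed = [t for chunk in chunk_off_by_one(tads, 4) for t in chunk]
missing = sorted(set(tads) - set(processed))  # one TAD lost per chunk
```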
dawe commented 3 years ago

A quick workaround: for each chromosome, I padded the TADs by adding the region from 0 to the first TAD and from the last TAD to the chromosome end. Then I ran hicDifferentialTAD in single-thread mode:

$ bedtools complement -g ${GENOME}.tab -i ${DOMAINS} | cat -n | awk '{OFS="\t"; print $2, $3, $4, "ID_COMP_"$1, 0, "."}'  > TAD_Complement.bed

$ cat TAD_Complement.bed ${DOMAINS} | cut -f 1-6 | bedtools sort > TAD_PAD.bed

Then I used TAD_PAD.bed as the argument to the -td flag.
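The same padding can be sketched without bedtools. Assumptions, loudly: chrom_sizes is a plain dict of chromosome lengths, domains are sorted (chrom, start, end, name) tuples, and unlike bedtools complement this only fills the leading and trailing gaps per chromosome, not any internal gaps between TADs:

```python
# Sketch of the padding workaround: for each chromosome, prepend a filler
# interval before the first TAD and append one after the last TAD.
# Only end gaps are filled; bedtools complement would also fill internal gaps.
def pad_domains(domains, chrom_sizes):
    per_chrom = {}
    for chrom, start, end, name in domains:
        per_chrom.setdefault(chrom, []).append((start, end, name))
    padded, comp_id = [], 0
    for chrom, tads in per_chrom.items():
        if tads[0][0] > 0:
            comp_id += 1
            padded.append((chrom, 0, tads[0][0], f"ID_COMP_{comp_id}"))
        padded.extend((chrom, s, e, n) for s, e, n in tads)
        if tads[-1][1] < chrom_sizes[chrom]:
            comp_id += 1
            padded.append((chrom, tads[-1][1], chrom_sizes[chrom], f"ID_COMP_{comp_id}"))
    return padded
```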

joachimwolff commented 3 years ago

Ok, this means we have an off-by-one issue somewhere in the boundary-condition checks.

dawe commented 3 years ago

Could #722 be related to this? I hope so :-)

joachimwolff commented 3 years ago

This bug should be fixed by the changes in the current develop branch. Can you check this, @dawe? Thanks a lot!