ay-lab / HiCtrans

HiCtrans is a pipeline to call translocations from Hi-C data

Too many false positives? #3

Closed ChenFengling closed 6 years ago

ChenFengling commented 6 years ago

I used HiCtrans to call translocations in T47D, the cell line used in your Bioinformatics paper, but the results show many false positives. Please check the following results. Translocations are detected at the first two correct spots, which suggests my input is correct.

image image

But translocations are also detected at many more false locations!

image image

In fact, this is not an isolated case: I detected 942 translocation events in T47D instead of the 25 sites reported in your paper! It is confusing.

ay-lab commented 6 years ago

Are you using 40 kb resolution to generate the result? There is a "count_filter" option in "run_HiCtrans.pl". This filter removes breakpoints below a threshold (generally the global inter-chromosomal mean count). You can change it based on your data and resolution.

Also, can you please attach the translocation result file? I will have a look at the data.

In general, a translocation can have multiple breakpoints but not 942. There is some filtering issue here.
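
As a rough illustration of the kind of count filtering described above (this is not the code in run_HiCtrans.pl), here is a minimal Python sketch. The triple layout, the function names, and the choice to average over non-zero cells only are all assumptions made for the example.

```python
from statistics import mean

# Sparse inter-chromosomal contacts as (bin_a, bin_b, raw_count) triples.
# This layout is hypothetical, not HiCtrans's actual input format.
contacts = [
    (100, 200, 1),
    (100, 201, 2),
    (101, 200, 1),
    (150, 300, 12),
    (151, 300, 9),
]

def global_mean_count(contacts):
    """Mean raw count over the observed (non-zero) cells (an assumption)."""
    return mean(count for _, _, count in contacts)

def filter_cells(contacts, threshold):
    """Keep only the cells whose raw count is at least `threshold`."""
    return [cell for cell in contacts if cell[2] >= threshold]

background = global_mean_count(contacts)
kept = filter_cells(contacts, threshold=background)
print(background, kept)
```

Raising the threshold above the background mean trades sensitivity for fewer spurious candidates, so the right value depends on sequencing depth and resolution.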

ay-lab commented 6 years ago

Just an update, I ran HiCtrans on T47D (https://www.encodeproject.org/experiments/ENCSR549MGQ/) chr1-chr3 file and it showed no translocation event within the pair.

ChenFengling commented 6 years ago

I use the same dataset, but the preprocessing step is different: I generated the contact map from a .hic file and reformatted the sparse matrix. I ran HiCtrans on 40 kb contact maps in T47D, and the results are attached. I found that the filter value is 10 and the average inter-chromosomal count is 1~2. All.chromosome.Translocation.zip

ay-lab commented 6 years ago

I guess when you are creating the hic file you are using the juicer "Pre" command. Can you check if it is normalizing the counts by some means before creating the hic file? HiCtrans expects raw contact counts.

In the paper, we used data processed with the HiC-Pro pipeline. Although the normalized counts may vary between pipelines, the raw counts should not differ. In the HiC-Pro processed file, the highest contact count for the chr1-chr3 pair is 7, while in your attached file there is an interaction with a count of 60.
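
One quick way to check whether a matrix still holds raw counts is to test that every value is a non-negative whole number and to look at the maximum. The sketch below is only an illustration with made-up values; `looks_raw` is a hypothetical helper, not part of HiCtrans, HiC-Pro, or Juicer.

```python
def looks_raw(counts):
    """Return True if every count is a non-negative whole number."""
    return all(c >= 0 and float(c).is_integer() for c in counts)

# Made-up example values, not taken from the T47D data.
raw_counts = [1, 2, 7, 1]
normalized_counts = [0.83, 1.91, 6.4, 1.02]

print(looks_raw(raw_counts), max(raw_counts))
print(looks_raw(normalized_counts))
```

A matrix that passes this check can still be wrong for other reasons (for example, unfiltered reads), so it only rules out normalization.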

ChenFengling commented 6 years ago

I dumped the raw data from the .hic file, checked the validpairs file, and found the 60 contacts between "chr3 144600000 144640000" and "chr1 207520000 207560000". I don't know why, but using a different processing step makes even the raw data differ.

I checked the chr1_chr3 40 kb matrix and found that 274 cells have >3 contacts. These cells are sparsely distributed on the map, so let's consider them noise. I found that HiCtrans is very sensitive to this "local noise" with high contact counts. Here I plot the translocation output and the cells with >3 contacts.

image image image

Although something must be wrong in the preprocessing step, it would be better if the method avoided these false positives.
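
The check described above (cells with >3 contacts, scattered across the map) can be sketched roughly as follows. The data are made up, and the one-bin neighbourhood radius and function names are assumptions for the example.

```python
# Sparse matrix as (bin_a, bin_b, raw_count) triples; values are invented.
contacts = [
    (10, 20, 60),   # a lone high-count cell
    (50, 70, 5),    # three adjacent high-count cells
    (50, 71, 6),
    (51, 70, 4),
    (80, 90, 2),    # below the cutoff
]

def high_cells(contacts, cutoff=3):
    """Cells whose count is strictly greater than `cutoff`."""
    return {(a, b): n for a, b, n in contacts if n > cutoff}

def isolated(cells, radius=1):
    """High-count cells with no other high-count cell within `radius` bins."""
    result = []
    for a, b in cells:
        has_neighbour = any(
            (x, y) != (a, b) and abs(x - a) <= radius and abs(y - b) <= radius
            for x, y in cells
        )
        if not has_neighbour:
            result.append((a, b))
    return sorted(result)

cells = high_cells(contacts)
print(len(cells), isolated(cells))
```

Cells reported by `isolated` have no supporting signal around them, which is what one would expect from noise rather than from a translocation.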

ay-lab commented 6 years ago

I just checked that ~23 out of 25 translocations that we report in the paper are also present in the list. Most of the regions which didn't have any count in the HiC-Pro processed file are showing very high counts here (> 30/40). With a background level of ~2 counts, it is no surprise that it is picking that up.

If possible, I would request you to please upload this chr1-chr3 file here. I will have a look at it in detail.

ChenFengling commented 6 years ago

Thanks for your kind and quick response! Attached is my file. chr1-chr3.zip

ChenFengling commented 6 years ago

I found that this "noise" was caused by multiple mapping events. When I used Juicer, it didn't filter out some multiple mapping events, which results in extremely high contact counts in some cells of the map. I will fix this issue. However, I still think HiCtrans should have some statistical test to avoid these false positives, since a translocation shows up as a difference in a whole local pattern, not in a single cell.
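
To illustrate how unfiltered multi-mapping pairs can inflate a single cell, here is a small Python sketch that bins read pairs at 40 kb with and without a mapping-quality cutoff. The record layout, the MAPQ values, and the cutoff of 30 are assumptions for the example, and the reads are invented (the coordinates were simply chosen to fall in the bins mentioned earlier).

```python
from collections import Counter

BIN = 40_000  # 40 kb resolution

# Hypothetical records: (chrom1, pos1, mapq1, chrom2, pos2, mapq2).
pairs = [
    ("chr1", 207_530_000, 0, "chr3", 144_610_000, 0),
    ("chr1", 207_531_000, 0, "chr3", 144_612_000, 3),
    ("chr1", 207_540_000, 42, "chr3", 144_620_000, 37),
    ("chr1", 10_000_000, 60, "chr3", 20_000_000, 60),
]

def bin_counts(pairs, min_mapq=0):
    """Count pairs per (bin1, bin2) cell, skipping pairs below `min_mapq`."""
    counts = Counter()
    for c1, p1, q1, c2, p2, q2 in pairs:
        if min(q1, q2) < min_mapq:
            continue  # drop pairs where either end maps ambiguously
        counts[(c1, p1 // BIN * BIN, c2, p2 // BIN * BIN)] += 1
    return counts

cell = ("chr1", 207_520_000, "chr3", 144_600_000)
print(bin_counts(pairs)[cell], bin_counts(pairs, min_mapq=30)[cell])
```

In this toy data the cell drops from 3 contacts to 1 once low-MAPQ pairs are removed, while the confidently mapped pair elsewhere is unaffected.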

ay-lab commented 6 years ago

Thanks for the file and for figuring out the mapping problem; it is an extremely important step to consider. There is a changepoint statistical test in the translocationFind.r script to avoid such instances, but we found that some translocations are made up of multiple breakpoints of different sizes (e.g. the NCIH460 translocation in the paper), and by applying too strict a condition we tend to lose such known translocations. But I agree this is a concern, and thanks for raising it. I will have a look at it and try to find an optimal setting.

abhijitcbio commented 6 years ago

FYI, I have updated the package to avoid this noise.