ay-lab / dcHiC

dcHiC: Differential compartment analysis for Hi-C datasets
MIT License
57 stars 10 forks source link

FitHiC index error - ExpCC not calculated #30

Closed cvahlensieck closed 2 years ago

cvahlensieck commented 2 years ago

Hi, Thank you for supplying such an excellent tool!

When calling fithic with Rscript /srv/jh_users/cvahlensieck/dcHiC/dchicf.r --file input_zgf.txt --pcatype fithic --dirovwt T --diffdir diff_zgf_100k --fithicpath "/opt/jupyterhub/lib/python3.9/site-packages/fithic/fithic.py" --pythonpath "/opt/jupyterhub/bin/python", the script crashes in line 1611:

Error in[.data.table(mat_rep[[j]][.(ids_rep)], , 12) : Item 1 of j is 12 which is outside the column number range [1,ncol=11] Calls: fithicformat -> unlist -> [ -> [.data.table Execution halted

I could trace back the error to an output file of FitHiC, FitHiC.spline_pass1.res100000.significances.txt.gz. This file only contains 9 instead of 10 columns, lacking the ExpCC column. This is why the resulting table in the function fithicformat in dchicf.r only contains 11 columns, which leads to the error described above. Additionally, bias1 and bias2 in the FitHiC output file are always 1. Therefore I am not sure if this error emerges from a programming bug or if it is an issue with my data.

Thanks! Christian

ay-lab commented 2 years ago

Hi Christian,

Thanks for using the tools. Do you see the following files? DifferentialResult/diff_zgf_100k/fithic_run/<_fithic>/fragments.txt.gz DifferentialResult/diff_zgf_100k/fithic_run/<_fithic>/interactions.txt.gz

A good interactions.txt.gz file will look like the following chr1 3050000 chr1 3050000 4457 chr1 3050000 chr1 3150000 2216 chr1 3050000 chr1 3250000 1595 chr1 3050000 chr1 3350000 1310 chr1 3050000 chr1 3450000 1277 chr1 3050000 chr1 3550000 924 . . . . . . .

And a fragments.txt.gz file as chr1 0 50000 0 1 chr1 0 150000 0 1 chr1 0 250000 0 1 chr1 0 350000 0 1 chr1 0 450000 0 1

The fourth column of this file represents the total coverage. If these two files are generated properly then you should be able to run the fithic. You can run the fithic by using these two files using the following command

<path/python> <path/fithic.py> -i interactions.txt.gz -f fragments.txt.gz -l -o

Let me know how it goes.

cvahlensieck commented 2 years ago

Hi, Thanks! I can see the files. They look as described (except for using the "1" instead of "chr1" notation). attaching the first few rows:

fragments.txt.gz 1 0 50000 895 1 1 0 150000 484 1 1 0 250000 472 1

interactions.txt.gz 1 50000 1 50000 862 1 50000 1 150000 28 1 50000 1 250000 1 1 50000 1 450000 1 1 50000 1 1950000 1

Calling fithic gave me the following log: fithic -i interactions.txt.gz -f fragments.txt.gz -o test -r 100000

GIVEN FIT-HI-C ARGUMENTS
=========================
Reading fragments file from: fragments.txt.gz
Reading interactions file from: interactions.txt.gz
Output path created test
Fixed size option detected... Fast version of FitHiC will be used
Resolution is 100.0 kb
No bias file
The number of spline passes is 1
The number of bins is 100
The number of reads required to consider an interaction is 1
The name of the library for outputted files will be FitHiC
Upper Distance threshold is inf
Lower Distance threshold is 0
Only intra-chromosomal regions will be analyzed
Lower bound of bias values is 0.5
Upper bound of bias values is 2
All arguments processed. Running FitHiC now...
=========================

Reading the contact counts file to generate bins...
Interactions file read. Time took 10.549249649047852
Fragments file read. Time took 0.06877827644348145
Writing test/FitHiC.fithic_pass1.res100000.txt
Spline fit Pass 1 starting...
Outlier threshold is... 4.945155505607287e-08
Writing p-values and q-values to file test/FitHiC.spline_pass1.significances.txt
Number of outliers is... 58555
Spline fit Pass 1 completed. Time took 83.47774505615234
=========================
Fit-Hi-C completed successfully

However, FitHiC.spline_pass1.res100000.significances.txt still misses the ExpCC column. Attaching a few representative rows.

chr1 fragmentMid1 chr2 fragmentMid2 contactCount p-value q-value bias1 bias2 1 1050000 1 1150000 532 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1 1050000 1 1250000 347 7.632061e-02 2.726172e-01 1.000000e+00 1.000000e+00 1 1050000 1 1350000 210 2.669117e-01 7.619868e-01 1.000000e+00 1.000000e+00

When using this output for dcHiC fithic, it throws an out of index error since the last column misses. At least that's what I suppose.

ay-lab commented 2 years ago

Does your bed files also contain integer values as chromosome names instead of "chr1", "chr2" names? dcHiC assumes the chromosome names as character values by default and this may cause an issue in the dchic and/or also in fithic script. Can you please replace your chromosome names from integer to character values in the input files and run the fithic again?

cvahlensieck commented 2 years ago

I run fithic again, now with "chr1" notation in all input files, still the FitHiC.spline_pass1 does not contain the ExpCC column.

chr1 fragmentMid1 chr2 fragmentMid2 contactCount p-value q-value bias1 bias2 chr1 2250000 chr1 6450000 14 2.064885e-02 1.000000e+00 1.000000e+00 1.000000e+00 chr1 2250000 chr1 6550000 20 7.234523e-05 1.412170e-02 1.000000e+00 1.000000e+00 chr1 2250000 chr1 6650000 14 1.380949e-02 1.000000e+00 1.000000e+00 1.000000e+00

cvahlensieck commented 2 years ago

I could solve the problem: the pip version of FitHiC produces the behavior described above. Using the version from github works as expected and produces the ExpCC column.

This has also been reported here: https://groups.google.com/g/fithic/c/FzKEuPuB7AE

ay-lab commented 2 years ago

Ah! thanks for sharing the link. I will then close this issue as resolved. Please feel free to raise any further queries.

aryakaul commented 2 years ago

Thanks for bring it to our attention, the github version/bioconda version & pip version should now be in alignment.