deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

Input matrix in tsv format #749

Closed: nyuhic closed this issue 3 years ago

nyuhic commented 3 years ago

Hello,

This is not really an issue but a question specific to my data so I would appreciate any help.

I have some 50 kb matrices in tsv format where the first column is the chromosome, the second column is the start of the first interacting bin, the third column is the start of the second interacting bin, and the fourth column is the number of interactions, as shown below:

chr1 3000000 3000000 1810
chr1 3000000 3050000 434
chr1 3000000 3100000 367
chr1 3000000 3150000 179

I don't think any of the provided conversion scripts can convert this into a format that works with HiCExplorer. Is there any way I can manipulate this matrix into a format that can eventually be used with HiCExplorer? I want to do PCA analysis with hicPCA.

Would appreciate any tips.

Thanks

joachimwolff commented 3 years ago

Hi,

Something like Excel could help you (or any other way to manipulate this file, e.g., pandas). The supported input format 2D-text needs:

chr1 start1 end1 chr2 start2 end2 value

Additionally, you need to define the resolution and the chromosome sizes to make the import work. You can export to cool or h5 for use with HiCExplorer, or to homer, ginteractions, or hicpro if you need one of these formats for a third-party tool. You know the resolution; therefore, you could add columns (e.g., in Excel) to get the following:

chr1 3000000 3050000 chr1 3000000 3050000 1810
chr1 3000000 3050000 chr1 3050000 3100000 434
chr1 3000000 3050000 chr1 3100000 3150000 367
chr1 3000000 3050000 chr1 3150000 3200000 179
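For anyone who prefers a script over a spreadsheet, the column expansion described above can be sketched in plain Python. This is only a minimal sketch: the function name `to_2d_text` is made up for illustration, the input is assumed to be whitespace-separated with exactly four columns, and the same chromosome is written twice because the input has a single chromosome column (cis interactions only).

```python
import io


def to_2d_text(lines, resolution=50000):
    """Yield 7-column 2D-text rows from 4-column rows.

    Input columns (assumed): chromosome, start of bin 1, start of bin 2, count.
    Output columns: chr1, start1, end1, chr2, start2, end2, value.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, start1, start2, count = fields
        start1, start2 = int(start1), int(start2)
        yield "\t".join([
            chrom, str(start1), str(start1 + resolution),
            chrom, str(start2), str(start2 + resolution),  # same chromosome: cis only
            count,
        ])


# Two rows taken from the question, used here as in-memory example input.
example = io.StringIO("chr1 3000000 3000000 1810\nchr1 3000000 3050000 434\n")
for row in to_2d_text(example):
    print(row)
```

To convert a real file, pass an open file handle instead of the `io.StringIO` example and write the yielded rows to the output file.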

Best,

Joachim

lldelisle commented 3 years ago

You can use awk:

awk -v OFS="\t" -v res=50000 '{print $1,$2,$2 + res, $3, $3 + res, $4}' input_file.txt > output_2d.txt

nyuhic commented 3 years ago

Thanks a lot!

I didn't realize that support for 2D-text files had been added in the recent release.

nyuhic commented 3 years ago

I used the awk command above to convert my matrices to the required 2D-text format. The matrix looks fine to me:

chr1    3000000 3050000 3000000 3050000 1810
chr1    3000000 3050000 3050000 3100000 434
chr1    3000000 3050000 3100000 3150000 367
chr1    3000000 3050000 3150000 3200000 179

I then ran the conversion command as follows (the chromosome sizes file is from the recommended UCSC website):

hicConvertFormat -m matrices/rep1_raw_50kb.chr --inputFormat 2D-text --outputFormat h5 --resolutions 50000 --chromosomeSizes mm10.chrom.sizes -o rep1.h5

It gives me the following error:

Traceback (most recent call last):
  File "/scratch/ucsc_conda/hicexplorer/bin/hicConvertFormat", line 7, in <module>
    main()
  File "/scratch/ucsc_conda/hicexplorer/lib/python3.8/site-packages/hicexplorer/hicConvertFormat.py", line 209, in main
    value = float(line_split[6])
IndexError: list index out of range

Any ideas what might be going on?

Thanks

lldelisle commented 3 years ago

Sorry, wrong format. You need to specify the chromosome twice because trans (inter-chromosomal) interactions are supported:

awk -v OFS="\t" -v res=50000 '{print $1,$2,$2 + res, $1, $3, $3 + res, $4}' input_file.txt > output_2d.txt
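The traceback above fails on `line_split[6]`, i.e. on reading a seventh field, which a six-column row does not have. A quick field-count check before running the conversion can catch this early. The sketch below is only an illustration: `find_bad_lines` is a hypothetical helper, not part of HiCExplorer, and the assumption that every row needs exactly seven whitespace-separated fields is inferred from the format description and the traceback in this thread.

```python
def find_bad_lines(lines, expected_fields=7):
    """Return (line_number, field_count) for non-empty rows with the wrong field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if fields and len(fields) != expected_fields:
            bad.append((number, len(fields)))
    return bad


# A row from the first awk command (six fields) and one from the corrected command (seven).
six_column = ["chr1\t3000000\t3050000\t3000000\t3050000\t1810"]
seven_column = ["chr1\t3000000\t3050000\tchr1\t3000000\t3050000\t1810"]

print(find_bad_lines(six_column))    # [(1, 6)]
print(find_bad_lines(seven_column))  # []
```

An empty list means every row has the expected number of fields; it does not check that coordinates fall within the chromosome sizes.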