bxlab / hifive

Tools for handling HiC and 5C data
MIT License

Unable to load txt matrices (interaction matrices) obtained from Homer #18

Closed pubudumanoj closed 4 years ago

pubudumanoj commented 4 years ago

I tried to create a hic-data object using the following commands:

hifive fends -B fend.bed --binned=50000 out_filename.txt

When I run hifive hic-data -X "test_*.MatA" out_filename.txt output.hic.data I get "Done 0 cis reads, 0 trans reads" in the command line output, and I cannot use the resulting file (which is 6.4 kB in size) for the next steps.

I obtained interaction matrices from Homer and modified them according to the explanation in the documentation. The structure of one test_*.MatA file is:

chr1:0-50000 chr1:50000-100000 chr1:100000-150000 chr1:150000-200000
chr1:0-50000 7 0 0 0
chr1:50000-100000 0 11 0 0
chr1:100000-150000 0 0 28 0
chr1:150000-200000 0 0 0 0

The headers are column and row names, and the file is a TSV file.

Can you please guide me on how to resolve this issue, or tell me whether something is wrong with my matrix structure?

msauria commented 4 years ago

At first glance I'm not sure what the problem is. I've used your example matrix file and it works for me. There are two things to pay attention to.

1) Do you get a message saying that it is loading your matrix file(s)? This is unfortunately a little tricky to assess, as the messages overwrite each other as the program steps through each phase of the dataset creation. However, if you redirect stderr into a file with hifive hic-data -X "test_*.MatA" out_filename.txt output.hic.data 2> hifive.log, then you can look at the log file and see whether each file name appears. If not, then there is an issue with the automatic recognition of the file names, which brings me to the second thing to check.

2) Do the chromosome names in your matrix file names match those in your fend.bed file? If your fend.bed chromosome names are 'chr1', 'chr2', etc., then your matrix file names need to be test_chr1.mat, test_chr2.mat, test_chr1_bychr2.mat, and the argument you pass should be 'test*.mat'.

If this isn't the issue, let me know and we can keep digging.
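The second check can be scripted. The following is only a rough sketch, assuming the naming convention described above (prefix, chromosome name, optional `_by` plus a second chromosome, then `.mat`). The helper names `chroms_from_bed` and `unmatched_matrix_files` are hypothetical and not part of hifive, and the way hifive itself parses file names is not shown here.

```python
def chroms_from_bed(bed_lines):
    """Collect chromosome names from the first column of BED lines."""
    chroms = set()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chroms.add(line.split()[0])
    return chroms


def unmatched_matrix_files(filenames, chroms, prefix="test_", suffix=".mat"):
    """Return the file names that do not follow the assumed convention.

    Assumed convention (taken from the reply above, not from hifive's code):
    <prefix><chrom><suffix> for cis and <prefix><chrom>_by<chrom><suffix> for trans.
    """
    bad = []
    for name in filenames:
        if not (name.startswith(prefix) and name.endswith(suffix)):
            bad.append(name)
            continue
        core = name[len(prefix):-len(suffix)]
        parts = core.split("_by")
        if len(parts) > 2 or any(part not in chroms for part in parts):
            bad.append(name)
    return bad
```

For example, with chromosomes chr1 and chr2 in the BED file, a file named test_1.mat or test_chr1.MatA would be reported, while test_chr1.mat and test_chr1_bychr2.mat would pass.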

pubudumanoj commented 4 years ago

I renamed the files as you suggested and re-ran the command. I also removed all the column names and row names from the matrix file (similar to the test data set), so the matrix is a square matrix (number of rows = number of columns). Now I get this error:

hifive hic-data -X "test_*.mat" binned.fends output.hic.data

Traceback (most recent call last):
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/hifive", line 849, in <module>
    main()
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/hifive", line 69, in main
    run(args)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/commands/create_hic_dataset.py", line 15, in run
    data.load_binned_data_from_matrices(args.fend, args.matrix, format=None)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/hic_data.py", line 819, in load_binned_data_from_matrices
    cis_counts, trans_counts = self._load_txt_matrices(data, filename, chroms, bins, bin_indices)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/hic_data.py", line 943, in _load_txt_matrices
    temp = row_labels[i].split('|')[-1].split(':')[1].split('-')
IndexError: list index out of range

hifive hic-data -X "test_*.mat" binned.fends output.hic.data 2> log.txt

the produced log file is attached log.txt

When I used a matrix file with column names and row names (as explained in the question) I get this error

Traceback (most recent call last):
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/hifive", line 849, in <module>
    main()
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/hifive", line 69, in main
    run(args)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/commands/create_hic_dataset.py", line 15, in run
    data.load_binned_data_from_matrices(args.fend, args.matrix, format=None)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/hic_data.py", line 819, in load_binned_data_from_matrices
    cis_counts, trans_counts = self._load_txt_matrices(data, filename, chroms, bins, bin_indices)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/hic_data.py", line 944, in _load_txt_matrices
    row_labels[i] = (int(temp[0]) + int(temp[1])) / 2
ValueError: invalid literal for int() with base 10: '50000 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 '

msauria commented 4 years ago

Okay, this is good progress. The new issue is that the file is actually space-separated, not tab-separated. If you replace the spaces with tabs, I think everything should work.
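One possible way to do the conversion is sketched below. It assumes every row is a run of whitespace-separated fields with no empty cells (an empty corner cell in the header row would be lost), and the function names `spaces_to_tabs` and `convert_file` are made up for this illustration.

```python
def spaces_to_tabs(lines):
    """Rewrite whitespace-separated rows as tab-separated rows.

    Blank lines are dropped, and leading or trailing whitespace is removed.
    """
    return ["\t".join(line.split()) for line in lines if line.strip()]


def convert_file(src_path, dst_path):
    """Read a space-separated matrix file and write a tab-separated copy."""
    with open(src_path) as src:
        rows = spaces_to_tabs(src)
    with open(dst_path, "w") as dst:
        dst.write("\n".join(rows) + "\n")
```

Writing to a new file rather than overwriting the original keeps the Homer output available in case the conversion needs to be repeated.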

pubudumanoj commented 4 years ago

Now I get this error:

Traceback (most recent call last):
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/hifive", line 849, in <module>
    main()
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/hifive", line 69, in main
    run(args)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/commands/create_hic_dataset.py", line 15, in run
    data.load_binned_data_from_matrices(args.fend, args.matrix, format=None)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/hic_data.py", line 819, in load_binned_data_from_matrices
    cis_counts, trans_counts = self._load_txt_matrices(data, filename, chroms, bins, bin_indices)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/hic_data.py", line 921, in _load_txt_matrices
    tempdata = numpy.array(tempdata, dtype=numpy.int32)
ValueError: setting an array element with a sequence.

log file is attached log.txt

I think it will be easy if I attach my matrix file. Because the issue should be in that file. Please check the attached matrix file https://drive.google.com/file/d/1nv_4yqpF-sLrWqXGEXJNcas0b3_bSKJv/view?usp=sharing

Thank you for helping

msauria commented 4 years ago

This was really only intended for loading raw data so it is expecting integer values. The decimals are causing issues in loading the data. One option, since it looks like there are only integers and X.5 values, would be to simply double all of the counts. Also, I would suggest keeping the column and row labels or double checking that the number of rows is equal to the number of bins produced with your chromosome length/bed file and bin size.
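As an illustration of the doubling idea, here is a small sketch that scales a tab-separated matrix and refuses to continue if any scaled value is still not a whole number. It is not part of hifive. The function name `scale_matrix` and the `has_labels` switch are assumptions for this example, and `Decimal` is used so that factors such as 10 are not affected by floating-point rounding.

```python
from decimal import Decimal


def scale_matrix(lines, factor=2, has_labels=True):
    """Multiply every count in a tab-separated matrix by `factor`.

    With has_labels=True the first line (column labels) and the first
    field of each later line (row label) are copied unchanged.
    Raises ValueError if a scaled count is not an integer.
    """
    scaled_lines = []
    for n, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        if has_labels and n == 0:
            scaled_lines.append("\t".join(fields))  # header row of bin labels
            continue
        start = 1 if has_labels else 0
        values = []
        for field in fields[start:]:
            scaled = Decimal(field) * factor
            if scaled != scaled.to_integral_value():
                raise ValueError("%s * %s is not an integer" % (field, factor))
            values.append(str(int(scaled)))
        scaled_lines.append("\t".join(fields[:start] + values))
    return scaled_lines
```

If the ValueError is raised, the chosen factor is too small for the fractional values present in the file.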

pubudumanoj commented 4 years ago

Yes, it worked when I multiplied all values by 2. However, if there are fractional values other than 0.5, is it okay to multiply by 10? Does it affect the quality score?

msauria commented 4 years ago

It will have a minor effect on the quality scores. The scale of that impact is going to depend on how sparse your data are. The number of empty bins (which will be different for each resolution) will be roughly proportional to the error introduced into the quality score.

pubudumanoj commented 4 years ago

If we use the same resolution (bin size) for all the matrices, then the error introduced by the multiplication would be the same, right?

msauria commented 4 years ago

If the samples are of similar sequencing depth, then the error should be very similar. There is a step that involves a pseudo count addition so the magnitude of the error is going to be influenced by the number of zero bins. However, now that I'm thinking about it carefully, the larger the scaling factor, the smaller the influence of the pseudo count, so I think multiplying by 10 should be fine.
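To make that last point concrete, here is a toy calculation. It is not hifive's actual quality-score formula, which is not shown in this thread. It simply assumes a pseudo count of 1 is added to each scaled bin count, and the function name `pseudocount_share` is hypothetical.

```python
def pseudocount_share(count, factor, pseudocount=1.0):
    """Fraction of (factor * count + pseudocount) contributed by the pseudo count.

    Toy model only: assumes the pseudo count is added after scaling.
    """
    return pseudocount / (factor * count + pseudocount)


# A bin with 7 reads: the pseudo count matters less as the factor grows.
for factor in (1, 2, 10):
    print(factor, pseudocount_share(7, factor))

# A zero bin is all pseudo count regardless of the factor, which is why
# the number of empty bins still influences the size of the error.
print(pseudocount_share(0, 10))
```

Under this assumption, the share for the 7-read bin drops from 1/8 at a factor of 1 to 1/71 at a factor of 10, while zero bins are unaffected by the scaling.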

pubudumanoj commented 4 years ago

Okay, thank you. I will get back to you if I run into any other issues.

pubudumanoj commented 4 years ago

Sorry to bother you again. In the next step I got another error. When I run

mpirun -np 4 hifive hic-project output.hic.data output.hic.project

I get

Traceback (most recent call last):
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/hifive", line 849, in <module>
    main()
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/bin/hifive", line 72, in main
    run(args)
  File "/cvmfs/soft.mugqic/CentOS6/software/python/Python-2.7.14/lib/python2.7/site-packages/hifive/commands/create_hic_project.py", line 15, in run
    comm = MPI.COMM_WORLD
NameError: global name 'MPI' is not defined

Do you have any idea what would be the reason for this?

Thank you

msauria commented 4 years ago

This suggests that mpi4py is not installed.

pubudumanoj commented 4 years ago

mpi4py was already installed. I checked with the test code specified in the documentation, mpirun -np 5 python helloworld.py, and it works fine. I tried several things but still get the same error.

pubudumanoj commented 4 years ago

I think I found a clue. Running from mpi4py import MPI fails with:

Traceback (most recent call last):
  File "", line 1, in <module>
ImportError: libmpi.so.20: cannot open shared object file: No such file or directory

I will try to resolve this and it will probably resolve the issue. Thank you
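For anyone hitting the same thing, a quick check along these lines shows whether the MPI bindings can actually be loaded by the Python interpreter that runs hifive. It is a minimal sketch: `mpi_status` is a made-up helper name, while `from mpi4py import MPI` and `MPI.COMM_WORLD.Get_size()` are standard mpi4py calls.

```python
def mpi_status():
    """Return (available, message) describing whether mpi4py's MPI module loads."""
    try:
        # This import fails with ImportError when the package is missing or
        # when the shared MPI library (for example libmpi.so) cannot be found.
        from mpi4py import MPI
    except ImportError as err:
        return False, "mpi4py could not be imported: %s" % err
    return True, "MPI available, world size %d" % MPI.COMM_WORLD.Get_size()


if __name__ == "__main__":
    available, message = mpi_status()
    print(message)
```

Running this with the same python that hifive uses should reproduce the libmpi.so.20 message if the shared library is not on the loader path.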

pubudumanoj commented 4 years ago

I uninstalled mpi4py and now it's working fine with sequential processing.