deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0
233 stars 70 forks source link

hicBuildMatrix complains about read order #673

Closed cgirardot closed 3 years ago

cgirardot commented 3 years ago

Dear, I am using galaxy and therefore the version 3.4.3 of hicexplorer. My problem would be solved with the hicexplorer 3.6 but all my workflows are in Galaxy so I am stuck. Here is the problem, maybe you can help.

I am mainly using binned matrices but I always build the restriction enzyme matrix as well. In particular the QC is more accurate as hicBuildMatrix running in this mode detects and filters more reads ie

With rest site file:

Screenshot 2021-02-10 at 18 53 17

Bining (no rest site):

Screenshot 2021-02-10 at 18 52 05

I am therefore interested to build a bin matrix with hicBuildMatrix but only using the reads that are deemed "good" from "hicBuildMatrix with restr. site file". For this, I thought I could use the "valid read" BAM files created by hicBuildMatrix.

The WF would look like this:

mapping -> F&R qname sorted BAM -> hicBuildMatrix (with restr sites) -> "valid read" BAM file -> split by F | R into 2 bam files -> name sort -> hicBuildMatrix into bin matrix

Unfortunately, the last step keep failing with

INFO:hicexplorer.hicBuildMatrix:reading /g/funcgen/galaxy-production/database/files/000/411/dataset_411301.dat and /g/funcgen/galaxy-production/database/files/000/411/dataset_411302.dat to build hic_matrix

INFO:hicexplorer.hicBuildMatrix:dangling sequences to check are {'pat_forw': 'GATC', 'pat_rev': 'GATC'}

Traceback (most recent call last):
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/mulled-v1-34b5a02e84ed6e4ea683a4aaa554b2adbec36373e59d7ab5485292263fde6672/bin/hicBuildMatrix", line 7, in <module>
    main()
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/mulled-v1-34b5a02e84ed6e4ea683a4aaa554b2adbec36373e59d7ab5485292263fde6672/lib/python3.7/site-packages/hicexplorer/hicBuildMatrix.py", line 1229, in main
    pMinMappingQuality=args.minMappingQuality
  File "/g/funcgen/galaxy-production/database/dependencies/_conda/envs/mulled-v1-34b5a02e84ed6e4ea683a4aaa554b2adbec36373e59d7ab5485292263fde6672/lib/python3.7/site-packages/hicexplorer/hicBuildMatrix.py", line 692, in readBamFiles
    "the --reorder option".format(mate1.qname, mate2.qname)
AssertionError: FATAL ERROR NB501598:1014:HH5TLBGXF:1:11101:10002:14630 NB501598:1014:HH5TLBGXF:1:11101:10002:2140 Be sure that the sam files have the same read order If using Bowtie2 or Hisat2 add the --reorder option

I am very confused as the BAM files I recreate from the "valid read" BAM file look OK. I made sure to only keep reads that are

Do you have an idea what could happen here? Have you ever tried a WF like this? Thx

joachimwolff commented 3 years ago

This is actually not a problem with the version of HiCExplorer, but as it seems from the Galaxy wrapper. I noticed that since version 1.8 the wrapper mutually excludes the bin size option and the restriction cut site file, which should not be done. Anyway, the really old version 1.7 seems to have not the mutual exclusion, maybe it is worth giving it a try with this version? However, the QC report might be missing, I am not sure, this is a HiCExplorer and Galaxy wrapper version before I started to work on this project and I would need to check the source code first for any statement.

However, your WF: It could be that a) the second sorting is causing an issue and b) that the output bam file is not having the datatype 'qname_sorted', if this is the case Galaxy sorts by default the file and causes this crash.

But yeah, I need to update the wrappers to the latest HiCExplorer version, but don't have that much time for it at the moment. Maybe I will try to fix it in between other stuff here and then.

cgirardot commented 3 years ago

This is interesting, as a quick fix I can also amend the wrapper locally and pass the restriction cut site file (after a command line test to make sure this works as expected).

I really don't see why the second sorting would cause an issue but I'll check the files manually as the error suggests files are not correctly sorted. As for (b), changing the type does not solve it (manual launch of the job).

What is supposed to be in the valid reads BAM ? Only mapped read pairs I assume ?

joachimwolff commented 3 years ago

(b), changing the type does not solve it (manual launch of the job).

The problem is then that the output is defined as BAM and then Galaxy sorts it by default. To change the format after the output is placed, will not change this.

cgirardot commented 3 years ago

sorry I am being unclear. So far I am testing this and run the steps manually (no WF) so the result of the qname sorting using the Picard tool is qname sorted (it is actually SAM, pls see picture below). Then I manually execute hicBuildMatrix which fails complaining about unsorted reads; which I can only conceive if my BAMs (that I extract from the valid reads BAM) would contain different set of reads (which is why I set the extra filters above to make sure I only extract read pairs with both end mapped).

Screenshot 2021-02-11 at 10 27 34
cgirardot commented 3 years ago

regarding your comment

This is actually not a problem with the version of HiCExplorer, but as it seems from the Galaxy wrapper. I noticed that since version 1.8 the wrapper mutually excludes the bin size option and the restriction cut site file, which should not be done

I read from hicBuildMatrix 3.4.3 help that it is either binSize or restrictionCutFile:

--restrictionCutFile BED file, -rs BED file
                        BED file with all restriction cut places (output of
                        "findRestSite" command). Should contain only mappable
                        restriction sites. If given, the bins are set to match
                        the restriction fragments (i.e. the region between one
                        restriction site and the next). Alternatively, a fixed
                        binSize can be defined instead. However, either
                        binSize or restrictionCutFile must be defined.
                        (default: None)
cgirardot commented 3 years ago

and indeed the command line does not accept it :

hicBuildMatrix: error: argument --binSize/-bs: not allowed with argument --restrictionCutFile/-rs
joachimwolff commented 3 years ago

Alright, so you really need HiCExplorer version 3.6. I will try to find some time on the weekend to update the wrappers, but I don't know how long it might take until the Galaxy project accepts them + how long it will take it is installed on hicexplorer.usegalaxy.eu.

Best,

Joachim

cgirardot commented 3 years ago

that would be awesome, thank you. Note that we have our own local server so as soon as this is in the toolshed, we can deploy.

cgirardot commented 3 years ago

@joachimwolff I still wanted to understand why my approach is not working and there is something not adding up here.

I have checked the order of my 2 sam files with

paste <(awk '{print $1}' fwd_qnamesorted.sam) <(awk '{print $1}' rev_qnamesorted.sam) | awk '$1 != $2{print NR}'

and no difference is reported!

So I executed hicBuildMatrix command line, which errored with:

AssertionError: FATAL ERROR NB501598:1014:HH5TLBGXF:1:11101:10002:14630 NB501598:1014:HH5TLBGXF:1:11101:10002:2140 Be...```

and located the row number for NB501598:1014:HH5TLBGXF:1:11101:10002:14630, this is actually right at the beginning as shown below (fifth read after last header):

(skipping header)
@PG ID:bwa  PN:bwa
NB501598:1014:HH5TLBGXF:1:11101:10000:1294  81  chr3L
NB501598:1014:HH5TLBGXF:1:11101:10001:13024 65  chr2R
NB501598:1014:HH5TLBGXF:1:11101:10001:16048 65  chr2L
NB501598:1014:HH5TLBGXF:1:11101:10001:17608 65  chr2R
NB501598:1014:HH5TLBGXF:1:11101:10002:14630 81  chr4
NB501598:1014:HH5TLBGXF:1:11101:10002:2140  65  chr3L

and from mate

(skipping header)
@PG ID:bwa  PN:bwa
NB501598:1014:HH5TLBGXF:1:11101:10000:1294  145 chr2R
NB501598:1014:HH5TLBGXF:1:11101:10001:13024 145 chr2R
NB501598:1014:HH5TLBGXF:1:11101:10001:16048 145 chr2L
NB501598:1014:HH5TLBGXF:1:11101:10001:17608 145 chr2R
NB501598:1014:HH5TLBGXF:1:11101:10002:14630 129 chr2R
NB501598:1014:HH5TLBGXF:1:11101:10002:2140  129 chrX

the reads are perfectly aligned to me! And I don't get why hicBuildMatric complains :-/

joachimwolff commented 3 years ago

The updated wrappers as promised: https://github.com/galaxyproject/tools-iuc/pull/3481

Sorry, I did not have the time to go into detail about your workaround issue. I hope I can find some time next week for it.

Best,

Joachim

cgirardot commented 3 years ago

Thank you!

joachimwolff commented 3 years ago

I am really sorry the PR on tools-iuc is still not merged. I don't know why the people care less about merging passing PRs, while other PRs seem to be accepted quite fast. Maybe you know someone with merge rights?

cgirardot commented 3 years ago

I know Bjorn but you pung him twice already.

bgruening commented 3 years ago

I will have time over the weekend. But @cgirardot please feel free to have a look at the PR from @joachimwolff and give feedback.

cgirardot commented 3 years ago

I don't have much experience with writing wrappers for galaxy (or very simple ones). I unfortunately don't think I can really judge by reading the PR.