galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License

hicBuildMatrix cannot take space-separated list for Sequence of the restriction site or Dangling sequence #6505

Open bskubi opened 2 weeks ago

bskubi commented 2 weeks ago

Describe the bug
I am using HiCExplorer Galaxy on https://hicexplorer.usegalaxy.eu. Using the hicBuildMatrix tool on Hi-C data prepped with two restriction enzymes, I need to enter both sequences for Sequence of the restriction site and Dangling sequence. HiCExplorer's documentation says it can handle space-separated lists for these arguments. However, when I enter a space-separated list on the Galaxy form while designing my workflow (prior to the run), the box turns red and says that non-numeric characters are not allowed, preventing me from inputting a space-separated list.

Galaxy Version and/or server at which you observed the bug
{"version_major":"24.1","version_minor":"2.dev0"}

Browser and Operating System
Operating System: Linux Jammy Jellyfish
Browser: Chrome

To Reproduce
Steps to reproduce the behavior:

  1. Go to hicBuildMatrix
  2. Enter "AAGCTT GATC" in the Sequence of the restriction site text box, or "AGCT GATC" in the Dangling sequence text box

Expected behavior
It should permit the above as a valid input to either of these boxes (the exact string content doesn't matter, as long as there's a space separating two strings).

Additional context
HiCExplorer's documentation states that space-separated lists are permitted, and I'm not able to run the workflow due to this bug, so I'm assuming the issue lies with Galaxy's input validation rather than with HiCExplorer.

bernt-matthias commented 2 weeks ago

Thanks for the report. Do you have a link to the docs at hand?

I guess a space-separated list of strings consisting of [atcgATCG] (and maybe N) should be fine, right?
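
As a rough illustration of the check proposed here (a sketch only, not the validator actually used in the tool wrapper), a space-separated list of nucleotide strings could be validated in Python like this. The function name `is_valid_sequence_list` is hypothetical, and accepting `N` is an assumption taken from the "maybe N" remark.

```python
import re

# Assumption: each sequence is one or more of A, C, G, T (either case),
# with N also accepted; sequences are separated by exactly one space.
_SEQUENCE_LIST = re.compile(r"[ACGTNacgtn]+(?: [ACGTNacgtn]+)*")


def is_valid_sequence_list(text: str) -> bool:
    """Return True if text is one or more nucleotide strings separated by single spaces."""
    return _SEQUENCE_LIST.fullmatch(text) is not None


print(is_valid_sequence_list("AAGCTT GATC"))  # True
print(is_valid_sequence_list("AAGCTT"))       # True
print(is_valid_sequence_list("AAGCTT,GATC"))  # False
```

A single-sequence input still passes, so a pattern like this would not break existing single-enzyme workflows.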

bskubi commented 1 week ago

Here's the relevant page from the docs.

See the first sentence of the --restrictionSequence and --danglingSequence arguments.

I didn't create the HiCExplorer tool, I'm just trying to benchmark against it, so unfortunately that is all the info I have!

bskubi commented 1 week ago

@bernt-matthias I would also note that the multi-bin feature on hicBuildMatrix appears to be broken, and possibly hicNormalize as well. Here's what I tried:

I specified multiple bin resolutions using a single hicBuildMatrix step (10kb, 20kb, 50kb, 100kb), expecting it to produce a single multi-res cooler file (.mcool) containing all 4 resolutions.

I then fed the output from this single hicBuildMatrix step into a hicNormalize step, expecting it to normalize all four resolutions to the 0-1 range.

Instead, the result was that hicBuildMatrix produced a single-resolution cooler file (I believe at 10kb resolution), and the hicNormalize function produced a 0-byte empty output.

I'm not sure if hicNormalize is broken or if it only failed because of the issue with hicBuildMatrix. I'm trying it another way, calling hicNormalize separately on each individual resolution. However, I currently can't figure out a way to produce a multi-res .mcool matrix using Galaxy HiCExplorer (I know how to make them using other tools, I'm just trying to figure out if it's currently possible on Galaxy HiCExplorer specifically).

bernt-matthias commented 1 week ago

The original problem should be fixed in https://github.com/galaxyproject/tools-iuc/pull/6519

For the other problem: Can you check the produced command line(s)? Maybe it's an upstream problem?

bernt-matthias commented 1 week ago

@bskubi please feel free to reopen if needed or open a new issue. Thanks again for the report.

bskubi commented 2 days ago

@bernt-matthias

The command line generated by hicBuildMatrix when trying to build multiple bin sizes is:

mkdir ./QCfolder && mkdir '/data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e_files' && hicBuildMatrix --samFiles '/data/dnb10/galaxy_db/files/a/2/4/dataset_a24ca246-f55d-40ed-8d2b-b3dabe741ff4.dat' '/data/dnb10/galaxy_db/files/c/3/c/dataset_c3c4f136-6c80-4e5e-b775-fd62f08e84ef.dat'  --restrictionCutFile '/data/dnb10/galaxy_db/files/d/a/1/dataset_da1a0a55-ebfa-4e1d-ad03-7828b4bec739.dat'  --restrictionSequence 'AAGCTT' --danglingSequence 'AGCT'  --binSize '10000' '20000' '50000' '100000'  --chromosomeSizes '/data/dnb10/galaxy_db/files/7/4/9/dataset_749e98a5-04c7-4ba4-a385-8309db2d3053.dat' --genomeAssembly 'hg38'   --outFileName 'matrix.cool'       --minMappingQuality 30  --threads ${GALAXY_SLOTS:-4}  --QCfolder ./QCfolder && mv ./QCfolder/* /data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e_files/ && mv '/data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e_files/hicQC.html' '/data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e.dat' && mv "/data/jwd02f/main/075/495/75495779/outputs/dataset_1765acd4-17e5-4a32-92eb-a8c566c6b93e_files"/*.log raw_qc && mv matrix.cool matrix

The relevant bit is --binSize '10000' '20000' '50000' '100000'

According to the documentation for hicBuildMatrix:

--binSize, -bs
Size in bp for the bins. The bin size depends on the depth of sequencing. Use a larger bin size for libraries sequenced with lower depth. If not given, matrices of restriction site resolution will be built. Optionally for mcool file format: Define multiple resolutions which are all a multiple of the first value. Example: --binSize 10000 20000 50000 will create a mcool file format containing the three defined resolutions.

So it seems like the --binSize argument is correctly formatted. I'm not sure whether this is an issue with hicBuildMatrix or something else. It does produce a usable single-resolution matrix at the smallest binSize, and I'm also unsure why that matrix is not being successfully normalized by the subsequent hicNormalize step.
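
The one constraint the quoted documentation places on multiple resolutions is that each is a multiple of the first value. A minimal Python sketch of that check, applied to the values from the command line above, is shown here; the function name `bin_sizes_are_compatible` is hypothetical and is not part of HiCExplorer.

```python
def bin_sizes_are_compatible(bin_sizes: list[int]) -> bool:
    """Return True if every bin size is a positive multiple of the first one."""
    if not bin_sizes:
        return False
    first = bin_sizes[0]
    # Guard against zero or negative sizes before taking the modulus.
    return first > 0 and all(b > 0 and b % first == 0 for b in bin_sizes)


# Values passed via --binSize in the generated command line.
print(bin_sizes_are_compatible([10000, 20000, 50000, 100000]))  # True
# A counterexample: 25000 is not a multiple of 10000.
print(bin_sizes_are_compatible([10000, 25000]))  # False
```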

bskubi commented 1 day ago

@bernt-matthias

The attempt to build a multi-resolution .mcool file appears to fail because Galaxy hardcodes the output file name with a .cool extension rather than a .mcool extension. The hicBuildMatrix command-line utility, however, infers whether to build a .cool or .mcool file from this extension. So even if multiple --binSize values are passed, only a .cool file will be built.

For this understanding, I am referring to the documentation here.

hicBuildMatrix supports building multicooler matrices which are for example needed for visualization with HiGlass. To do so, use as out file format either .cool or .mcool and define the desired resolutions as --binSize.

A potential solution would be to add an additional output file type '.mcool' in addition to '.cool' and '.h5' and select the filename extension accordingly.
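
The proposal can be sketched in Python as follows. This only illustrates the idea of deriving the file name from the selected output type; the real tool wrapper is not written this way, and the names `EXTENSIONS` and `output_file_name` are hypothetical.

```python
# Assumption: the three output types named in this thread, keyed by a
# user-facing format choice.
EXTENSIONS = {"cool": ".cool", "mcool": ".mcool", "h5": ".h5"}


def output_file_name(output_format: str, stem: str = "matrix") -> str:
    """Build the --outFileName value from the selected output type."""
    if output_format not in EXTENSIONS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return stem + EXTENSIONS[output_format]


print(output_file_name("cool"))   # matrix.cool
print(output_file_name("mcool"))  # matrix.mcool
```

With the name chosen this way, a multi-resolution request would pass `matrix.mcool` rather than the hardcoded `matrix.cool` seen in the generated command line.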

bernt-matthias commented 1 day ago

Seems to be the easiest solution (maybe plus some docs).