deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

Missing options for hicBuildMatrix #351

Closed: abhisheksinghnl closed this issue 5 years ago

abhisheksinghnl commented 5 years ago

Hi,

I have installed HiCExplorer using conda. The installed version is hicexplorer 2.2.1.

However, when I look at the options of hicBuildMatrix using --help, I see the following:

$ hicBuildMatrix --help
usage: hicBuildMatrix [-h] --samFiles two sam files two sam files --outBam bam
                      file (--binSize BINSIZE | --restrictionCutFile BED file)
                      [--fragmentLength FRAGMENTLENGTH]
                      [--minDistance MINDISTANCE] [--maxDistance MAXDISTANCE]
                      [--restrictionSequence RESTRICTIONSEQUENCE]
                      --outFileName FILENAME [--region CHR:START-END]
                      [--removeSelfCircles]
                      [--minMappingQuality MINMAPPINGQUALITY] [--doTestRun]
                      [--version]

The QC folder and threads options are missing.

Could anyone please point out what is going wrong here?

Thank you.

joachimwolff commented 5 years ago

Hi,

the help text should be:

hicBuildMatrix --help
usage: hicBuildMatrix --samFiles two sam files two sam files --outFileName
                      FILENAME --QCfolder FOLDER [--outBam bam file]
                      (--binSize BINSIZE [BINSIZE ...] | --restrictionCutFile BED file)
                      [--minDistance MINDISTANCE] [--maxDistance MAXDISTANCE]
                      [--maxLibraryInsertSize MAXLIBRARYINSERTSIZE]
                      [--restrictionSequence RESTRICTIONSEQUENCE]
                      [--danglingSequence DANGLINGSEQUENCE]
                      [--region CHR:START-END] [--keepSelfCircles]
                      [--minMappingQuality MINMAPPINGQUALITY]
                      [--threads THREADS] [--inputBufferSize INPUTBUFFERSIZE]
                      [--doTestRun] [--skipDuplicationCheck] [--help]
                      [--version]

Using an alignment from a program that supports local alignment (eg. Bowtie2)
where both PE reads are mapped using the --local option, this program reads
such file and creates a matrix of interactions.

Required arguments:
  --samFiles two sam files two sam files, -s two sam files two sam files
                        The two PE alignment sam files to process (default:
                        None)
  --outFileName FILENAME, -o FILENAME
                        Output file name for the Hi-C matrix. (default: None)
  --QCfolder FOLDER     Path of folder to save the quality control data for
                        the matrix. The log files produced this way can be
                        loaded into `hicQC` in order to compare the quality of
                        multiple Hi-C libraries. (default: None)

Optional arguments:
  --outBam bam file, -b bam file
                        Output bam file to process. Optional parameter. A bam
                        file containing all valid Hi-C reads can be created
                        using this option. This bam file could be useful to
                        inspect the distribution of valid Hi-C reads pairs or
                        for other downstream analyses, but is not used by any
                        HiCExplorer tool. Computation will be significantly
                        longer if this option is set. (default: None)
  --binSize BINSIZE [BINSIZE ...], -bs BINSIZE [BINSIZE ...]
                        Size in bp for the bins. The bin size depends on the
                        depth of sequencing. Use a larger bin size for
                        libraries sequenced with lower depth. Alternatively,
                        the location of the restriction sites can be given
                        (see --restrictionCutFile). Optional for mcool file
                        format: Define multiple resolutions which are all a
                        multiple of the first value. Example: --binSize 10000
                        20000 50000 will create a mcool file formate
                        containing the three defined resolutions. (default:
                        10000)
  --restrictionCutFile BED file, -rs BED file
                        BED file with all restriction cut places (output of
                        "findRestSite" command). Should contain only mappable
                        restriction sites. If given, the bins are set to match
                        the restriction fragments (i.e. the region between one
                        restriction site and the next). (default: None)
  --minDistance MINDISTANCE
                        Minimum distance between restriction sites.
                        Restriction sites that are closer than this distance
                        are merged into one. This option only applies if
                        --restrictionCutFile is given. (default: 300)
  --maxDistance MAXDISTANCE
                        This parameter is now obsolete. Use
                        --maxLibraryInsertSize instead (default: None)
  --maxLibraryInsertSize MAXLIBRARYINSERTSIZE
                        The maximum library insert size defines different cut
                        offs based on the maximum expected library size. *This
                        is not the average fragment size* but the higher end
                        of the the fragment size distribution (obtained using
                        for example a Fragment Analyzer or a Bioanalyzer)
                        which usually is between 800 to 1500 bp. If this value
                        if not known use the default of 1000. The insert value
                        is used to decide if two mates belong to the same
                        fragment (by checking if they are within this max
                        insert size) and to decide if a mate is too far away
                        from the nearest restriction site. (default: 1000)
  --restrictionSequence RESTRICTIONSEQUENCE, -seq RESTRICTIONSEQUENCE
                        Sequence of the restriction site. (default: None)
  --danglingSequence DANGLINGSEQUENCE
                        Sequence left by the restriction enzyme after cutting.
                        Each restriction enzyme recognizes a different DNA
                        sequence and, after cutting, they leave behind a
                        specific "sticky" end or dangling end sequence. For
                        example, for HindIII the restriction site is AAGCTT
                        and the dangling end is AGCT. For DpnII, the
                        restriction site and dangling end sequence are the
                        same: GATC. This information is easily found on the
                        description of the restriction enzyme. The dangling
                        sequence is used to classify and report reads whose 5'
                        end starts with such sequence as dangling-end reads. A
                        significant portion of dangling-end reads in a sample
                        are indicative of a problem with the re-ligation step
                        of the protocol. (default: None)
  --region CHR:START-END, -r CHR:START-END
                        Region of the genome to limit the operation to. The
                        format is chr:start-end. It is also possible to just
                        specify a chromosome, for example --region chr10
                        (default: None)
  --keepSelfCircles     If set, outward facing reads without any restriction
                        fragment (self circles) are kept. They will be counted
                        and shown in the QC plots. (default: False)
  --minMappingQuality MINMAPPINGQUALITY
                        minimum mapping quality for reads to be accepted.
                        Because the restriction enzyme site could be located
                        on top of the read, this may reduce the reported
                        quality of the read. Thus, this parameter may be
                        adusted if too many low quality (but otherwise
                        perfectly valid Hi-C reads) are found. A good strategy
                        is to make a test run (using the --doTestRun), then
                        checking the results to see if too many low quality
                        reads are present and then using the bam file
                        generated to check if those low quality reads are
                        caused by the read not being mapped entirely.
                        (default: 15)
  --threads THREADS     Number of threads. Using the python multiprocessing
                        module. One master process which is used to read the
                        input file into the buffer and one process which is
                        merging the output bam files of the processes into one
                        output bam file. All other threads do the actual
                        computation. Minimum value for the '--thread'
                        parameter is 2. The usage of 8 threads is optimal if
                        you have an HDD. A higher number of threads is only
                        useful if you have a fast SSD. Have in mind that the
                        performance of hicBuildMatrix is influenced by the
                        number of threads, the speed of your hard drive and
                        the inputBufferSize. To clearify: the peformance with
                        a higher thread number is not negative influenced but
                        not positiv too. With a slow HDD and a high number of
                        threads many threads will do nothing most of the time.
                        (default: 4)
  --inputBufferSize INPUTBUFFERSIZE
                        Size of the input buffer of each thread. 400,000 read
                        pairs per input file per thread is the default value.
                        Reduce this value to decrease memory usage. (default:
                        400000)
  --doTestRun           A test run is useful to test the quality of a Hi-C
                        experiment quickly. It works by testing only 1,000,000
                        reads. This option is useful to get an idea of quality
                        control values like inter-chromosomal interactions,
                        duplication rates etc. (default: False)
  --skipDuplicationCheck
                        Identification of duplicated read pairs is memory
                        consuming. Thus, in case of memory errors this check
                        can be skipped. However, consider running a
                        `--doTestRun` first to get an estimation of the
                        duplicated reads. (default: False)
  --help, -h            show this help message and exit
  --version             show program's version number and exit

I tested this with HiCExplorer version 2.2.1 and python 3.6.

Have you installed HiCExplorer in its own environment? Is it maybe possible that you have multiple HiCExplorer versions? What is the output of which hicBuildMatrix and whereis hicBuildMatrix?

Best,

Joachim

abhisheksinghnl commented 5 years ago

Hi,

Thank you for your reply. Here are the outputs.

$ which hicBuildMatrix
/tools/eb/software/Miniconda3/4.4.10/envs/hicexplorer/bin/hicBuildMatrix

$ whereis hicBuildMatrix
hicBuildMatrix: /gpfs/gssgpfs1/biogrid/tools/eb/software/Miniconda3/4.4.10/envs/hicexplorer/bin/hicBuildMatrix /gpfs/gssgpfs1/biogrid/tools/eb/software/Miniconda3/4.4.10/bin/hicBuildMatrix

I see the problem, but how should I fix it?
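A minimal sketch for spotting duplicate installs, assuming a POSIX shell: the helper below walks the directories in PATH in lookup order and prints every executable with the given name. The function name `list_path_copies` is hypothetical and not part of HiCExplorer or conda. Unlike `whereis`, it only looks at PATH, so a copy that lives outside PATH will not be listed.

```bash
# Print every executable called "$1" found in the directories of $PATH,
# in the order the shell would search them. The first line printed is the
# copy that actually runs when the bare command name is typed.
# The body runs in a subshell, so the IFS and globbing changes stay local.
list_path_copies() (
    name=$1
    IFS=:
    set -f    # do not expand glob characters while splitting $PATH
    for dir in $PATH; do
        [ -n "$dir" ] || continue
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)

# Example call (output depends on the machine):
#   list_path_copies hicBuildMatrix
```

If more than one line is printed, more than one copy is reachable through PATH, and only the first one is used.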

joachimwolff commented 5 years ago

Remove all HiCExplorer versions and install it again. First run conda remove hicexplorer to make conda happy, then run pip uninstall hicexplorer repeatedly until no version is installed anymore. Make sure no HiCExplorer is left, and then install HiCExplorer again with conda.
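The "run it until nothing is left" step can be sketched as a small loop. This is only an illustration under assumptions: the function name `purge_pip_copies` and the cap of 10 attempts are made up for the example, and it relies on `pip show` returning a non-zero exit status once the package is no longer installed.

```bash
# Repeatedly uninstall a pip package until `pip show` no longer finds it.
# Stops with an error after 10 attempts (an arbitrary safety limit)
# or as soon as an uninstall fails.
purge_pip_copies() {
    pkg=$1
    attempts=0
    while pip show "$pkg" >/dev/null 2>&1; do
        attempts=$((attempts + 1))
        if [ "$attempts" -gt 10 ]; then
            echo "giving up: $pkg is still installed after 10 attempts" >&2
            return 1
        fi
        pip uninstall -y "$pkg" || return 1
    done
}

# Possible order of operations, following the advice above:
#   conda remove hicexplorer
#   purge_pip_copies hicexplorer
```

The loop only affects the pip that is first on PATH, so it would need to be repeated in every environment that holds a copy.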

abhisheksinghnl commented 5 years ago

Hi,

I uninstalled hicexplorer from all the places.

I reinstalled it:

$ conda create -c bioconda --name hicexplorer hicexplorer

and checked it:

$ which hicBuildMatrix
/tools/eb/software/Miniconda3/4.4.10/envs/hicexplorer/bin/hicBuildMatrix

$ whereis hicBuildMatrix
hicBuildMatrix: /gpfs/gssgpfs1/biogrid/tools/eb/software/Miniconda3/4.4.10/envs/hicexplorer/bin/hicBuildMatrix

However, the version that is getting installed is 1.3.

$ hicBuildMatrix --version
hicBuildMatrix 1.3

:(

An older version is being installed. How can I bypass this?

joachimwolff commented 5 years ago

conda create -c bioconda --name hicexplorer_new hicexplorer=2.2.1 python=3.6

gtrichard commented 5 years ago

$ hicBuildMatrix --version
hicBuildMatrix 2.2.1

Should be the expected outcome. I don't get how 1.3 can be installed from:

conda create -c bioconda --name hicexplorer hicexplorer

bgruening commented 5 years ago

@abhisheksinghnl can you please paste the conda create output here, i.e. the versions that are fetched and installed? Thanks!

abhisheksinghnl commented 5 years ago

Hi,

I have used this command and it seems that all is fine now.

conda create -c bioconda -c conda-forge --name hicexplorer_new hicexplorer=2.2.1 python=3.6

Thank you for your help.
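After a pinned install like the one above, a quick check that the tool reports the requested version can catch a stale copy early. The sketch below assumes that `--version` prints a line of the form `hicBuildMatrix 2.2.1`, as shown earlier in this thread. The function name `check_tool_version` is hypothetical.

```bash
# Compare the version a tool reports with the version that was requested.
# Returns 0 on a match, 1 on a mismatch, 2 if the tool could not be run.
check_tool_version() {
    tool=$1
    expected=$2
    # Capture stderr too, since some programs print their version there.
    reported=$("$tool" --version 2>&1) || return 2
    reported=${reported##* }    # keep only the text after the last space
    if [ "$reported" = "$expected" ]; then
        echo "$tool reports $reported as expected"
    else
        echo "$tool reports $reported but $expected was requested" >&2
        return 1
    fi
}

# Example call inside the activated environment:
#   check_tool_version hicBuildMatrix 2.2.1
```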

bgruening commented 5 years ago

Cool. If you have time, please post the output of your previous command; I'm curious.