aidenlab / juicer

A One-Click System for Analyzing Loop-Resolution Hi-C Experiments
http://aidenlab.org
MIT License
410 stars 181 forks source link

Integer overflow when creating assembly hic file #222

Open maehler opened 3 years ago

maehler commented 3 years ago

I'm working on a relatively big genome (~20Gbp), and I want to create an assembly hic-file that can be used for editing in Juicebox. Previously I have created hic-files for individual chromosomes using Juicer Tools without any issues, but since there seems to be clear contacts between chromosomes, I also wanted to create a file for the whole genome, including unplaced scaffolds/contigs. In order to accomplish this, I have stripped the chromosome information from the pairs file and modified the positions so that they represent the cumulative positions based on the original sequence lengths. This then totals almost 19 Gbp of sequence, which would cause an overflow for a 32-bit integer.

As a side note, this approach was successfully used to generate a hic-file for all the unplaced scaffolds/contigs. This totalled about 1.8 Gbp of sequence data, i.e. something that fits nicely into a 32-bit integer.

Reproducible example

Input files

Pairs in short format (test.pairs)

0       assembly        18566221006     0       1       assembly        18566221208     1
0       assembly        18566221034     0       1       assembly        18566221181     1
0       assembly        18566221034     0       1       assembly        18566221181     1
0       assembly        18566221169     0       1       assembly        18566221613     1
0       assembly        18566221286     0       1       assembly        18566221401     1
0       assembly        18566221286     0       1       assembly        18566221401     1
0       assembly        18566221520     0       1       assembly        18566221627     1
0       assembly        18566221601     0       1       assembly        18566221767     1
0       assembly        18566221601     0       1       assembly        18566221767     1
0       assembly        18566222147     0       1       assembly        18566222275     1

Chromosome sizes (test.chromsizes)

assembly 18592910868

Running

I try to generate the hic file with the command

java -jar juicer_tools_1.22.01.jar pre test.pairs test.hic test.chromsizes

The output is

WARN [2021-04-07T09:15:22,677]  [Globals.java:138] [main]  Development mode is enabled                                                       
java.lang.NumberFormatException: For input string: "18592910868"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)                                                     
        at java.lang.Integer.parseInt(Integer.java:583)
        at java.lang.Integer.parseInt(Integer.java:615)
        at juicebox.data.HiCFileTools.loadChromosomes(HiCFileTools.java:157)
        at juicebox.tools.clt.old.PreProcessing.readArguments(PreProcessing.java:93)                                                         
        at juicebox.tools.HiCTools.main(HiCTools.java:93)
Using 1 CPU thread(s)
Not including fragment map
Start preprocess
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
        at juicebox.tools.utils.original.MatrixZoomDataPP.mergeAndWriteBlocks(MatrixZoomDataPP.java:276)                                     
        at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:970)                                                     
        at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:659)                                                       
        at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:425)                                                      
        at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:139)                                                                  
        at juicebox.tools.HiCTools.main(HiCTools.java:94)

Some thoughts

I don't know if this genome would behave well in Juicebox as it is. It's mostly just me trying different things in order to see what works the best for our particular analysis. Looking at supplementary table S2 from https://www.cell.com/cell-systems/fulltext/S2405-4712(16)30219-8 it looks like chromosome lengths are limited to 32-bit integers, so this will likely be an issue that translated to the hic format itself, as well as Juicebox. I appreciate that it might be a lot of work to add support for larger genomes, but I still wanted to put it out there to show that this is a problem, even if it's only for a small subset of your users.

System details

Java

> java -version
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)

OS

> lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.9.2009 (Core)
Release:        7.9.2009
Codename:       Core

Juicer Tools

> java -jar juicer_tools_1.22.01.jar
WARN [2021-04-07T10:08:38,488]  [Globals.java:138] [main]  Development mode is enabled
Juicer Tools Version 1.22.01
Usage:
        dump <observed/oe> <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr1>[:x1:x2] <chr2>[:y1:y2] <BP/FRAG> <binsize> [outfile]
        dump <norm/expected> <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr> <BP/FRAG> <binsize> [outfile]
        dump <loops/domains> <hicFile URL> [outfile]
        pre [options] <infile> <outfile> <genomeID>
        addNorm <input_HiC_file> [input_vector_file]
        pearsons [-p] <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr> <BP/FRAG> <binsize> [outfile]
        eigenvector -p <NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr> <BP/FRAG> <binsize> [outfile]
        apa <hicFile(s)> <PeaksFile> <SaveFolder>
        arrowhead <hicFile(s)> <output_file>
        hiccups <hicFile> <outputDirectory>
        hiccupsdiff <firstHicFile> <secondHicFile> <firstLoopList> <secondLoopList> <outputDirectory>
        validate <hicFile>
        -h, --help print help
        -v, --verbose verbose mode
        -V, --version print version
Type juicer_tools <commandName> for more detailed usage instructions
maehler commented 3 years ago

I see now that there already is an issue related to this (#129). Apologies for not taking a proper look before posting. Feel free to close it, but I will leave the issue open since it’s a bit more detailed than the existing one.

sa501428 commented 2 years ago

This is handled for large genomes using the scale factor. @dudcha has a lot of details for handling this: https://www.dnazoo.org/methods But I will leave this issue as an enhancement because the file format should support this use case in the future.