Open maehler opened 3 years ago
I see now that there already is an issue related to this (#129). Apologies for not taking a proper look before posting. Feel free to close it, but I will leave the issue open since it’s a bit more detailed than the existing one.
This is handled for large genomes using the scale factor. @dudcha has a lot of details for handling this: https://www.dnazoo.org/methods But I will leave this issue as an enhancement because the file format should support this use case in the future.
I'm working on a relatively big genome (~20 Gbp), and I want to create an assembly hic-file that can be used for editing in Juicebox. Previously I have created hic-files for individual chromosomes using Juicer Tools without any issues, but since there seem to be clear contacts between chromosomes, I also wanted to create a file for the whole genome, including unplaced scaffolds/contigs. To accomplish this, I stripped the chromosome information from the pairs file and modified the positions so that they represent cumulative positions based on the original sequence lengths. The result totals almost 19 Gbp of sequence, which overflows a 32-bit integer.
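The coordinate transformation described above can be sketched roughly as follows. This is only an illustration, not the exact script used for this issue: the function names and the toy sequence sizes are hypothetical, and reading the sizes from the chromsizes file and rewriting the pairs file are left out.

```python
from itertools import accumulate

def cumulative_offsets(chrom_sizes):
    """Map each sequence name to its start offset in the concatenated genome.

    chrom_sizes is an ordered list of (name, length) tuples, e.g. as read
    from a chromsizes file (the parsing itself is not shown here).
    """
    names = [name for name, _ in chrom_sizes]
    lengths = [length for _, length in chrom_sizes]
    # Offset of sequence i is the sum of the lengths of all sequences before it.
    starts = [0] + list(accumulate(lengths))[:-1]
    return dict(zip(names, starts))

def to_cumulative(chrom, pos, offsets):
    """Convert a per-sequence position to a position in the concatenated genome."""
    return offsets[chrom] + pos

# Toy sizes, purely illustrative.
sizes = [("chr1", 1000), ("chr2", 500), ("scaffold_1", 200)]
offsets = cumulative_offsets(sizes)
print(to_cumulative("chr2", 10, offsets))
```

Python integers do not overflow, so the overflow only shows up later, when the cumulative positions are read by a tool that stores them as 32-bit integers.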
As a side note, this approach was successfully used to generate a hic-file for all the unplaced scaffolds/contigs. This totalled about 1.8 Gbp of sequence data, i.e. something that fits nicely into a 32-bit integer.
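For reference, the arithmetic behind the two cases can be checked with a small sketch. The totals are the approximate figures quoted above, and the helper names are hypothetical. The `min_scale_factor` function is only my assumption of how a scale factor like the one mentioned in the reply could be chosen (the smallest integer divisor that brings the largest coordinate back under the signed 32-bit limit); it is not taken from Juicer Tools or the DNA Zoo methods.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer: 2,147,483,647

def fits_int32(total_length):
    """Return True if the largest coordinate fits in a signed 32-bit integer."""
    return total_length <= INT32_MAX

def min_scale_factor(total_length):
    """Smallest integer s such that total_length // s fits in a signed 32-bit integer.

    Assumption: coordinates would be scaled by integer (floor) division.
    """
    return total_length // (INT32_MAX + 1) + 1

unplaced_total = 1_800_000_000   # ~1.8 Gbp, unplaced scaffolds/contigs only
genome_total = 19_000_000_000    # ~19 Gbp, whole genome

print(fits_int32(unplaced_total))  # the ~1.8 Gbp case fits
print(fits_int32(genome_total))    # the ~19 Gbp case does not
print(min_scale_factor(genome_total))
```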
Reproducible example
Input files
Pairs in short format (test.pairs)
Chromosome sizes (test.chromsizes)
Running
I try to generate the hic file with the command
The output is
Some thoughts
I don't know if this genome would behave well in Juicebox as it is. It's mostly just me trying different things to see what works best for our particular analysis. Looking at supplementary table S2 from https://www.cell.com/cell-systems/fulltext/S2405-4712(16)30219-8 it looks like chromosome lengths are limited to 32-bit integers, so this is likely an issue that extends to the hic format itself, as well as to Juicebox. I appreciate that it might be a lot of work to add support for larger genomes, but I still wanted to put it out there to show that this is a problem, even if it only affects a small subset of your users.
System details
Java
OS
Juicer Tools