hicBuildMatrix sensitive to input fasta sequence when used with fragment-sized bins

dmalzl commented 5 years ago

I recently processed Hi-C data provided by a colleague of mine, who performed the alignment steps. I then used hicBuildMatrix in conjunction with findRestSites to produce a contact matrix with fragment-sized bins. However, while counting reads in different bins went well the enlargment steps of the bins failed to complete saying:

Traceback (most recent call last):
  File "/users/daniel.malzl/.conda/envs/hicexplorer/bin/hicBuildMatrix", line 7, in <module>
    main()
  File "/users/daniel.malzl/.conda/envs/hicexplorer/lib/python3.6/site-packages/hicexplorer/hicBuildMatrix.py", line 1371, in main
    bin_intervals = enlarge_bins(bin_intervals[:], chrom_sizes)
  File "/users/daniel.malzl/.conda/envs/hicexplorer/lib/python3.6/site-packages/hicexplorer/hicBuildMatrix.py", line 645, in enlarge_bins
    bin_intervals[idx] = (chrom, start, chrom_sizes_dict[chrom])
KeyError: 'chr11_gl000202_random'

Tracing the error in code revealed as discrepancy between the way the script retrieves the chromosome sizes and the way it tries to expand bins to locate them next to each other. In particular, hicBuildMatrix first retrieves the sizes of the chromosomes to which reads are mapped from the BAM files provided. Subsequent reading of the restrictionCutFile starts the counting of reads in the defined bins. However, the enlarge_bins function expects both, the bin intervals (which are the cutSiteIntervals in our case) and the chromosome sizes without making sure that these two match (i.e. contain the same chromosomes).

Referring to the initial problem I used the whole hg19 genome as fasta downloaded from UCSC to construct the restrictionCutFile, which also contains random and non-ordered contigs, that are not present in the index used to align the reads and thus are not in the BAM header, which in turn produces the above error, because the chrom_size_dict only contains chromosomes present in the BAM.

While this is only a subtle bug and nothing really frustrating, I suggest mentioning this in the documentation or just write a short check before read counting making sure only those chromosomes get incorporated for which we also have an entry in the BAM header.

joachimwolff commented 5 years ago

Hi Daniel,

thanks for the recommendation, we will consider it in a future release.

Best,

Joachim

14stutzmanav commented 4 years ago

Hi Daniel,

I am having the same problem in the dm6 genome. To clarify, this issue can be gotten around by aligning to the same fasta as the one used to generate the restrictionCutFile?

Thanks so much! Alexis

dmalzl commented 4 years ago

Yes, the chromosomes you align to are later used to find the restriction fragments in the restriction cut site file. If you use a different fasta to generate this file than you have to generate your aligner index then it produces an error. Another way of getting around this would be to remove all non-canonical chromosomes (i.e. all _random stuff and just keeping chr*)

Best, Daniel

joachimwolff commented 4 years ago

With release 3.5 we added the option to define a file with the sizes of the chromosomes. If this file is given, the chromosomes and their sizes as given by the BAM input is ignored. The parameter is --chromosomeSizes.

The file needs the structure:

chrName\tlength
chrName2\tlength

For example: http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/dm3.chrom.sizes

I think this should fix your issue, if not, pleas reopen the issue.

Best,

Joachim

distilledchild commented 2 years ago

@dmalzl Hi Daniel, I read this threads because I got this error when making matrix .h5 file which will be used for an assembly tool, HiCAssembly. That means, my Hi-C fastq files were aligned to draft genome assembly composed of lots of contigs and scaffolds. So do you suggest that I should make and use a size file of contigs and scaffolds for the draft genome that I want to rescaffold?

dmalzl commented 2 years ago

Hi @theshowmustgolangon,

Since this was already a while ago. I had to reread what the exact problem was here. The whole trick here is to make sure that the genome assembly you align to matches the chromsizes file. So if you compute a size file from the fasta you built the aligner index from it should be fine.

deeptools / HiCExplorer

hicBuildMatrix sensitive to input fasta sequence when used with fragment-sized bins #452