Matrix is too sparse after mapping

manuelfmerino commented 2 years ago

Hello,

I am trying to process the data from a capture Hi-C experiment of human chromosome 12. I am following the steps from the tutorial using data from a publication, trying to reach results similar to theirs. I'm using fragment-based mapping. My problem is that after keeping only the uniquely mapped read pairs, the number of reads decreases enormously, and my interaction matrices become sparse. Here are the numbers:

Original number of reads in downloaded dataset (as provided by the publication): 84,386,237
My number of uniquely mapped reads (at each end). End 1: 10,513,780. End 2: 10,429,881
My number of uniquely mapped read pairs (result of get_intersection): 6,331,686

As you can see, the number of uniquely mapped read pairs is less than 10% of the original number of reads (about 7.5%). However, in the publication I'm taking as a reference, the number of uniquely mapped pairs is 52,386,237 at this stage of the pipeline, which I believe is a more reasonable number. Am I right that my numbers are simply too low?
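For reference, the retention at each stage can be worked out from the counts quoted above. This is only a quick arithmetic sketch; the variable names are made up for illustration and nothing here calls TADbit.

```python
# Read counts quoted in the post above.
total_reads = 84_386_237        # reads in the downloaded dataset
mapped_end1 = 10_513_780        # uniquely mapped, end 1
mapped_end2 = 10_429_881        # uniquely mapped, end 2
pairs_mine = 6_331_686          # result of get_intersection
pairs_publication = 52_386_237  # value reported by the publication


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


rates = {
    "end 1 mapped / total": percent(mapped_end1, total_reads),
    "end 2 mapped / total": percent(mapped_end2, total_reads),
    "my pairs / total": percent(pairs_mine, total_reads),
    "publication pairs / total": percent(pairs_publication, total_reads),
    "my pairs / end 1 mapped": percent(pairs_mine, mapped_end1),
}

for label, value in rates.items():
    print(f"{label}: {value:.1f}%")
```

The largest drop happens at the per-end mapping step (roughly 12% of reads kept at each end), while about 60% of the end 1 reads survive the intersection.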

Some information on my procedure:

I believe that my original number of Hi-C reads is right: I downloaded the files from the proper links and compared their length to that described in the publication, and they match perfectly.
I downloaded only human Chr12 from the GRCh37 assembly, as this was the assembly employed in the publication (I know it's an outdated one).
I indexed the genome using the GEM3 indexer (v3.6.0). I originally intended to use GEM version 2 as described in the TADbit installation guide (i3 version), but when I tried to run the indexing on my .fasta file of chromosome 12, it threw the following error:

Fatal error (gem-indexer_fasta2meta+cont.c:368,main) Malformed FASTA/FASTQ file (sequence #1)

I tried to solve it but didn't find much on the Internet and gave up. I installed GEM3 with conda instead (I performed the whole TADbit installation with conda), and it worked without a problem. I assumed it was okay, but since I'm now getting weird results, I'm starting to wonder whether the Chr12 file I used as a reference genome is correct, or whether the different version of GEM might be causing this.
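One way to rule out the reference file itself is a basic sanity check of the FASTA before indexing. The sketch below is plain Python and only looks for a few common problems (sequence before the first header, empty lines, carriage-return line endings, unexpected characters). It does not reproduce GEM's own validation, and the cause of the "Malformed FASTA/FASTQ" error above is not known. The function name `check_fasta` and the allowed alphabet are assumptions made for illustration.

```python
import io

# Assumption: the reference uses only the plain DNA alphabet plus N.
# Extend this set if your file legitimately contains IUPAC ambiguity codes.
VALID_BASES = set("ACGTNacgtn")


def check_fasta(handle):
    """Return a list of (line_number, problem) tuples for a FASTA text stream."""
    problems = []
    seen_header = False
    for number, raw in enumerate(handle, start=1):
        if raw.endswith("\r\n") or raw.endswith("\r"):
            problems.append((number, "carriage-return line ending"))
        line = raw.rstrip("\r\n")
        if not line:
            problems.append((number, "empty line"))
            continue
        if line.startswith(">"):
            seen_header = True
            if len(line) == 1:
                problems.append((number, "empty header"))
            continue
        if not seen_header:
            problems.append((number, "sequence before first '>' header"))
        bad = set(line) - VALID_BASES
        if bad:
            problems.append(
                (number, "unexpected characters: " + "".join(sorted(bad)))
            )
    return problems


# Small in-memory demo; the second and third lines are deliberately broken.
sample = io.StringIO(">chr12\nACGTNNNN\r\nACGU\n")
demo_problems = check_fasta(sample)
for number, problem in demo_problems:
    print(f"line {number}: {problem}")
```

To check a real file, open it with `open(path, newline="")` so that line endings are passed through untranslated, and hand the file object to `check_fasta`. An empty result only means that none of these particular problems were found.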

I then ran a fragmented mapping using full_mapping, and subsequently ran parse_fasta, parse_map and get_intersection.

Just in case, I'm running TADbit 1.0.1 (the latest version available on conda). I also redownloaded the files and reference genome and reran everything, obtaining identical results.

Any help would be more than welcome, I've been struggling with this for a few days now.

Thanks a lot, Manuel F. Merino

david-castillo commented 2 years ago

Hi Manuel,

The main difference between gem2 and gem3 is that they allow different mismatches when mapping the reads, so we cannot expect the same number of mapped reads. But that does not explain such big differences in the numbers.

In my view, your main problem comes between step 1 and step 2: the number of reads that map is suspiciously low. Which publication are you using to test?

Regards

David

manuelfmerino commented 2 years ago

Hi David,

Thanks a lot for your answer. I agree that the problem likely comes from these steps. The publication I'm following is currently under review (by some collaborators of ours). I'm having trouble reaching whoever was in charge of the capture Hi-C data processing, and figured it would be faster to try to reprocess the data myself. While the article is not yet public, the dataset is, and can be found here: https://www.ebi.ac.uk/ena/browser/view/PRJEB42293

Cheers, Manuel

3DGenomes / TADbit

Matrix is too sparse after mapping #374