3DGenomes / TADbit

TADbit is a complete Python library to deal with all steps to analyze, model and explore 3C-based data. With TADbit the user can map FASTQ files to obtain raw interaction binned matrices (Hi-C like matrices), normalize and correct interaction matrices, identify and compare the so-called Topologically Associating Domains (TADs), build 3D models from the interaction matrices, and finally, extract structural properties from the models. TADbit is complemented by TADkit for visualizing 3D models
GNU General Public License v3.0

Matrix is too sparse after mapping #374

Closed: manuelfmerino closed this issue 2 years ago

manuelfmerino commented 2 years ago

Hello,

I am trying to process the data from a capture Hi-C experiment on human chromosome 12. I am following the steps from the tutorial with data from a publication, aiming to reproduce results similar to theirs. I'm using fragment-based mapping. My problem is that after keeping only the uniquely mapped read pairs, the number of reads drops enormously, and my interaction matrices become sparse. Here are the numbers:

  1. Original number of reads in downloaded dataset (as provided by the publication): 84,386,237
  2. My number of uniquely mapped reads (per end). End 1: 10,513,780. End 2: 10,429,881
  3. My number of uniquely mapped read pairs (result of get_intersection): 6,331,686

As you can see, the number of uniquely mapped read pairs is less than 10% of the original number of reads. However, in the publication I'm taking as a reference, the number of uniquely mapped pairs at this stage of the pipeline is 52,386,237, which I believe is a more reasonable number. Am I right that my numbers are simply too low?
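To make the comparison concrete, the fractions can be worked out from the counts listed above. This is only a minimal arithmetic sketch: the variable names are made up for illustration, and the figures are copied from this post rather than measured independently.

```python
# Counts copied from the list above; variable names are illustrative only.
total_reads = 84_386_237       # reads in the downloaded dataset
mapped_end1 = 10_513_780       # uniquely mapped, end 1
mapped_end2 = 10_429_881       # uniquely mapped, end 2
pairs_mine = 6_331_686         # pairs kept after get_intersection
pairs_reference = 52_386_237   # pairs reported by the reference publication

# Each value is expressed as a fraction of the original number of reads.
rates = {
    "end 1 uniquely mapped": mapped_end1 / total_reads,
    "end 2 uniquely mapped": mapped_end2 / total_reads,
    "pairs kept (mine)": pairs_mine / total_reads,
    "pairs kept (reference)": pairs_reference / total_reads,
}

for label, value in rates.items():
    print(f"{label}: {value:.1%}")
```

Seen this way, most of the loss already happens at the per-end mapping stage, before the two ends are intersected.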

Some information on my procedure:

Fatal error (gem-indexer_fasta2meta+cont.c:368,main) Malformed FASTA/FASTQ file (sequence #1)

I tried to solve it but didn't find much on the Internet and gave up. I installed GEM3 instead using conda (I performed the whole TADbit installation with conda), and it worked without a problem. I assumed it was okay, but now that I'm getting weird results, I'm starting to wonder whether the Chr12 file I used as a reference genome is correct, or whether a different version of GEM might be causing this.
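One way to rule out a malformed reference file is a generic sanity check of the FASTA itself. The sketch below is an assumption about what might be wrong (missing header, stray characters, Windows line endings, empty records); it does not reproduce whatever the GEM indexer actually validates, and `check_fasta` is a hypothetical helper, not part of TADbit or GEM.

```python
# IUPAC nucleotide codes, upper and lower case (soft-masked genomes use lower case).
IUPAC = set("ACGTUNRYKMSWBDHVacgtunrykmswbdhv")


def check_fasta(lines):
    """Return a list of (line_number, problem) tuples for a plain-text FASTA.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Line number 0 is used for problems that concern the file as a whole.
    """
    problems = []
    seen_header = False
    residues_in_record = 0
    reported_cr = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.endswith("\r"):
            if not reported_cr:  # report Windows line endings only once
                problems.append((number, "carriage return (Windows line ending)"))
                reported_cr = True
            line = line.rstrip("\r")
        if not line:
            continue  # blank lines are skipped, not flagged
        if line.startswith(">"):
            if seen_header and residues_in_record == 0:
                problems.append((number, "previous header has no sequence"))
            seen_header = True
            residues_in_record = 0
        elif not seen_header:
            problems.append((number, "sequence data before the first '>' header"))
        else:
            bad = set(line) - IUPAC
            if bad:
                problems.append((number, "unexpected characters: " + "".join(sorted(bad))))
            residues_in_record += len(line)
    if not seen_header:
        problems.append((0, "no '>' header found"))
    elif residues_in_record == 0:
        problems.append((0, "last header has no sequence"))
    return problems
```

It could be run on the reference by opening the file as text and passing the handle to `check_fasta`; an empty list means none of these particular problems were found, which does not by itself guarantee that the indexer will accept the file.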

Just in case, I'm running TADbit 1.0.1 (the latest version available on conda). I also redownloaded the files and the reference genome and reran everything, obtaining identical results.

Any help would be more than welcome; I've been struggling with this for a few days now.

Thanks a lot, Manuel F. Merino

david-castillo commented 2 years ago

Hi Manuel,

The main difference between GEM2 and GEM3 is that they allow a different number of mismatches when mapping the reads, so we cannot expect exactly the same number of mapped reads. But that does not explain such big differences in the numbers.

In my view, your main problem lies between step 1 and step 2: the number of reads that map is suspiciously low. Which publication are you using for the test?

Regards

David

manuelfmerino commented 2 years ago

Hi David,

Thanks a lot for your answer. I agree that the problem likely comes from these steps. The publication I'm following is currently under review (it is by some collaborators of ours). I'm having trouble reaching whoever was in charge of the capture Hi-C data processing, and figured it would be faster to try to reprocess the data myself. While the article is not yet public, the dataset is, and can be found here: https://www.ebi.ac.uk/ena/browser/view/PRJEB42293

Cheers, Manuel