deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0

Very low percentage of pair used #361

Closed abhisheksinghnl closed 5 years ago

abhisheksinghnl commented 5 years ago

Hi,

I have mapped my reads and built the Hi-C matrix as suggested in the manual, but when I run the QC I get a very low percentage of pairs used (please see the attached DiscardedTable and UnmappedTable).

I was just wondering what could be the possible reason for this.

Is it because chimeric reads are also being treated as PCR duplicates, or is something else technical leading to this large number of unmapped reads and low percentage of mapped reads?

Please help, as I need to explain this to my fellow wet-lab colleagues. They strongly believe it is a computational issue, while I believe it is more of an experimental mix-up.

Please find the table in the attachment.

Thank you

gtrichard commented 5 years ago

Which command lines did you use to build the matrices and to do the QC, please?

0% dangling ends and a lot of "same fragment" reads is extremely weird.

Did you do a test/shallow sequencing run before sequencing these hundreds of millions of reads? That is usually the best way to check that everything went fine on the wet-lab side, e.g. a 5 million read run. QC results usually don't differ much from the fully sequenced data of the same library.

This could be experimental, but having the command lines used would help. Did you try making a restriction-fragment-resolution matrix? It usually helps to better assess the quality of the libraries.

abhisheksinghnl commented 5 years ago

Hi,

Here is the hicBuildMatrix command:

hicBuildMatrix --samFiles log9.bam log10.bam --QCfolder hicU2 -bs 25000 --outBam U2.bam --minDistance 300 --maxLibraryInsertSize 1000 --restrictionSequence GATC --outFileName Matrix.HiC.U2.cool --minMappingQuality 5 --skipDuplicationCheck --threads 8 -ze hicU2.err -zmem 700G -zslot 8 -za

and for QC

hicQC --logfiles /hicT1/QC.log /hicT2/QC.log /hicU1/QC.log /hicU2/QC.log --labels "T1" "T2" "U1" "U2" --outputFolder QC_all_samples

gtrichard commented 5 years ago

Please try the following:

hicBuildMatrix --samFiles log9.bam log10.bam --QCfolder hicU2 --outBam U2.bam --restrictionSequence GATC --restrictionCutFile GATC.bed --outFileName Matrix.HiC.U2.cool --skipDuplicationCheck --threads 8 -ze hicU2.err -zmem 700G -zslot 8 -za

hicQC --logfiles /hicT1/QC.log /hicT2/QC.log /hicU1/QC.log /hicU2/QC.log --labels "T1" "T2" "U1" "U2" --outputFolder QC_all_samples

To get GATC.bed please use findRestSite: https://hicexplorer.readthedocs.io/en/latest/content/tools/findRestSite.html#findrestsite
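For intuition, findRestSite essentially scans the genome for the restriction pattern and writes the matches as a BED file. A minimal Python sketch of that idea (illustrative only; `find_rest_sites` is a made-up name, not HiCExplorer's actual code):

```python
import re

def find_rest_sites(fasta_text, pattern="GATC"):
    """Scan FASTA text and return BED-style (chrom, start, end) tuples
    for every occurrence of the restriction pattern (0-based, half-open)."""
    chrom, seq_parts, records = None, [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if chrom is not None:
                records.append((chrom, "".join(seq_parts).upper()))
            chrom, seq_parts = line[1:].split()[0], []
        else:
            seq_parts.append(line.strip())
    if chrom is not None:
        records.append((chrom, "".join(seq_parts).upper()))
    sites = []
    for chrom, seq in records:
        for m in re.finditer(pattern, seq):
            sites.append((chrom, m.start(), m.end()))
    return sites

print(find_rest_sites(">chr1\nAAGATCCCGATCTT\n"))
# [('chr1', 2, 6), ('chr1', 8, 12)]
```

The real tool additionally supports IUPAC ambiguity codes and writes proper BED output; for anything beyond illustration, use findRestSite itself.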

By the way, GATC is the correct sequence recognised by the restriction enzyme used in the protocol (DpnII), right?

Usually the fraction of pairs used is around 0.50 (50%).

abhisheksinghnl commented 5 years ago

Hi,

Thanks for the edit to the command. I will try it and get back to you.

For the second question: yes, the RE is DpnII.

Fingers crossed.

abhisheksinghnl commented 5 years ago

Hi,

Without giving the bin size I run into memory issues. Even 700 GB was not enough :(

gtrichard commented 5 years ago

Can you try to split the bam files in half, just for the sake of testing?

abhisheksinghnl commented 5 years ago

Hi,

I used the following command and increased the memory to 1000 GB.

hicBuildMatrix --samFiles log1.bam log2.bam --QCfolder hicT1 --outBam T1.bam --restrictionSequence GATC --restrictionCutFile rest_site_positions.bed --danglingSequence GATC --outFileName New.Matrix.HiC.T1.cool --threads 8

I got following error at the end

INFO:hicmatrix.HiCMatrix:Only init object, no matrix given.
Traceback (most recent call last):
  File "/tools/eb/software/Miniconda3/4.4.10/envs/hicexplorer_new/bin/hicBuildMatrix", line 7, in <module>
    main()
  File "/tools/eb/software/Miniconda3/4.4.10/envs/hicexplorer_new/lib/python3.6/site-packages/hicexplorer/hicBuildMatrix.py", line 1381, in main
    if args.outFileName.name.endswith('.cool') and args.binSize is not None and len(args.binSize) > 2:
TypeError: object of type 'int' has no len()

What does this indicate, and how can I resolve it?

Please guide.

Thank you.

joachimwolff commented 5 years ago

Hi,

You can decrease the number of cores used to 4-6; if I/O is not really fast there will be no benefit, and it will waste memory. Moreover, to speed up processing and save memory, don't use the --outBam parameter.

The error you get is caused by a bug on our side. I will try to provide a patched version today. However, if you don't want to wait, you can run, as you did in the beginning, with a fixed binSize resolution instead of restriction-site resolution.
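The traceback boils down to calling len() on a plain int: when the option holds a single integer rather than a list, `len(args.binSize)` raises exactly this TypeError. A minimal illustration of the failure mode and a defensive fix (not HiCExplorer's actual code; `normalise_bin_size` is a hypothetical helper):

```python
def normalise_bin_size(bin_size):
    """Return binSize as a list, whatever shape the argument parser produced."""
    if bin_size is None:
        return []
    if isinstance(bin_size, int):
        return [bin_size]          # a bare int has no len(); wrap it first
    return list(bin_size)

# Reproduces the error from the traceback:
try:
    len(25000)
except TypeError as e:
    print(e)                       # object of type 'int' has no len()

print(normalise_bin_size(25000))   # [25000]
```

With the value normalised to a list up front, checks like `len(bin_size) > 2` become safe regardless of how the option was supplied.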

Best,

Joachim

abhisheksinghnl commented 5 years ago

Hi Joachim,

Thanks for the explanation.

The primary question still remains unanswered: what could be the reason for such a small percentage of used pairs? Is it because chimeric reads are treated as PCR duplicates, or because of multi-mapped reads? If yes, how can I fix it?

My guess is that allowing multi-mapping might solve my issue (I have a hexaploid plant genome). Could you please let me know how to allow multi-mapped reads?

Could you please take a guess?

Best regards Abhishek

gtrichard commented 5 years ago

I think it is rather that in a lot of read pairs the restriction enzyme site cannot be detected, explaining why you have a lot of "same fragment" reads (meaning your read pair is composed of only one locus in the genome instead of two loci separated by GATC). What is the average fragment length of the library (Bioanalyzer profile)? What is the read length?

You also have a lot of low-quality reads; I think it has to do with your --minMappingQuality 5 threshold. Try increasing it or keeping the default parameters.

You also have a lot of multi-mapped reads, which is a problem since HiCExplorer only keeps uniquely mapped reads.

If you take already published fastq files and apply the same command lines, what do you obtain?

abhisheksinghnl commented 5 years ago

Hi Richard,

thank you for your reply.

Okay, here is my plan, and I need your help with the second part.

  1. I will change --minMappingQuality from 5 to 15.

  2. How can I allow multi-mapped reads?

joachimwolff commented 5 years ago

Hi,

after some discussions in our team:

Best,

Joachim

joachimwolff commented 5 years ago

> Is it because of the chimeric reads being treated as PCR duplicates or multimapping of reads? If yes, how can I fix it?

The whole Hi-C method works only because of chimeric reads. We consider something a PCR duplicate as soon as two reads map to the identical position, which is quite unlikely to happen by chance. The classification of a multi-mapped read is not done by HiCExplorer; it is a decision made by your mapper. We think the mapping step is the one that goes wrong, or the data contains errors and is of bad quality.

The second big issue in T2, U1 and U2 is 'same_fragment': we consider a chimeric read pair as one fragment if the mapped positions differ by less than --maxLibraryInsertSize, which you set to 1000 base pairs and which is our suggestion. We don't know why you have that many reads which are more or less non-chimeric. We strongly recommend testing other, already published data from the same organism with HiCExplorer to figure out whether the error is caused by HiCExplorer, whether data from a genome like yours behaves this way, or whether something went wrong when creating this data.
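A toy sketch of the distance criterion described above (hypothetical helper names; the real classification in hicBuildMatrix involves more checks than just this distance):

```python
def classify_pair(chrom1, pos1, chrom2, pos2, max_insert=1000):
    """Toy version of the 'same fragment' distance check: mates on the
    same chromosome closer than the library insert-size cutoff are
    counted as one fragment, not as a Hi-C contact."""
    if chrom1 == chrom2 and abs(pos1 - pos2) < max_insert:
        return "same_fragment"
    return "candidate_contact"

print(classify_pair("chr1", 10_000, "chr1", 10_400))    # same_fragment
print(classify_pair("chr1", 10_000, "chr1", 250_000))   # candidate_contact
```

This is why a library dominated by short, non-chimeric inserts lands almost entirely in the same_fragment bucket regardless of how the downstream parameters are set.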

abhisheksinghnl commented 5 years ago

Hi Joachim,

Just to make it clear, I have used the following parameters for alignment using bwa mem:

bwa mem -A1 -B4 -E50 -L0

Would you suggest any additional parameters or tweaks to the present ones?

Thank you


joachimwolff commented 5 years ago

Hi Abhishek,

Please try different mappers like bowtie or hisat and use their --reorder parameter. I don't have experience with plant genomes; maybe you need to adjust the parameters there. I don't know how well the three mappers we tested (bowtie, hisat and BWA) work for plant genomes; maybe they don't work for them at all, or there is something special you need to keep in mind when creating the reference genome. Please also check all the trivial issues: correct reference genome version, single-end mapping, fastqc report.

My apologies that I cannot help you more here. Ping @bgruening, any clue?

Best,

Joachim

joachimwolff commented 5 years ago

After some more discussions: can you map the reads, write all non-unique reads to a file, and investigate these via IGV? There should also be an option in the mapper to remove the non-unique reads.
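For bwa output, one way to pull non-unique reads out of a SAM text stream is to look at the MAPQ column and the XA:Z (alternative hits) tag. A hedged sketch, since the exact multi-mapping markers depend on the mapper and its settings:

```python
def is_multimapped(sam_line, mapq_cutoff=1):
    """Heuristic for bwa output: treat a read as multi-mapped if its MAPQ
    is below the cutoff or it carries an XA:Z (alternative hits) tag.
    XA:Z is a bwa convention, not part of the SAM standard."""
    fields = sam_line.rstrip("\n").split("\t")
    mapq = int(fields[4])                       # SAM column 5 is MAPQ
    has_alt = any(f.startswith("XA:Z:") for f in fields[11:])
    return mapq < mapq_cutoff or has_alt

unique = "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\tIIII"
multi  = "r2\t0\tchr1\t100\t0\t50M\t*\t0\t0\tACGT\tIIII"
print(is_multimapped(unique), is_multimapped(multi))   # False True
```

Filtering a SAM file through such a predicate (keeping header lines starting with '@') would give a BAM-convertible file of the ambiguous reads to inspect in IGV.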

abhisheksinghnl commented 5 years ago

Hi Joachim,

Thank you for your email.

I will give it a try.

Presently, I am trying mHiC: https://github.com/yezhengSTAT/mHiC

Have a look at the way the author uses bwa to map reads. Maybe that will make the difference.

Do give your inputs.

Best regards Abhishek


abhisheksinghnl commented 5 years ago

Hi Joachim,

I think I have figured out what might be happening.

Is it possible to change the definition of "informative reads", for example from between 2 kb and 10 kb on the same chromosome to all distances greater than 2 kb on the same chromosome?
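The redefinition being asked about amounts to a distance filter on intra-chromosomal pairs. A hypothetical sketch (this is not an existing HiCExplorer option; function and parameter names are made up):

```python
def is_informative(pos1, pos2, min_dist=2_000, max_dist=10_000):
    """Hypothetical 'informative pair' test for intra-chromosomal pairs:
    the original window keeps 2-10 kb; passing max_dist=None drops the
    upper bound so everything >= 2 kb qualifies."""
    d = abs(pos1 - pos2)
    if d < min_dist:
        return False
    return max_dist is None or d <= max_dist

print(is_informative(1_000, 50_000))                 # False (49 kb exceeds the 10 kb window)
print(is_informative(1_000, 50_000, max_dist=None))  # True (only the 2 kb floor applies)
```

In other words, only the upper bound of the window changes; the lower bound against self-ligation artefacts stays in place.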

How and at which step can I edit this?

Best regards Abhishek


jakevc commented 5 years ago

I am also struggling to understand why most experiments I QC return a very low "pairs used" (<20%). I pipeline HiCExplorer with Snakemake, so it's fairly reproducible. First the data are aligned using bowtie2, the only non-default parameter being --reorder. Then, here is my build-matrix rule:

rule build_matrix:
    input:
        "data/aligned/{sample}_R1.bam",  "data/aligned/{sample}_R2.bam"
    output:
        "data/buildMatrix/{sample}_10kb.h5"
    conda:
        "envs/hic.yml"
    threads: 16
    shell:
        "hicBuildMatrix --samFiles {input[0]} {input[1]} "
            "-bs 10000 "
            "--outBam data/buildMatrix/{wildcards.sample}_ref.bam "
            "--outFileName {output} "
            "--QCfolder data/buildMatrix/{wildcards.sample}_QC "
            "--threads {threads} "
            "--inputBufferSize 500000"

I have run the above rule on a number of datasets, with the dataset from Rao et al. 2014 having the highest fraction of "pairs used".

Best case avg 43% pairs used:

(image: hicexplorer_categorization_plot)

Worst case avg 17% pairs used:

(image: hicexplorer_categorization_plot)

It seems like it could be due to differences in Hi-C library generation, as I have not seen >20% used pairs with the multiple datasets generated using the [Arima Genomics HiC kit](https://arimagenomics.com/kit/).

LeilyR commented 5 years ago

You might need to check your mapping parameters; I would say you need to set a high gap-extension penalty. In fact, we already have a Hi-C Snakemake pipeline; you might be interested in running your data through it and comparing the results with yours. You can find it here: https://github.com/maxplanck-ie/snakepipes.git. We used bwa mem, though, and we also mapped each mate of a paired-end read separately, since the mates may come from distant regions of the genome.

joachimwolff commented 5 years ago

Please post the exact files you used. It is really difficult for us to tell whether your dataset is bad or there is an issue with HiCExplorer.