Closed: abhisheksinghnl closed this issue 5 years ago.
Which command-line did you use to build the matrices and do the QC please?
0% dangling end and a lot of "same fragment" is extremely weird.
Did you do a test/shallow sequencing run before sequencing these hundred million reads? That is usually the best way to check that everything went fine on the wet-lab side, e.g. a 5-million-read run. QC usually doesn't differ much from the fully sequenced data of the same library.
This could be experimental, but having the command lines used would help. Did you try making a restriction fragment length matrix? It usually helps to better assess the quality of the libraries.
Hi,
here is the buildhicmatrix command
hicBuildMatrix --samFiles log9.bam log10.bam --QCfolder hicU2 -bs 25000 --outBam U2.bam --minDistance 300 --maxLibraryInsertSize 1000 --restrictionSequence GATC --outFileName Matrix.HiC.U2.cool --minMappingQuality 5 --skipDuplicationCheck --threads 8 -ze hicU2.err -zmem 700G -zslot 8 -za
and for QC
hicQC --logfiles /hicT1/QC.log /hicT2/QC.log /hicU1/QC.log /hicU2/QC.log --labels "T1" "T2" "U1" "U2" --outputFolder QC_all_samples
Please try the following:
hicBuildMatrix --samFiles log9.bam log10.bam --QCfolder hicU2 --outBam U2.bam --restrictionSequence GATC --restrictionCutFile GATC.bed --outFileName Matrix.HiC.U2.cool --skipDuplicationCheck --threads 8 -ze hicU2.err -zmem 700G -zslot 8 -za
hicQC --logfiles /hicT1/QC.log /hicT2/QC.log /hicU1/QC.log /hicU2/QC.log --labels "T1" "T2" "U1" "U2" --outputFolder QC_all_samples
To get GATC.bed please use findRestSite:
https://hicexplorer.readthedocs.io/en/latest/content/tools/findRestSite.html#findrestsite
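Conceptually, findRestSite scans the genome FASTA for the recognition sequence and writes a BED file of the cut-site positions. A minimal Python sketch of that idea, with an illustrative function name and a toy sequence (this is not HiCExplorer's actual implementation):

```python
import re

def find_rest_sites(chrom, sequence, pattern="GATC"):
    """Scan one chromosome sequence for a restriction recognition
    sequence and return BED-style (chrom, start, end) intervals
    (0-based, end-exclusive, as in BED)."""
    sites = []
    for match in re.finditer(pattern, sequence.upper()):
        sites.append((chrom, match.start(), match.end()))
    return sites

# Toy example: two GATC sites in a short sequence.
sites = find_rest_sites("chr1", "ttGATCaaaaGATCcc")
print(sites)  # [('chr1', 2, 6), ('chr1', 10, 14)]
```

Each interval could then be written out as one tab-separated BED line per site.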
By the way, GATC is the correct sequence recognised by the restriction enzyme used in the protocol right (DpnII)?
Usually the fraction of pairs used is 0.50 (50%).
Hi,
Thanks for the edit to the command. I will try it and get back to you.
For the second question: yes, the RE is DpnII.
Fingers crossed.
Hi,
Without giving the bin size I ran into memory issues. Even 700 GB was not enough :(
Can you try to split the bam files in half, just for the sake of testing?
Hi,
I used the following command and increased the memory to 1000 GB:
hicBuildMatrix --samFiles log1.bam log2.bam --QCfolder hicT1 --outBam T1.bam --restrictionSequence GATC --restrictionCutFile rest_site_positions.bed --danglingSequence GATC --outFileName New.Matrix.HiC.T1.cool --threads 8
I got the following error at the end:
INFO:hicmatrix.HiCMatrix:Only init object, no matrix given.
Traceback (most recent call last):
File "/tools/eb/software/Miniconda3/4.4.10/envs/hicexplorer_new/bin/hicBuildMatrix", line 7, in <module>
main()
File "/tools/eb/software/Miniconda3/4.4.10/envs/hicexplorer_new/lib/python3.6/site-packages/hicexplorer/hicBuildMatrix.py", line 1381, in main
if args.outFileName.name.endswith('.cool') and args.binSize is not None and len(args.binSize) > 2:
TypeError: object of type 'int' has no len()
What does this indicate, and how can I resolve it?
Please guide.
Thank you.
Hi,
you can decrease the number of cores to 4 - 6; if I/O is not really fast there will be no benefit and the extra threads just waste memory. Moreover, to speed up processing and save memory, don't use the --outBam parameter.
The error you get is caused by a bug on our side. I will try to provide a patched version today. However, if you don't want to wait, you can run, as you did in the beginning, with a fixed --binSize resolution instead of restriction enzyme cut level.
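For reference, the traceback above is the classic result of calling len() on an int: when a single bin size is supplied, the check can receive a plain int rather than a list of sizes. A hedged sketch of the failure and the kind of isinstance guard a patch would need (variable and function names are mine, not the actual hicBuildMatrix code):

```python
bin_size = 25000  # a single bin size parsed as a plain int

# The buggy check: len() is undefined for int, hence the TypeError.
try:
    len(bin_size) > 2
except TypeError as err:
    print(err)  # object of type 'int' has no len()

# A guarded version that tolerates either an int or a list of sizes.
def is_multi_resolution(bin_size):
    return isinstance(bin_size, (list, tuple)) and len(bin_size) > 2

print(is_multi_resolution(25000))                 # False
print(is_multi_resolution([5000, 10000, 25000])) # True
```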
Best,
Joachim
Hi Joachim,
thanks for the explanation.
The primary question still remains unanswered: what could be the reason for such a small percentage of used pairs? Is it because chimeric reads are being treated as PCR duplicates, or because of multimapping of reads? If so, how can I fix it?
My guess is that allowing multimapped reads might solve my issue (I have a hexaploid plant genome). Could you please let me know how to allow multimapped reads, or at least give a guess?
Best regards Abhishek
I think it is more that in a lot of read pairs the restriction enzyme site cannot be detected, which explains why you have a lot of "same fragment" (meaning your read pair is composed of only 1 locus in the genome instead of 2 loci separated by GATC). What is the average fragment length of the library (Bioanalyzer profile)? What is the read length?
You also have a lot of low-quality reads; I think it has to do with your --minMappingQuality 5 threshold. Try increasing it or keeping the default parameters.
You also have a lot of multimapped reads, which is a problem since HiCExplorer only keeps uniquely mapped reads.
If you take already published fastq files and apply the same command-lines, what do you obtain?
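To make the "same fragment" category concrete, the core of it is a distance test on the two mates. A simplified sketch, assuming mate chromosomes and positions are available per pair and using the 1000 bp --maxLibraryInsertSize value from the commands above (the function is illustrative, not HiCExplorer's code, which also checks read orientation and restriction sites):

```python
def classify_pair(chrom1, pos1, chrom2, pos2, max_insert=1000):
    """Rough classification of a Hi-C read pair.

    Mates mapping to the same chromosome closer together than the
    expected library insert size likely come from a single un-ligated
    fragment ("same fragment"); everything else is a candidate contact.
    """
    if chrom1 == chrom2 and abs(pos2 - pos1) < max_insert:
        return "same_fragment"
    return "contact"

print(classify_pair("chr1", 10000, "chr1", 10400))    # same_fragment
print(classify_pair("chr1", 10000, "chr1", 2000000))  # contact
print(classify_pair("chr1", 10000, "chr2", 10400))    # contact
```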
Hi Richard,
thank you for your reply.
Okay, here is my plan and I need your help for second part.
I will modify the --minMappingQuality from 5 to 15.
How can I allow multimapped reads?
Hi,
after some discussions in our team:
Is it because of the chimeric reads being treated as PCR duplicates or multimapping of reads? If yes, how can I fix it?
The whole Hi-C method works only because of chimeric reads. We consider something a PCR duplicate as soon as two read pairs map to identical positions, which is quite unlikely to happen by chance. The classification of a multimapped read is not done by HiCExplorer; it is a decision made by your mapper. We think the mapping step is the one that goes wrong, or the data contains errors and is of bad quality.
The second big issue in T2, U1 and U2 is 'same_fragment': we consider one chimeric read as one fragment if its mapped positions differ by less than --maxLibraryInsertSize, which you set to 1000 base pairs and is our suggestion. We don't know why you have that many reads which are more or less non-chimeric. We strongly recommend testing other, already published data from the same organism with HiCExplorer, to figure out whether the error is caused by HiCExplorer, whether data from a genome like yours behaves this way, or whether something went wrong creating this data.
Best,
Joachim
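The PCR-duplicate rule stated above (a pair is a duplicate as soon as both mates map to positions already seen) can be sketched as a simple set-based deduplication. This illustrates the stated rule only; it is not HiCExplorer's implementation:

```python
def remove_pcr_duplicates(pairs):
    """Keep the first occurrence of each read pair; a pair is treated
    as a PCR duplicate if both mates map to exactly the same positions
    as an already-seen pair."""
    seen = set()
    kept = []
    for chrom1, pos1, chrom2, pos2 in pairs:
        key = (chrom1, pos1, chrom2, pos2)
        if key in seen:
            continue  # identical mapping positions -> PCR duplicate
        seen.add(key)
        kept.append(key)
    return kept

pairs = [("chr1", 100, "chr2", 5000),
         ("chr1", 100, "chr2", 5000),  # exact duplicate -> dropped
         ("chr1", 101, "chr2", 5000)]  # shifted by 1 bp -> kept
print(len(remove_pcr_duplicates(pairs)))  # 2
```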
Hi Joachim,
Just to make it clear, I have used following parameters for alignment using bwa mem:
bwa mem -A1 -B4 -E50 -L0
Would you suggest any additional parameters or tweak in the present ones?
Thank you
Hi Abhishek,
please try different mappers like bowtie or hisat and use their --reorder parameter. I don't have experience with plant genomes; maybe you need to adjust the parameters there. I don't know how well the three mappers we tested (bowtie, hisat and BWA) work for plant genomes, maybe they don't work at all for them, or there is something special you need to keep in mind when creating the reference genome. Please also check all the trivial issues: correct reference genome version, single-end mapping, fastqc report.
My apologies that I cannot help you more here. Ping @bgruening, any clue?
Best,
Joachim
After some more discussions: Can you map the reads, write all non-unique reads to a file and investigate these via IGV? There should also be the option in the mapper to remove the non-unique reads.
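One way to split uniquely and multiply mapped reads for the inspection suggested above is to look at the mapper's optional tags: bowtie2, for instance, emits an XS:i tag when a second-best alignment score exists. A self-contained sketch that partitions plain SAM text lines on that signal (whether this criterion applies depends on your mapper and its settings):

```python
def split_by_uniqueness(sam_lines):
    """Split SAM alignment records into uniquely and multiply mapped
    read names, using the presence of an XS:i optional tag (bowtie2
    adds it when a secondary alignment score exists)."""
    unique, multi = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if any(f.startswith("XS:i:") for f in fields[11:]):
            multi.append(fields[0])   # has a second-best alignment
        else:
            unique.append(fields[0])
    return unique, multi

# Toy SAM records: readA maps uniquely, readB has an XS:i tag.
sam = [
    "@HD\tVN:1.6",
    "readA\t0\tchr1\t100\t42\t50M\t*\t0\t0\tACGT\tFFFF\tAS:i:0",
    "readB\t0\tchr1\t200\t1\t50M\t*\t0\t0\tACGT\tFFFF\tAS:i:0\tXS:i:0",
]
unique, multi = split_by_uniqueness(sam)
print(unique, multi)  # ['readA'] ['readB']
```

The multimapped names could then be written to a file and the corresponding loci inspected in IGV, as suggested.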
Hi Joachim,
Thank you for your email.
I will give it a try.
Presently, I am trying mHiC (https://github.com/yezhengSTAT/mHiC); have a look at the way the author uses bwa to map reads. Maybe that will make the difference.
Please do give your inputs.
Best regards Abhishek
Hi Joachim,
I think I have figured out what might be happening.
Is it possible to change the definition of "informative reads", for example from "between 2 kb and 10 kb on the same chromosome" to "all distances higher than 2 kb on the same chromosome"?
How, and at which step, can I edit this?
Best regards Abhishek
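The redefinition being asked about can be phrased as a simple predicate over a read pair. A sketch of that filter, assuming mate chromosomes and positions are available per pair (the function is hypothetical, not an existing HiCExplorer option):

```python
def is_informative(chrom1, pos1, chrom2, pos2, min_distance=2000):
    """Keep only intra-chromosomal pairs whose mates map more than
    min_distance apart, with no upper bound (i.e. 2 kb up to the
    chromosome end instead of the 2 kb - 10 kb window)."""
    return chrom1 == chrom2 and abs(pos2 - pos1) > min_distance

print(is_informative("chr1", 1000, "chr1", 2500))       # False (1.5 kb)
print(is_informative("chr1", 1000, "chr1", 5000000))    # True
print(is_informative("chr1", 1000, "chr2", 5000000))    # False (trans)
```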
I am also struggling to understand why most experiments I QC return a very low "pairs used" (<20%). I pipeline HiCExplorer with snakemake, so it's fairly reproducible. First the data are aligned using bowtie2, the only non-default parameter being --reorder. Then, here is my build-matrix rule:
rule build_matrix:
    input:
        "data/aligned/{sample}_R1.bam",
        "data/aligned/{sample}_R2.bam"
    output:
        "data/buildMatrix/{sample}_10kb.h5"
    conda:
        "envs/hic.yml"
    threads: 16
    shell:
        "hicBuildMatrix --samFiles {input[0]} {input[1]} "
        "-bs 10000 "
        "--outBam data/buildMatrix/{wildcards.sample}_ref.bam "
        "--outFileName {output} "
        "--QCfolder data/buildMatrix/{wildcards.sample}_QC "
        "--threads {threads} "
        "--inputBufferSize 500000"
I have run the above code on a number of datasets, with the dataset from Rao et al. 2014 having the highest fraction of "pairs used".
Best case, avg 43% pairs used:
Worst case, avg 17% pairs used:
It seems like it could be due to differences in Hi-C library generation, as I have not seen >20% used pairs with the multiple datasets generated using the Arima Genomics HiC kit (https://arimagenomics.com/kit/).
You might need to check your mapping parameters; I would say you need to set a high gap extension penalty. In fact, we already have a Hi-C snakemake pipeline; you might be interested in running your data through it and comparing the results with yours. You can find it here: https://github.com/maxplanck-ie/snakepipes.git. We used bwa mem though, and we also mapped each mate of a paired-end read separately, since the mates may come from distant regions of the genome.
Please post the exact files you used. It is really difficult for us to tell whether your dataset is bad or there is an issue with HiCExplorer.
Hi,
I have mapped and built the Hi-C matrix as suggested in the manual, and when I run the QC I get a very low percentage of pairs used.
I was just wondering what the possible reason for this could be.
Is it because chimeric reads are also being treated as PCR duplicates, or is something else technical leading to this large number of unmapped reads / low percentage of mapped reads?
Please help, as I need to explain this to my fellow wet-lab colleagues. They strongly believe it is a computational issue, while I believe it is more of an experimental mix-up.
Please find the table in the attachment.
Thank you