deeptools / HiCExplorer

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
https://hicexplorer.readthedocs.org
GNU General Public License v3.0
231 stars 70 forks source link

Sorting error with bwa Hi-C mode aligned files #777

Open Norbittner opened 2 years ago

Norbittner commented 2 years ago

Hi all,

i have a bit of a problem when using a modified approach of aligning my Hi-C data with the bwa-mem algorithm.

I'm using HiCExplorer version 3.7.2 and python 3.8.0

When aligning the paired sequences separate with the settings as proposed in your documentation everything runs fine.

However we want to align our sequences using the "Hi-C" option -5 for our data.

bwa mem -t 32 -5SP /../hg38.p13.fa.gz /../L79851_Track-123454_R1.fastq.gz /../L79851_Track-123454_R2.fastq.gz 2> L79851_Track-123454.bwa.log | samtools view -S -h -b -F2316 > L79851_Track-123454.aligned.bam

After the alignment i split the bam file with the following code

samtools view -h -f 0x40 /../../L79851_Track-123454.aligned.bam -o L79851_Track-123454.aligned.unsorted.R1.bam samtools view -h -f 0x80 /../../L79851_Track-123454.aligned.bam -o L79851_Track-123454.aligned.unsorted.R2.bam

these files are then sorted with the -n flag of samtools sort.

When trying to run hiCbuildmatrix with the same command that already worked

hicBuildMatrix --samFiles L79851_Track-123454.aligned.R1.sort.bam L79851_Track-123454.aligned.R2.sort.bam --outFileName L79851_Track-123454.sort.contactMatrix.h5 --QCfolder QC.L79851_Track-123454 --restrictionCutFile ../../../../hicFindRestSite.build38.bed --restrictionSequence GATC GANTC CTNAG TTAA --danglingSequence GATC ANT TNA TA

I'm getting an assertion error

`[E::idx_find_and_load] Could not retrieve index file for 'L79851_Track-123454.aligned.R1.sort.bam' [E::idx_find_and_load] Could not retrieve index file for 'L79851_Track-123454.aligned.R2.sort.bam' INFO:hicexplorer.hicBuildMatrix:dangling sequences to check are {'GATC': {'pat_forw': 'GATC', 'pat_rev': 'GATC'}, 'GANTC': {'pat_forw': 'ANT', 'pat_rev': 'ANT'}, 'CTNAG': {'pat_forw': 'TNA', 'pat_rev': 'TNA'}, 'TTAA': {'pat_forw': 'TA', 'pat_rev': 'TA'}}

Traceback (most recent call last): File "/opt/conda/bin/hicBuildMatrix", line 7, in main() File "/opt/conda/lib/python3.8/site-packages/hicexplorer/hicBuildMatrix.py", line 1299, in main one_mate_lowquality, iternum = readBamFiles(pFileOneIterator=str1, File "/opt/conda/lib/python3.8/site-packages/hicexplorer/hicBuildMatrix.py", line 719, in readBamFiles assert mate1.qname == mate2.qname, "FATAL ERROR {} {} " \ AssertionError: FATAL ERROR A01182:63:HKW5YDSX2:2:1101:1000:2127 A01182:63:HKW5YDSX2:2:1101:1000:2096 Be sure that the sam files have the same read order If using Bowtie2 or Hisat2 add the --reorder option`

When checking the sequence order of the files with the bash comm command there are no differing lines. The sequences in the Error message are at the beginning of each of the BAM files and are at the same position.

Head output of R1 and R2 headR1sort.txt headR2sort.txt are attached.

Issues regarding this error were already posted and closed (#349 , #387 , #562 , #703 and #673) . For the last two ones there is no solution posted as there was personal communication.

I'm looking forward to your comments.

All the best Norbert

PS: In your documentation https://hicexplorer.readthedocs.io/en/latest/content/tools/hicFindRestSite.html you could update that multiple restriction enzymes now work, i found it here in an issue on GitHub that the function is already implemented.

joachimwolff commented 2 years ago

Dear Norbert,

Thanks a lot for the detailed report and feedback. It looks to me I need to dig a bit deeper to come to a satisfying solution and or explanation.

I will try to comeback to you as fast as possible.

Best,

Joachim

Norbittner commented 2 years ago

Hi @joachimwolff

the problem with the sorting error is now solved. In parallel to the Juicer pipeline i used the above mentioned code but added the -M flag (which is marking shorter splits as secondary according to the bwa manual - used for Picard) now running the alignment with the following code

bwa mem -t 32 -5SPM /../hg38.p13.fa.gz /../L79851_Track-123454_R1.fastq.gz /../L79851_Track-123454_R2.fastq.gz > L79851_Track-123454.aligned.bam

Following this the aligned file is sorted like in Juicer with

samtools sort -t cb -n L79851_Track-123454.aligned.bam > L79851_Track-123454.aligned.sorted.bam

and then split into two files again with the code used in the initial post

samtools view -h -f 0x40 /../../L79851_Track-123454.aligned.bam -o L79851_Track-123454.aligned.sorted.R1.sam samtools view -h -f 0x80 /../../L79851_Track-123454.aligned.bam -o L79851_Track-123454.aligned.sorted.R2.sam

Using this approach with the "Hi-C mode" of bwa then works - HiCBuildMatrix runs and starts fine without any error.

(Maybe something for another issue: The results of the QC differ greatly between Juicer and HiCExplorer although using the same alignment file, any explanation on that ?)

All the best and greets Norbert

AkaneKawaguchi commented 1 month ago

**Please help! In last three months, I have been in trouble with the same issue as below (as well as https://github.com/deeptools/HiCExplorer/issues/#703, #777). It worked before with other HiC datasets, and now hicbuildmatrics does not work anymore with any data-set.

hicBuildMatrix --samFiles read1_sorted_by_n.bam read2_sorted_by_n.bam \ --outFileName sample.h5 \ --QCfolder sample \ --restrictionCutFile Dpn2_positions.bed \ --restrictionSequence GATC \ --danglingSequence GATC \ --outBam sample.bam \ --minMappingQuality 15 \ --threads 20 INFO:hicexplorer.hicBuildMatrix:Minimum distance considered between restriction sites is 300 Max distance: 1000

INFO:hicexplorer.hicBuildMatrix:dangling sequences to check are {'GATC': {'pat_forw': 'GATC', 'pat_rev': 'GATC'}}

Traceback (most recent call last): File "/home/my_name/.conda/envs/HiCexp/bin/hicBuildMatrix", line 7, in main() File "/home/my_name/.conda/envs/HiCexp/lib/python3.8/site-packages/hicexplorer/hicBuildMatrix.py", line 1290, in main one_mate_lowquality, iternum = readBamFiles(pFileOneIterator=str1, File "/home/my_name/.conda/envs/HiCexp/lib/python3.8/site-packages/hicexplorer/hicBuildMatrix.py", line 710, in readBamFiles assert mate1.qname == mate2.qname, "FATAL ERROR {} {} " \ AssertionError: FATAL ERROR LH00504:25:22KVL7LT3:7:1101:1148:21124 LH00504:25:22KVL7LT3:7:1101:1148:22470 Be sure that the sam files have the same read order If using Bowtie2 or Hisat2 add the --reorder optionAssertionError: FATAL ERROR LH00504:26:22MMKMLT3:1:1101:16644:1096 LH00504:26:22MMKMLT3:1:1101:17051:1096 Be sure that the sam files have the same read order If using Bowtie2 or Hisat2 add the --reorder option

I used bowtie2 to map paired HiC reads as a single read file with --reorder option: for i in 1 2 ; do bowtie2 -x genome_index -U my read"$i".fq -S read"$i".sam --reorder ; done for i in 1 2 ; do samtools view -Shb read"$i".sam > read"$i".bam ; done *then sorted by -n: for i in 1 2 ; do samtools sort -n -m 4G -o read"$i"_sorted_by_n.bam read"$i".bam ; done

joachimwolff commented 1 month ago

Before I dive deep into the issue, you write you use HiCExplorer version 3.6. Please try the latest 3.7.5.

AkaneKawaguchi @.***> schrieb am Di. 13. Aug. 2024 um 12:57:

Please help! In last three months, I have been in trouble with the same issue as below (as well as https://github.com/deeptools/HiCExplorer/issues/#703, #777 https://github.com/deeptools/HiCExplorer/issues/777). It worked before with other HiC datasets, and now hicbuildmatrics does not work anymore with any data-set.

hicBuildMatrix --samFiles read1_sorted_by_n.bam read2_sorted_by_n.bam --outFileName sample.h5 --QCfolder sample --restrictionCutFile Dpn2_positions.bed --restrictionSequence GATC --danglingSequence GATC --outBam sample.bam --minMappingQuality 15 --threads 20 INFO:hicexplorer.hicBuildMatrix:Minimum distance considered between restriction sites is 300 Max distance: 1000

INFO:hicexplorer.hicBuildMatrix:dangling sequences to check are {'GATC': {'pat_forw': 'GATC', 'pat_rev': 'GATC'}} Traceback (most recent call last): File "/home/my_name/.conda/envs/HiCexp/bin/hicBuildMatrix", line 7, in main() File "/home/my_name/.conda/envs/HiCexp/lib/python3.8/site-packages/hicexplorer/hicBuildMatrix.py", line 1290, in main one_mate_lowquality, iternum = readBamFiles(pFileOneIterator=str1, File "/home/my_name/.conda/envs/HiCexp/lib/python3.8/site-packages/hicexplorer/hicBuildMatrix.py", line 710, in readBamFiles assert mate1.qname == mate2.qname, "FATAL ERROR {} {} " AssertionError: FATAL ERROR LH00504:25:22KVL7LT3:7:1101:1148:21124 LH00504:25:22KVL7LT3:7:1101:1148:22470 Be sure that the sam files have the same read order If using Bowtie2 or Hisat2 add the --reorder optionAssertionError: FATAL ERROR LH00504:26:22MMKMLT3:1:1101:16644:1096 LH00504:26:22MMKMLT3:1:1101:17051:1096 Be sure that the sam files have the same read order If using Bowtie2 or Hisat2 add the --reorder option

I used bowtie2 to map paired HiC reads as a single read file with --reorder option: for i in 1 2 ; do bowtie2 -x genome_index -U my read"$i".fq -S read"$i".sam --reorder ; done for i in 1 2 ; do samtools view -Shb read"$i".sam > read"$i".bam ; done *then sorted by -n: for i in 1 2 ; do samtools sort -n -m 4G -o read"$i"_sorted_by_n.bam read"$i".bam ; done

  • I also tried to use without sort -n, or just sort (without -n option).

bowtie2 2.5.4 Python 3.8.19 hicexplorer 3.6 I used mamba to install HiCExplorer.

— Reply to this email directly, view it on GitHub https://github.com/deeptools/HiCExplorer/issues/777#issuecomment-2285962347, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADGQCABNDD3ULN7S6UHQUA3ZRHRBRAVCNFSM6AAAAABMOAMQO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBVHE3DEMZUG4 . You are receiving this because you were mentioned.Message ID: @.***>

AkaneKawaguchi commented 1 month ago

Dear Dr. Wolff and team

Thank you so much for your quick response! I had just tested buildmatrics with 3.7.5.

$mamba install hicexplorer=3.7.5 -c bioconda -c conda-forge

Upgrade: ────────────────────────────────────────────────────────────────────────────

ca-certificates 2024.6.2 hbcca054_0 installed
ca-certificates 2024.7.4 hbcca054_0 conda-forge/linux-64 Cached certifi 2024.2.2 pyhd8ed1ab_0 installed
certifi 2024.7.4 pyhd8ed1ab_0 conda-forge/noarch Cached hicexplorer 3.7.4 pyhdfd78af_0 installed
hicexplorer 3.7.5 pyhdfd78af_0 bioconda/noarch 2 MB ────────────────────────────────────────────────────────────────────────────

then, here I just got the same comments:

hicBuildMatrix --samFiles no1_1.clean_N.bam no1_2.clean_N.bam --outFileName HiC_no1_bowtie2-re-rder-local_N_bs0.h5 --QCfolder HiC_no1_bowtie2-re-rder-local_N_bs0 --restrictionCutFile genomic_Dpn2_positions.bed --restrictionSequence GATC --danglingSequence GATC --outBam no1_bowtie2-re-rder-local_N_bs0.bam --minMappingQuality 15 --threads 18

$ INFO:hicexplorer.hicBuildMatrix:Minimum distance considered between restriction sites is 300 Max distance: 1000

INFO:hicexplorer.hicBuildMatrix:dangling sequences to check are {'GATC': {'pat_forw': 'GATC', 'pat_rev': 'GATC'}}

Traceback (most recent call last): File "/home/myname/.conda/envs/HiCexp_new/bin/hicBuildMatrix", line 7, in main() File "/home/myname/.conda/envs/HiCexp_new/lib/python3.10/site-packages/hicexplorer/hicBuildMatrix.py", line 1299, in main one_mate_lowquality, iternum = readBamFiles(pFileOneIterator=str1, File "/home/myname/.conda/envs/HiCexp_new/lib/python3.10/site-packages/hicexplorer/hicBuildMatrix.py", line 719, in readBamFiles assert mate1.qname == mate2.qname, "FATAL ERROR {} {} " \ AssertionError: FATAL ERROR LH00504:25:22KVL7LT3:7:1101:1148:21124 LH00504:25:22KVL7LT3:7:1101:1148:22470 Be sure that the sam files have the same read order If using Bowtie2 or Hisat2 add the --reorder option

Would you find the solution for that?

Many thanks and best wishes, Akane