Open WYChen-penguin opened 7 months ago
The FATAL error seems justified, as apparently you are getting two different read IDs on consecutive lines:
A00977:664:H535JDSX7:3:1219:10402:31563_1:N:0:AAGACCGT+CAATCGAC
A00977:664:H535JDSX7:3:1219:10963:31563_1:N:0:AAGACCGT+CAATCGAC
Just to recap: this is genome-wide alignments without any subsequent filtering for specific regions or the like? In theory, the deduplication in paired-end mode should also fail in a similar way (I am currently unsure if it also checks if you specify -p
manually rather than let it detect the mode automatically (you could test this by running it once more without -p
:
deduplicate_bismark -o dedup/TNPS2_dedup.bam TNPS2/TNPS2_R1P_bismark_bt2_pe.bam
Given that you have a substantial amount of discrepancy between the OT and OB strands however, my guess would be that you ran out of memory during a later stage of the alignment process somehow. Maybe you could test whether only valid pairs are present in the original BAM file? You could also attempt to run the extractor on the un-deduplicated file to see it that succeeds (my guess is that it will also fail). If it does then it might have to do with the deduplication, did you get a complete deduplication report?
I would personally probably restart the alignments first thing (making sure you've got enough resources!), and go on the error hunt once that's under way.
I have submitted another run of alignment (a week ago) but it will take a long time to wait in line... and... yes it is a genome-wide alignment we will study certain regions, but we would like to discover unknown regions too.
Here is the deduplication report
Total number of alignments analysed in /align/TNPS2/TNPS2_R1P_bismark_bt2_pe.bam: 287153647
Total number duplicated alignments removed: 3739294 (1.30%)
Duplicated alignments were found at: 3543125 different position(s)
Total count of deduplicated leftover sequences: 283414353 (98.70% of total)
The overall pipeline I follow for the WGBS is: https://link.springer.com/protocol/10.1007/978-1-0716-2067-0_23
where it suggests a flag '-p' for pair-end.
Thanks for the extra information. If you leave out the -p
it will attempt to auto-detect whether it is single- or paired-end, and then set it automatically. I don't think the deduplication script keeps track of different read ID for subsequent lines as the methylation extractor does, so it might not produce the same error after all.
I think you might be able to come up with a quick one liner (in awk
?) that will compare the read IDs of consecutive lines and throw an error once they diverge. Should be quick to do....
Hi, I tried dedup step without '-p' which didn't work... the extractor is killed too. I am not sure if I could figure the awk line... I am a biologist with little coding skill.
Have you tried re-running the alignments? You might try running only a small amount of your files (e.g. -u 1000000
to align just the first 1 million read pairs. This would allow you to assess if you would expect a 50/50 split of the OT and OB strands. If you find a 50/50 split in such a test, I would not bother 'rescuing' the previous run, something went wrong somewhere.
If you are 'just a biologist' you could really find chatGPT very useful for such a task. I haven't tested it but in my experience it will either work out of the box, or require little refinement. I can of course understand if you would rather be able to do this yourself, or not at all. Just fyi, I asked the following:
write a python script that reads in a BAM file, and processes 2 lines at a time. It needs to compare read IDs of two consecutive lines, and die if the IDs are not identical. In that case, print the line numbers and both lines of the BAM file. Accept the BAM file as a command line argument, and use samtools to read it in.
import pysam
import sys
def read_bam_file(file_path):
try:
bamfile = pysam.AlignmentFile(file_path, 'rb')
prev_read = None
line_number = 0
for read in bamfile.fetch(until_eof=True):
line_number += 1
if line_number % 2 == 0: # Process two lines at a time
if prev_read.query_name != read.query_name:
print(f'Error: Read IDs do not match at lines {line_number-1} and {line_number}')
print(f'Line {line_number-1}: {prev_read}')
print(f'Line {line_number}: {read}')
sys.exit('Terminating due to non-matching read IDs.')
prev_read = read
except FileNotFoundError:
print('File not found. Please check the file path and try again.')
except Exception as e:
print(f'An error occurred: {e}')
if __name__ == '__main__':
if len(sys.argv) < 2:
print('Usage: python script.py <BAM file path>')
else:
read_bam_file(sys.argv[1])
I have read many closed issues with a similar problem: #211 #219 #360
I used the module already set up in the supercomputer (ver. 0.24.1).
Here is one of the FATAL errors: [FATAL ERROR:] The IDs of Read 1 (A00977:664:H535JDSX7:3:1219:10402:31563_1:N:0:AAGACCGT+CAATCGAC) and Read 2 (A00977:664:H535JDSX7:3:1219:10963:31563_1:N:0:AAGACCGT+CAATCGAC) are not the same. This might be the result of sorting the BAM files by chromosomal position or merging several files with Samtools sort, and this is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file by name using the command 'samtools sort -n'. Paired-end files may be merged properly (without risking this error) using either 'samtools merge -n' or 'samtools cat'.
I did "grep" the two reads: A00977:664:H535JDSX7:3:1219:10402:31563_1:N:0:AAGACCGT+CAATCGAC 99 chr5H 568314201 42 136M = 568314229 164 TTATTTTAATTTGTAATGATTTATTTATGGTTGTTTAATAAAAAAAATGTGAGTAATATAGAATTTATGGATATGAAAAGGTAAATTTGTTTTTGAGATGATTTTTATAGTGTTGTTGTTTGTTTTTATTTATTTA FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFF,:FF:FFF:FFFFFFFFFFFFFFFFFFF NM:i:24 MD:Z:1C4C2C10C0C3C7C19C2C7C0C20C2C2C10C0C0C12C3C2C0C1C4C0C1 XM:Z:.h....h..h..........hh...h.......h...................h..h.......hh....................x..h..x..........hhh............h...h..hh.h....hh. XR:Z:CT XG:Z:CT A00977:664:H535JDSX7:3:1219:10963:31563_1:N:0:AAGACCGT+CAATCGAC 147 chr3H 306704954 42 136M = 306704693 -397 TTGTTTGTATTAAACGTTATTTCGTAACTGGTTGATTATAAAGGTGTTTTACAGGTATTTCTGAGGGTGTTTTTTGGGTTGGTATGGATCGAGATTGGGATGTTTTTCTGTATGACAGAGAGGTATTTTTGGGTTT F,FF,FFFF:FF:::FFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FF:FF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFF NM:i:19 MD:Z:10C6C1C16C9C1C9C11C1C9C11C8C0C0C20C1C4C0C0C0 XM:Z:..........h...Z..h.h..Z....X........h.........h.h..X......h.X.........h.h.........h......Z....x........hhh.X.......X..........h.x....hhh XR:Z:GA XG:Z:CT
Thus, I guess the problem is similar to #211. However, I am doing at the whole genome scale so I can't crop a region to do so. This only happened to one of my four files. I went back to check the file after the aliment, I found this number discrepancy, which is much more than the others:
Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 127321563 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 159832084 ((converted) bottom strand)
I am not sure if this is the true cause or not so I need some suggestions to solve or bypass this...
command used: bismark --genome GP.fa -o TNPS2 -1 TNPS2_R1P.fastq -2 TNPS2_R2P.fastq --parallel 3 deduplicate_bismark -p -o dedup/TNPS2_dedup.bam TNPS2/TNPS2_R1P_bismark_bt2_pe.bam bismark_methylation_extractor --ignore_r2 2 --parallel 16 -o /Ext/TNPS2 --samtools_path dedup/TNPS2_dedup.bam