Job progress and diagnostics?

dthorburn commented 1 year ago

Hi, I have been stuck trying to get ESPRESSO_S to complete on my dataset. I have 52 GB of filtered ONT fastq data split across 9 samples. I am running ESPRESSO_S on an HPC with 32 CPUs and 512 GB of memory. It has been running for over 92 hours, which exceeds the HPC's 72-hour runtime limit, so I have to manually extend the runtime of the first job every 24 hours. I've also just started a second job with 128 CPUs and 2 TB of memory just in case.

I am setting the number of threads with the -T argument, but without any progress information I have no idea how many more resources I might need to provide. The memory and runtime so far are starting to make me suspect something else is going on. Is there any way I can diagnose potential issues?

The only output so far is a few of these errors: Thread 3 terminated abnormally: ...

Please let me know if you need any other information.

EricKutschera commented 1 year ago

The error handling in ESPRESSO needs some improvement. The error message "Thread 3 terminated abnormally" could happen for different reasons. If the log doesn't show more details around that error message, then my guess is that it is caused by this line: https://github.com/Xinglab/espresso/blob/v1.3.0-beta/src/ESPRESSO_S.pl#L865

That line is checking the chromosome name from the alignment against the chromosome names from the FASTA file given as -F or --fa. That --fa file might differ from what was used to align the reads. You could try editing that line to the following and then rerunning to get a better message:

die "did not find: $line[2]" if !exists $chr_seq_len_ref->{$line[2]};
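The same chromosome-name comparison can also be done outside ESPRESSO before starting a long run. The following is only a minimal sketch, not ESPRESSO's code: the function names are made up for illustration, it assumes a FASTA sequence name is the header text up to the first whitespace, and it expects SAM text lines (for example from samtools view -h). The name chr2L in the demo is a hypothetical mismatch.

```python
def fasta_names(fasta_lines):
    """Collect sequence names, assumed to be the text after '>' up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def missing_references(sam_lines, known_names):
    """Return reference names used by alignment records but absent from known_names."""
    missing = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # not a complete alignment record
            continue
        rname = fields[2]  # column 3 of a SAM record is the reference name
        if rname != "*" and rname not in known_names:
            missing.add(rname)
    return missing


if __name__ == "__main__":
    # Tiny in-memory stand-ins for a real FASTA file and SAM stream.
    fasta = [">AgamP4_2L some description", "ACGT", ">AgamP4_3R", "ACGT"]
    sam = [
        "@SQ\tSN:AgamP4_2L\tLN:4",
        "read1\t0\tAgamP4_2L\t1\t37\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t0\tchr2L\t1\t37\t4M\t*\t0\t0\tACGT\tIIII",  # hypothetical mismatch
    ]
    print(sorted(missing_references(sam, fasta_names(fasta))))
```

An empty result would suggest the names agree, in which case the abnormal termination probably has a different cause.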

When the thread terminates abnormally, the main ESPRESSO thread just continues to wait around for it to finish. Your run likely spent most of its 92 hours waiting for a thread that already terminated. For now you can end the job if you see a "Thread terminated abnormally" error in the log, but we will work on improving the error handling.
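Until the error handling improves, the log can be checked for that message so a stalled job is ended early instead of waiting out the runtime limit. This is a rough sketch under assumptions: the message format "Thread N terminated abnormally" is taken from the errors quoted in this thread, and the helper name crashed_threads is hypothetical.

```python
import re

# Message format assumed from the errors quoted in this thread.
ABNORMAL = re.compile(r"Thread (\d+) terminated abnormally")


def crashed_threads(log_text):
    """Return the sorted, de-duplicated thread numbers reported as terminated abnormally."""
    return sorted({int(match.group(1)) for match in ABNORMAL.finditer(log_text)})


if __name__ == "__main__":
    # Illustrative log text only; real logs will contain other lines as well.
    log = (
        "some earlier log line\n"
        "Thread 3 terminated abnormally: 171: ...\n"
        "Thread 7 terminated abnormally: 205: ...\n"
    )
    crashed = crashed_threads(log)
    if crashed:
        print("abnormal termination in threads:", crashed)
```

A wrapper script could call this on the job's log file periodically and cancel the job when the list is non-empty.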

dthorburn commented 1 year ago

The errors I am getting all look similar, but with different reads highlighted: Thread 3 terminated abnormally: 171: 39602367-ff37-496f-b20a-0d6cd65800bf 0 AgamP4_2L 74045 37 171S6=1X17=4D2=22D11=1X1=1I13=12D10=1X18=1X5=1D27=2D12=1D16=78S * 0 0 TAATGTACTTCGTTCGGTCTCTGGAAATTTGGGTGTTTGCTGATTAATAGCCATGACTTCTCGCAAAGGCAGAAAGTAGTCTTTCTGTTGGTGCTGATATTGCTTTGAGTTCAACTTAGCGTTCCGATTTGGGGATTCCAACCGGAGGAAGGAGAAACAGGTTTAAAAGAACGATTGTCTATAACCAATTGTCGTGCCAGCCAACACAGCAGAGCGTAACAAACAACGCTCTAAAACAACGTTTGGTTTTATGGTTAGCAATCTGATGTTAATTATTACTCAAAGTGTGCTTCCTCACAGTATACGTAAATTAAAAACTTGCGGGCGGCGGACTCTCCTCTGAAGATAGAGCGACAGGCAAGTGACTACTTTCTGCCTTTGCGAGAAGTCAT $$$$$&(()*667::'%%%$###$$%&&)),-)+//4*/*))+//*'&&$$%%$$%)+,0.*''%&),--,,38679989::AGCCCA?600//*())*+9322353.-''2**39;=;;;>;4+(()++-:9?BAAD648==<<<;879?:8:88:97666952/+('$$'-458;<@>????<788;>@D?****309;@ABA>A)(()1225@ADDFADEFEBGFCA<::<GFBA<=<>C44235{?CEACCCDAAACFGDDFEFFKHFDGHEEDD:A?@78<:.,124<:2853&%).1=A===ALH@{KAJIDDEJFA?@?<<<8:?>A>ADAC===>GEI>?9899;:<8878??2322699=?AEEFDD95+)1667+*2862'% NM:i:47 ms:i:101 AS:i:73 nn:i:0 tp:A:P cm:i:9 s1:i:79 s2:i:66 de:f:0.0738 rl:i:0

The names of chromosomes in the fasta and bam are correct, so I doubt that is the issue. Both previous jobs crashed on the same 9 reads, so I tried removing those reads from the bams. However, in the new job another 9 threads crashed, directing me to 9 new reads. It's odd that it's always 9 threads that crash, and always within the first few minutes of the job. This seems independent of the number of threads.

EricKutschera commented 1 year ago

Based on that error message the thread is failing at this line: https://github.com/Xinglab/espresso/blob/v1.3.0-beta/src/ESPRESSO_S.pl#L1108

For that read it calculates a read length of 171, but the sequence is actually 392 bases long. ESPRESSO is not expecting to see the '=' and 'X' CIGAR operations and doesn't handle them correctly: https://github.com/Xinglab/espresso/blob/v1.3.0-beta/src/ESPRESSO_S.pl#L1167 The error handling could definitely be improved to mention the unhandled CIGAR operations.
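The 392 figure can be reproduced from the CIGAR string in the error message by summing every operation that consumes read bases (M, I, S, = and X in the SAM specification). The sketch below is a standalone illustration with a hypothetical helper name; it is not ESPRESSO's code and does not try to reproduce ESPRESSO's own calculation.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume bases of the read (query) per the SAM specification.
QUERY_OPS = frozenset("MIS=X")


def query_length(cigar, ops=QUERY_OPS):
    """Sum the lengths of the CIGAR operations listed in ops."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op in ops)


if __name__ == "__main__":
    # CIGAR string copied from the error message quoted earlier in this thread.
    cigar = (
        "171S6=1X17=4D2=22D11=1X1=1I13=12D10=1X18=1X5=1D27=2D12=1D16=78S"
    )
    print(query_length(cigar))  # 392, matching the sequence length

    # Leaving out '=' and 'X' gives a smaller number, which shows why a parser
    # that only knows 'M' cannot reconcile the CIGAR with the sequence.
    print(query_length(cigar, ops=frozenset("MIS")))
```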

To get the current ESPRESSO code to run on your data, you could realign the reads with different parameters so that the aligner outputs M instead of the = and X CIGAR operations. Hopefully we can update the ESPRESSO code to handle those operations.

dthorburn commented 1 year ago

In case anyone else uses a mapper that uses SAM format v1.4+, there is a pretty simple solution from the BBMap Suite.

reformat.sh in=data.bam out=data_reformat.bam sam=1.3

The docs for sam=1.3 say: Set to 'sam=1.3' to convert '=' and 'X' cigar symbols (from sam 1.4+ format) to 'M'.
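For anyone curious what that conversion amounts to, here is a small sketch of rewriting a single CIGAR string. It is not BBMap's implementation, and the function name is hypothetical. Merging neighbouring runs into one M is a choice made in this sketch; whether reformat.sh merges them the same way is not something this thread confirms.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def to_sam13_cigar(cigar):
    """Replace '=' and 'X' with 'M' and merge neighbouring runs of the same operation."""
    if cigar == "*":  # '*' means no CIGAR is available; leave it untouched
        return cigar
    merged = []
    for length, op in CIGAR_OP.findall(cigar):
        if op in ("=", "X"):
            op = "M"
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)  # extend the previous run
        else:
            merged.append([int(length), op])
    return "".join(f"{length}{op}" for length, op in merged)


if __name__ == "__main__":
    print(to_sam13_cigar("5S6=1X17=4D2=3S"))  # 5S24M4D2M3S
```

The number of read bases covered by the CIGAR is unchanged by this rewrite, since '=', 'X' and 'M' all consume both read and reference bases.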
