DiltheyLab / HLA-LA

Fast HLA type inference from whole-genome data
GNU General Public License v3.0

Very slow process (5 hours) #38

Open lmtani opened 4 years ago

lmtani commented 4 years ago

Hello,

I'm using HLA-LA and everything seems to be fine: so far, we have processed 50 exomes. We ran tests with NA12878, NA12155, NA19128, NA11892, NA19127, NA19700 and NA12400 (samples sequenced here with Illumina), and the predictions at 4-digit resolution were good (HLA-DRB3/4 were the most problematic, as described in the README of this repo).

The problem is that one sample does not finish processing. When I inspected its stdout, I found that a single step took ~5 hours, as pasted below:

processBAM::alignReads_postSeedExtraction_andStoreInto(): Deal 10000 total read pairs with seeds, of which 1676 are incomplete.
 [ Sun Mar  8 00:36:27 2020 ] Read pair 0 of 8324
 [ Sun Mar  8 00:37:22 2020 ] .. done. Processed  112339 in total (total read IDs: 516844.
 [ Sun Mar  8 00:37:22 2020 ] Process 14/51
 [ Sun Mar  8 00:37:22 2020 ]   Start seed extraction
 [ Sun Mar  8 00:37:36 2020 ]           Done extractSeeds2
processBAM::extractSeeds2(): examined 105480 reads, transformed into 10000 seeds.
 [ Sun Mar  8 00:37:36 2020 ]   Alignment
 [ Sun Mar  8 00:37:36 2020 ] Proto-seed statistics:
    Incomplete: 1679
    Complete: 8321
        Average chains per read: 5.44021
        Average chain length: 98.9056
        Average primary chain length: 99.1694

processBAM::alignReads_postSeedExtraction_andStoreInto(): Deal 10000 total read pairs with seeds, of which 1679 are incomplete.
 [ Sun Mar  8 00:37:36 2020 ] Read pair 0 of 8321
 [ Sun Mar  8 05:41:04 2020 ] .. done. Processed  120660 in total (total read IDs: 516844.
 [ Sun Mar  8 05:41:04 2020 ] Process 15/51
 [ Sun Mar  8 05:41:04 2020 ]   Start seed extraction
 [ Sun Mar  8 05:41:18 2020 ]           Done extractSeeds2
processBAM::extractSeeds2(): examined 102128 reads, transformed into 10000 seeds.
 [ Sun Mar  8 05:41:18 2020 ]   Alignment
 [ Sun Mar  8 05:41:18 2020 ] Proto-seed statistics:
    Incomplete: 1722
    Complete: 8278
        Average chains per read: 5.25562
        Average chain length: 99.2039
        Average primary chain length: 99.1174

Has anybody seen this problem before?

Here is the whole stdout file. I killed the process because it was not making progress.

stdout.txt
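To locate slow steps in a long stdout file like this one, the gaps between consecutive timestamps can be computed automatically. The following is only a minimal sketch: the function name `slow_steps` and the 10-minute threshold are made up for illustration and are not part of HLA-LA, and the timestamp format is assumed to match the lines pasted above.

```python
import re
from datetime import datetime

# Matches the bracketed timestamps seen in the log, e.g. "[ Sun Mar  8 00:37:36 2020 ]".
STAMP = re.compile(r"\[ (\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4}) \]\s*(.*)")


def slow_steps(lines, threshold_seconds=600):
    """Return (elapsed_seconds, message) for each timestamped line that is
    followed by a gap of at least threshold_seconds (hypothetical helper)."""
    stamped = []
    for line in lines:
        m = STAMP.search(line)
        if m:
            # Collapse repeated spaces (single-digit days are space-padded).
            text = " ".join(m.group(1).split())
            # Timestamps are treated as naive local times (no time zone handling).
            when = datetime.strptime(text, "%a %b %d %H:%M:%S %Y")
            stamped.append((when, m.group(2).strip()))

    gaps = []
    for (t0, msg), (t1, _) in zip(stamped, stamped[1:]):
        elapsed = (t1 - t0).total_seconds()
        if elapsed >= threshold_seconds:
            gaps.append((elapsed, msg))
    return gaps


if __name__ == "__main__":
    with open("stdout.txt") as handle:  # the attached log file
        for elapsed, msg in slow_steps(handle):
            print(f"{elapsed / 3600:.2f} h after: {msg}")
```

Applied to the excerpt above, this reports the gap that follows "Read pair 0 of 8321" in batch 14/51, which is the roughly 5-hour alignment step.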

Software versions:

hla_la_version 1.0.1
samtools_version 1.9
bwa_version 0.7.17
picard_version 1.123
bamtools_version 2.5.1

Edit: the analysis eventually finished successfully after 22 hours.

AlexanderDilthey commented 3 years ago

Hi @lmtani,

It seems that there is quite a bit of mapping uncertainty in your data - the value of ~5 in Average chains per read: 5.25562 is relatively high. Do the reads in this sample have the same length and insert size as in your other samples?
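To compare this statistic across batches or samples, the values can be pulled out of each stdout file. This is a rough sketch that assumes the log line format shown in the excerpt above; `chains_per_read` is a hypothetical helper, and no particular cutoff for "high" is implied.

```python
import re

# Matches log lines such as "Average chains per read: 5.25562".
CHAINS = re.compile(r"Average chains per read:\s*([0-9.]+)")


def chains_per_read(lines):
    """Collect every 'Average chains per read' value found in the given log lines."""
    values = []
    for line in lines:
        m = CHAINS.search(line)
        if m:
            values.append(float(m.group(1)))
    return values


if __name__ == "__main__":
    with open("stdout.txt") as handle:  # one stdout file per sample
        values = chains_per_read(handle)
    if values:
        print(f"batches: {len(values)}, mean: {sum(values) / len(values):.3f}, max: {max(values):.3f}")
```

Running the same script on a sample that finished quickly gives a baseline to compare against.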

lmtani commented 3 years ago

@AlexanderDilthey thank you for the reply.

Yes, all reads have the same length (paired-end, 100 bp) and the same mean insert size (~230). Please let me know if you need more logs.

lmtani commented 1 year ago

Hello again, just to let you know that the error still occurs sometimes, but I've already processed thousands of exomes without any other problem.

I'm using machines with 42 GB of memory and 2 CPUs. The exit code when it fails is 137. Maybe more memory is necessary 🤔
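For reference, shells conventionally report exit statuses above 128 as 128 plus the number of the signal that ended the process, so 137 corresponds to signal 9 (SIGKILL). On Linux this is often, though not always, the kernel's out-of-memory killer; a job scheduler or a manual `kill -9` produces the same code. A small sketch for decoding such statuses (`describe_exit_code` is a made-up helper name, and the signal names are those of the platform running the script):

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit status (values above 128 mean 128 + signal number)."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

If the out-of-memory killer is the cause, the kernel log on the affected machine usually records which process it killed, which would confirm whether more memory is needed.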