Segmentation fault in hmmsearch when using a large HMM to search a large protein

apcamargo commented 3 years ago

I built a long HMM from an alignment retrieved from EggNOG v5. The HMM is pretty long (66,439) and I get a segmentation error when querying it against sequences generated from the profile profile itself (using hmmemit).

I'm using hmmsearch version 3.3.2. The HMM and one test sequence are attached for reproducibility.

npcarter commented 3 years ago

Hello. One question, as we start to look into this. How much RAM does the machine you’re running this on have? HMMER can use significant memory when, processing large sequences, so it’s possible that you are running out of RAM.

apcamargo commented 3 years ago

This specific example that I've sent was executed in a 2TB node.

npcarter commented 3 years ago

FYI, I have been able to reproduce this on our systems, and am starting to debug it.

-Nick

On Fri, Sep 10, 2021 at 10:02 PM Antônio Camargo @.***> wrote:

This specific example that I've sent was executed in a 2TB node.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/EddyRivasLab/hmmer/issues/260#issuecomment-917318734, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDJBZCJIFYLE2NKTNSAXILUBK2BPANCNFSM5D2MVZBA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

npcarter commented 3 years ago

Sean, I think this one should to be on your plate, as it requires a better understanding of the main stage than I have. Here's what I've figured out so far:

The original segfault is occurring on line 327 of fwdback.c. It's occurring because dpc is being set to NULL. ox->allocR is 80763 (correct), but ox->validR is 16118 (incorrect). Thus, when i=16118, the dpc = ox->dpf[do_full * i]; computation on line 297 runs off the end of the list of allocated row pointers and returns garbage.

That appears to be happening because some of the computations in p7_omx_GrowTo are overflowing 32-bit integers with M= 66439 and L=80762 as found in the test. In particular ncells is overflowing, which seems to be causing the code to not decide that it needs to allocate more cells, leading to ox->allocR being correct but ox->validR being too small.

My first thought was to replace the computation of ncells and the arguments to ESL_RALLOC with code that explicitly casts all of the integer values to size_t to avoid the overflow, but that doesn't fix the problem. Instead, the test crashes because it's passing a NAN to p7_FLogsum.

The good news is that they've sent us an easy-to-replicate bug report, and one that reproduces under GDB. To reproduce: grab and unzip their attached zipfile, and run hmmsearch EggNOG_Bac_COG5184.hmm generated_sequence.faa.

If you want me to keep banging on this, I can, but I think you'll have a much easier time taking it from here than I would.

cryptogenomicon commented 1 year ago

I think this is fixed now, by commit 68f638a in our develop branch.

This is a ginormous DP matrix: 5.4G cells. Yes, the problem was 32-bit overflows in a couple of places in how we allocate and lay out our DP matrix memory.

I made the same sort of changes (though using int64_t, not size_t). I didn't see the NaN/p7_Logsum problem you did. The casts have to be not only in the alloc/realloc size calculations, but also in where we set dp row pointers into that allocated memory; the offsets for those row ptrs also need to be calculated in int64_t. If you only did the alloc/reallocs, that alone wouldn't fix it, and whatever happened with the NaN might've been the result somehow?

Because the issue is in the vector code, I made the same changes in our NEON and ARM implementations.

EddyRivasLab / hmmer

Segmentation fault in hmmsearch when using a large HMM to search a large protein #260