drowe67 / LPCNet

Experimental Neural Net speech coding for FreeDV
BSD 3-Clause "New" or "Revised" License

Illegal instruction exception inside nlp_create() on multiple platforms #47

Closed tmiw closed 1 year ago

tmiw commented 1 year ago

There have been reports of recent freedv-gui builds crashing on startup inside the code brought in by #43. Example crash dump from macOS (posted to the freetel-codec2 mailing list):

Process:               FreeDV [1050]
Path:                  /Applications/FreeDV.app/Contents/MacOS/FreeDV
Identifier:            org.freedv.freedv
Version:               ??? (1.8.5)
Code Type:             X86-64 (Native)
Parent Process:        ??? [1]
Responsible:           FreeDV [1050]
User ID:               501

Date/Time:             2022-11-15 16:57:48.889 +1100
OS Version:            Mac OS X 10.15.3 (19D76)
Report Version:        12
Anonymous UUID:        E215C155-F950-4ED2-AF3A-7CB392630AD3

Time Awake Since Boot: 31000 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes:       0x0000000000000001, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Illegal instruction: 4
Termination Reason:    Namespace SIGNAL, Code 0x4
Terminating Process:   exc handler [1050]

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   liblpcnetfreedv.0.4.dylib       0x000000010512c2ce __codec2__nlp_create + 254
1   liblpcnetfreedv.0.4.dylib       0x000000010511d16c codec2_pitch_create + 108
2   liblpcnetfreedv.0.4.dylib       0x0000000105121b6f lpcnet_dump_create + 159
3   liblpcnetfreedv.0.4.dylib       0x00000001051239e1 lpcnet_freedv_create + 33
4   libcodec2.1.0.dylib             0x0000000104fc78d7 freedv_2020x_open + 1095
5   libcodec2.1.0.dylib             0x0000000104fc1775 freedv_open_advanced + 101

and one from Linux at https://github.com/drowe67/freedv-gui/issues/292. For the Linux report at least, the workaround seems to be for users to build freedv-gui, codec2, and LPCNet themselves using the provided build scripts, so the issue appears to be build-related and could possibly be mitigated with CMake changes.

Unfortunately, I haven't been able to reproduce this locally so far, and debugging with libasan instrumentation hasn't made it any easier. Help along those lines would be appreciated.

tmiw commented 1 year ago

The crash seems to be happening here per the previous stack traces:

    for(i=0; i<m/DEC; i++) {
        nlp->w[i] = 0.5 - 0.5*cosf(2*PI*i/(m/DEC-1));
    }

I don't think there's any sort of buffer overflow happening here, but I could be missing something. The PR mentioned above did change at least one definition from COMP to float, but it doesn't look like struct NLP was changed.

drowe67 commented 1 year ago

Yes, it's a strange one. That code hasn't changed in years, but I guess you never know with C. If the compiler builds it OK, it's hard to see how it could be an illegal instruction. Perhaps the memory containing this code is being overwritten somehow.

tmiw commented 1 year ago

I disassembled the code around that area and it seems to be using AVX2 instructions:

    0x103d712f3 <+291>: vbroadcastsd 0x4074(%rip), %ymm1       ; pitch_gain_cb + 560
    0x103d712fc <+300>: vmovaps %ymm1, 0x80(%rsp)
    0x103d71305 <+309>: vbroadcastsd 0x3992(%rip), %ymm1       ; eband5ms + 64
    0x103d7130e <+318>: vmovaps %ymm1, 0x60(%rsp)
    0x103d71314 <+324>: vbroadcastss 0x358f(%rip), %xmm1
    0x103d7131d <+333>: vmovapd %xmm1, 0x50(%rsp)
    0x103d71323 <+339>: nopw   %cs:(%rax,%rax)
    0x103d7132d <+349>: nopl   (%rax)
    0x103d71330 <+352>: vmovdqa %xmm0, 0x40(%rsp)
->  0x103d71336 <+358>: vcvtdq2pd 0x40(%rsp), %ymm0
    0x103d7133c <+364>: vmulpd 0x80(%rsp), %ymm0, %ymm0
    0x103d71345 <+373>: vdivpd 0xa0(%rsp), %ymm0, %ymm0
    0x103d7134e <+382>: vcvtpd2ps %ymm0, %xmm0
    0x103d71352 <+386>: vmovapd %xmm0, 0x20(%rsp)
    0x103d71358 <+392>: vzeroupper 
    0x103d7135b <+395>: callq  0x103d746e4               ; symbol stub for: cosf

Considering that, thus far, people have been able to work around the issue by building freedv-gui themselves, I wonder if we need to disable AVX2 in LPCNet for the binaries that we put out. I don't know how much of a performance hit that would be, though.
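One way to confirm the theory on an affected machine would be to ask the CPU which vector extensions it reports. The following is a rough sketch that relies on the GCC/Clang builtins `__builtin_cpu_init()` and `__builtin_cpu_supports()`, which are only available on x86 targets. The function names `cpu_has_avx` and `cpu_has_avx2` are made up for this example and are not part of LPCNet or codec2. It queries both `avx` and `avx2`, since the disassembly above uses 256-bit `ymm` registers.

```c
#include <stdio.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_CPU_BUILTINS 1
#else
#define HAVE_CPU_BUILTINS 0
#endif

/* Each helper returns 1 if the running CPU reports the feature,
   0 if it does not, and -1 if this build cannot answer
   (non-x86 target or a compiler without the builtins). */
int cpu_has_avx(void)
{
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return -1;
#endif
}

int cpu_has_avx2(void)
{
#if HAVE_CPU_BUILTINS
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return -1;
#endif
}

/* Print a short report that could be attached to a bug report. */
void print_cpu_report(void)
{
    printf("avx:  %d\n", cpu_has_avx());
    printf("avx2: %d\n", cpu_has_avx2());
}
```

If `print_cpu_report()` shows 0 for a feature on a machine where the released binary crashes, that would be consistent with the binary having been built with that extension enabled. It would not by itself prove that this is the cause.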