FFTW / fftw3

DO NOT CHECK OUT THESE FILES FROM GITHUB UNLESS YOU KNOW WHAT YOU ARE DOING. (See below.)
GNU General Public License v2.0

fftw: planner.c:891: assertion failed: flags.u == u #214

Open dl9rdz opened 4 years ago

dl9rdz commented 4 years ago

I am using libfftw3 as part of the jt9 program of wsjtx (https://physics.princeton.edu/pulsar/K1JT/wsjtx.html). The platform is a Raspberry Pi 4 with the standard libfftw package, version 3.3.8-2, and wsjtx 2.2.2.

Occasionally, I get this error message when launching the jt9 program:

fftw: planner.c:891: assertion failed: flags.u == u

Program received signal SIGABRT: Process abort signal.

This seems to depend on the wisdom file (jt9_wisdom.dat). Recently the system got into a state where it always aborted with SIGABRT and the above error. Deleting the wisdom file made the problem go away.

I tried to understand what is going on and found this piece of code at planner.c:891

      flags.u = u;
      flags.timelimit_impatience = timelimit_impatience;
      flags.hash_info = BLESSING;

      CK(flags.l == l);
      CK(flags.u == u);

How is it possible at all that the assertion in the last line fails!? Why is the assertion there in the first place, is there any reason the developers anticipated that something might go wrong here?

matteo-frigo commented 4 years ago

The field flags.u is a 20-bit quantity "unsigned u:20", whereas the local variable "u" is plain unsigned. The assertion checks that we don't overflow the 20-bit field.
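To make the truncation concrete, here is a minimal sketch of the round-trip check. The struct `flags_sketch` and the helper `fits_in_u` are hypothetical stand-ins, not FFTW's actual definitions; only the 20-bit width of `u` is taken from the explanation above.

```c
/* Hypothetical stand-in for the planner's flags struct.
   Only the 20-bit width of u comes from the discussion. */
struct flags_sketch {
    unsigned u : 20;
};

/* Returns 1 if value survives a round trip through the 20-bit field,
   0 if the assignment dropped high bits (the case CK() guards against). */
int fits_in_u(unsigned value)
{
    struct flags_sketch flags;
    flags.u = value;            /* an unsigned bit-field keeps only the low 20 bits */
    return flags.u == value;    /* same comparison as CK(flags.u == u) */
}
```

Any value up to 0xFFFFF passes the check; 0x100000 and above is silently truncated on assignment, so the comparison fails.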

Now the question is, how did a value wider than 20 bits end up in the wisdom file? The FFTW logic is pretty straightforward in this respect: we just set a bunch of bits. Maybe there is a compiler bug that sign-extends bit 20 into 32 bits, or maybe I am misunderstanding the C standard and the sign extension is allowed, in which case this would be an FFTW bug (that didn't affect anybody in 20 years).

dl9rdz commented 4 years ago

Thanks for the explanation! That was a great help in understanding what is going on.

A possibly important detail is that I run multiple jt9 processes concurrently, and I assume that each process reads the wisdom file on startup and saves it before termination.

Might it be possible that concurrent access to the wisdom file (one process reading the file while another process is writing it) causes corrupt wisdom data to be read? That would explain the "occasional" errors. And two concurrent wisdom save operations on the same file might have caused the persistent corruption of the file.

matteo-frigo commented 4 years ago

The FFTW planner is not thread-safe, as documented in the manual, and therefore concurrent planner invocations will definitely corrupt its state.

I have to say that it is hard to imagine how data races could cause this particular corruption, though. (Basically, the routine that outputs the wisdom file reads 20 bits from memory. They may be the wrong 20 bits, but I don't see how you can end up with more than 20 bits unless somehow you have multiple threads writing to the same wisdom file.) Does the problem disappear if you run single-threaded or if you add a mutex around invocations of the FFTW planner?
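For the mutex suggestion, a minimal sketch of what that could look like for threads inside one process. The names `planner_lock`, `plan_fn`, `plan_with_lock` and `dummy_plan` are hypothetical; in a real program the callback would call an FFTW planner routine such as `fftw_plan_dft_1d`, and executing an already created plan would stay outside the lock.

```c
#include <pthread.h>

/* One process-wide lock that serializes all planner calls (hypothetical name). */
static pthread_mutex_t planner_lock = PTHREAD_MUTEX_INITIALIZER;

/* Callback type standing in for "something that invokes the planner". */
typedef void *(*plan_fn)(void *arg);

/* Runs make_plan(arg) while holding planner_lock and returns its result. */
void *plan_with_lock(plan_fn make_plan, void *arg)
{
    pthread_mutex_lock(&planner_lock);
    void *plan = make_plan(arg);   /* e.g. a wrapper around fftw_plan_dft_1d */
    pthread_mutex_unlock(&planner_lock);
    return plan;
}

/* Stand-in planner for illustration only: returns its argument unchanged. */
static void *dummy_plan(void *arg)
{
    return arg;
}
```

A lock like this only protects threads that share one address space; it does nothing for separate processes that share a wisdom file.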

stevengj commented 4 years ago

> The FFTW planner is not thread-safe

If this is multiple processes (separate address spaces) rather than multiple threads, you should be okay as far as the planner is concerned.

It seems possible that there could be some wisdom-file corruption from concurrent reads/writes by multiple processes.

If that's the problem, the right thing to do is probably the "atomic rename" technique: something like:

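/* the temporary file should live on the same filesystem as the wisdom file,
   otherwise rename() fails instead of replacing it atomically */
char name[] = "/tmp/wisdomXXXXXX";
FILE *f = fdopen(mkstemp(name), "w"); /* fdopen requires a mode argument */
fftw_export_wisdom_to_file(f);
fclose(f);
rename(name, "...actual wisdom file...");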
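
A more defensive version of the same idea, as a sketch only. The helper name `atomic_write_file` is hypothetical, and it writes a plain string so that it runs without FFTW; with FFTW, the `fputs` call would be replaced by `fftw_export_wisdom_to_file(f)`. The temporary file is created next to the target rather than in /tmp, on the assumption that both then sit on the same filesystem, which rename() needs in order to replace the target atomically.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace the file at path with data in one atomic step.
   Returns 0 on success, -1 on failure (the old file is left untouched). */
int atomic_write_file(const char *path, const char *data)
{
    char tmp[4096];
    /* Temp name next to the target so rename() stays within one filesystem. */
    int n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = mkstemp(tmp);          /* creates and opens a unique file */
    if (fd < 0)
        return -1;

    FILE *f = fdopen(fd, "w");
    if (f == NULL) {
        close(fd);
        unlink(tmp);
        return -1;
    }

    /* With FFTW, fftw_export_wisdom_to_file(f) would go here instead. */
    int ok = (fputs(data, f) != EOF);
    ok = (fflush(f) == 0) && ok;    /* push stdio buffers to the kernel */
    ok = (fsync(fd) == 0) && ok;    /* and ask the kernel to write them to disk */
    ok = (fclose(f) == 0) && ok;

    /* Readers see either the complete old file or the complete new one. */
    if (!ok || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

Calling `atomic_write_file("jt9_wisdom.dat", text)` from several processes at once would still let the last writer win, but no reader should ever see a half-written file.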

maltium commented 3 years ago

I got the same error. I'm using Rubber Band (https://breakfastquay.com/rubberband/), which uses FFTW under the hood, and the issue happens at random when running multiple processes.

stevengj commented 3 years ago

It looks like rubberband saves and loads wisdom frequently, so it might be running into an issue with the wisdom file getting corrupted by concurrent reads and writes.

I wonder if we should do the atomic-rename trick internally in FFTW to help protect people doing this? It would be nice to have some confirmation that this is, in fact, the problem: if you patch rubberband to use the above trick, does the problem go away?

g4wjs commented 3 years ago

Just to clarify, the WSJT-X application when running multiple instances will have unique FFTW3 wisdom files for each process, so there should be no opportunity for corruption by multiple processes accessing the same wisdom files concurrently.

BiplabRaut commented 3 years ago

Hi Matteo and Steven,

Recently, I encountered this issue while running VASP (Vienna Ab initio Simulation Package) with FFTW.

When the wisdom feature (the import and export APIs) is used, the wisdom file content gets corrupted and this assertion error is thrown. VASP is run in pure MPI mode, where multiple MPI processes invoke FFTW's single-threaded APIs for setup() and execute(). If the wisdom APIs are not used, the issue never occurs.

Running the QE (Quantum ESPRESSO) application with wisdom support does not produce this error.

Thanks.

stevengj commented 3 years ago

Try patching VASP to use the atomic-rename trick so that multiple processes don't overwrite the wisdom file and corrupt it? Or only write the wisdom file from a single process?

Again, if multiple processes are concurrently writing the wisdom file in an unsafe manner, that is an application error…