fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

segfault reading (possibly corrupt) fst file on Linux/Mac #233

Open · lhenneman opened this issue 4 years ago

lhenneman commented 4 years ago

Hi,

Love the package! But I've got a sticky problem. I wrote a program that writes millions of .fst files in parallel. Some of these, it turns out, are corrupt (possibly because of issues with the scratch file system I'm using, possibly because of an accident writing the same file from parallel processes). There are no differences in the metadata between uncorrupted and corrupted files.

When I try to read in these corrupt files on my Linux cluster, I get a segfault error that I can repeat on my Mac. I'd like to be able to test for whether the file is corrupt or not so that I can avoid the segfault crash.

Here's an example corrupted file: https://www.dropbox.com/s/eyoaq7nsxp8hhrl/corrupt.fst?dl=0
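
(A possible stopgap for screening files, sketched below: attempt the read in a throw-away child R process, so a segfault only kills the child and never the main session. This assumes Rscript is on the PATH and treats any non-zero exit status as "could not be read"; it is just an illustration, not part of fst itself.)

```r
# Try to read an fst file in a separate R process; a crash (segfault)
# only takes down the child process, never the calling session.
is_readable_fst <- function(path) {
  expr   <- sprintf('invisible(fst::read_fst("%s"))', path)
  status <- system2("Rscript", c("-e", shQuote(expr)),
                    stdout = FALSE, stderr = FALSE)
  identical(status, 0L)
}

# Example usage: flag unreadable files before processing them
# files   <- list.files("out_dir", pattern = "\\.fst$", full.names = TRUE)
# corrupt <- files[!vapply(files, is_readable_fst, logical(1))]
```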

MarcusKlik commented 4 years ago

Hi @lhenneman, thanks for submitting your issue and making the corrupt file available for investigation!

I will try and find the exact problem with this particular file and get back to you on that.

Are you using parallel R processes to write the files from the cluster (for example with parLapply)? In principle, I would expect the write lock on files that are open for writing to prevent access to the same file from different nodes in the cluster. What kind of accident are you referring to exactly?

thanks!

renkun-ken commented 4 years ago

It looks similar to #214.

MarcusKlik commented 4 years ago

Hi @lhenneman,

I checked your file, and from the debugger I can see that it has 5 columns and 11319 rows, is that correct?

Upon reading the first 32 kB block of the second (double) column, decompression fails.

The file position of this compressed block is retrieved from the metadata. The metadata itself is hashed and verifies correctly, so fst must have written the file to the end and calculated the correct hashes for the metadata.
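
(In other words, checking only the metadata will not flag this kind of corruption. A small sketch of what that looks like from R, using the example file name from above and assuming the exported fst functions metadata_fst() and read_fst():)

```r
library(fst)

# The metadata block is hashed and intact, so this call returns normally
# (reporting 5 columns and 11319 rows) even for the corrupt file:
metadata_fst("corrupt.fst")

# Only reading the data blocks triggers the problem:
# read_fst("corrupt.fst")   # segfaults while decompressing the second column
```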

But the on-disk data inside the compressed block still contains errors. That data is not (yet) hashed, so at write time there is no check that the compressed in-memory data is actually equal to the data stored on disk.

Basically this means that there was an error writing that particular block to file, and that error went undetected during the write phase. And because data blocks are not hashed, we can't check on loading whether there are errors in the on-disk block.

Subsequent decompression (with LZ4 in this case) then crashes the process, as the decompressor is not robust to corrupt input.

The only way to solve these problems is to actually hash the compressed blocks as well (that's fast, but still takes a small bite out of the speed). Or leave it up to the user whether data blocks should be hashed...

I will check further if there is something more that we can do during the write phase to catch these errors!

lhenneman commented 4 years ago

Thanks so much for looking into this and your helpful information! For your questions:

1) These files are written from independent R processes, so there are no wrappers like parLapply or mclapply protecting the files beyond any system-level protections. But I'm writing these ~4.5 million .fst files from 500 processes, so the likelihood that two processes would be writing to the same file is small. I think the more likely cause is an issue with the file system I'm writing to (it's been having problems handling traffic recently).

2) Yes, 5 columns and 11319 rows is correct.

Would the idea behind a potential user option to hash the compressed blocks be that they could use this option to be sure to get uncorrupted files (at the expense of a little speed)?

MarcusKlik commented 4 years ago

Would the idea behind a potential user option to hash the compressed blocks be that they could use this option to be sure to get uncorrupted files (at the expense of a little speed)?

Yes, exactly. When loading a data block from a fst file into memory, the hash can be calculated. If that hash corresponds to the hash that was stored in the file itself, we can (almost) be sure that the data was not corrupted in the read/write cycle.
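
(Until block hashes are available inside the format itself, a rough user-level approximation is to store a checksum of the whole file right after writing and verify it before reading. A minimal sketch using tools::md5sum() from base R; the ".md5" side-car file name is just an illustration, and this mainly catches corruption that appears after the bytes were written, not errors inside the write path itself.)

```r
library(fst)

# Write an fst file together with a checksum side-car file.
write_fst_checked <- function(df, path) {
  write_fst(df, path)
  writeLines(unname(tools::md5sum(path)), paste0(path, ".md5"))
}

# Verify the checksum before handing the file to read_fst().
read_fst_checked <- function(path) {
  expected <- readLines(paste0(path, ".md5"))
  actual   <- unname(tools::md5sum(path))
  if (!identical(actual, expected)) {
    stop("Checksum mismatch, refusing to read possibly corrupt file: ", path)
  }
  read_fst(path)
}
```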