fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

R crashes while reading an fst file #271

Open sanjmeh opened 1 year ago

sanjmeh commented 1 year ago

A simple fst read can send R crashing down if the file is corrupted!

How can a data file be so corrupted that it crashes R? Perhaps the fst read function does some aggressive memory management that interferes with the OS.

To replicate, just execute a simple

fst(filename)

and you will get:

<fst file>
323140 rows, 4 columns (1204011660.fst)

And then a series of error messages, followed by R crashing.

[2706278:2706278:20221107,172349.750548:ERROR process_memory_range.cc:86] read out of range
[2706278:2706278:20221107,172349.750641:ERROR elf_image_reader.cc:558] missing nul-terminator
[2706278:2706278:20221107,172349.750779:ERROR elf_dynamic_array_reader.h:61] tag not found
[... the same "elf_dynamic_array_reader.h:61] tag not found" line repeats roughly 70 more times with increasing timestamps ...]
[2706278:2706278:20221107,172349.760165:ERROR file_io_posix.cc:140] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[2706278:2706278:20221107,172349.760187:ERROR file_io_posix.cc:140] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
[2706278:2706280:20221107,172349.766128:ERROR directory_reader_posix.cc:42] opendir: No such file or directory (2)

I have uploaded the offending file here. https://drive.google.com/file/d/1hYJLAcqct_5JxTNNXN1c-qKH9bWFhgmO/view?usp=sharing

eddelbuettel commented 1 year ago

Can you please try to turn this into a self-contained reproducible example, with a script that creates a file which subsequently crashes R?

No sane person will read a random binary file off the internet.

sanjmeh commented 1 year ago

@eddelbuettel: thanks for looking at this. Generating the corrupted file with a script looks very difficult, because currently thousands of fst files are created or overwritten by a crontab scheduler that runs every minute (IoT data keeps coming in from thousands of vehicles, and we store their tracking and fuel-level data in fst files). Corruption happens in roughly 1 in a thousand writes, and we do not (yet) know how it happens; it is a random event. I suspected that fst's multi-core read/write was creating impossible memory allocations, but that was just a hunch. I really do not know how to recreate the corruption with a script.

To prevent multi-core use, I have also added the following two lines, as recommended in one of the GitHub issue threads.

fst::threads_fst(nr_of_threads = 1)
fst::threads_fst(reset_after_fork = FALSE)

But I still regularly get corruptions and the resulting crashes.

MarcusKlik commented 1 year ago

Hi @sanjmeh, thanks for reporting. And I will definitely heed @eddelbuettel's warning not to try to load your binary file :-)

In the fst format, all metadata is hashed. So if this data becomes corrupted for some reason, it's extremely unlikely that the file will read without throwing a (friendly) error. Obviously, a malicious agent could alter the metadata and the stored hashes to defeat this check and mess up a file read.

The metadata determines how much memory is allocated for the result table. The actual column data, however, is decompressed from data blocks in the file using zstd or lz4. In rare cases, malformed data blocks can crash those libraries during this decompression phase.
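For what it's worth, the hash check on the metadata can be exercised without touching the data blocks via `metadata_fst()`. A sketch on a throwaway file (note that in the report above the row and column counts printed fine, so a passing metadata read does not guarantee the data blocks themselves are intact):

```r
library(fst)

# Write a small table, then inspect only its hash-verified metadata.
# metadata_fst() does not decompress any data blocks, so a corrupt
# block would not be touched at this stage.
path <- tempfile(fileext = ".fst")
write_fst(data.frame(id = 1:323, level = runif(323)), path)

meta <- metadata_fst(path)
print(meta)  # rows, columns and column types, as in the report above
```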

To remedy this, we could use the safe versions of the lz4 and zstd decompression functions, but these would destroy performance.

Alternatively, fst could provide an option to hash the data blocks as well (something like write_fst(x, path, hash_data = TRUE)). For these hashed files, reads could be done using read_fst(path, check_hashes = TRUE), for example.

This would have a smaller impact on performance and could be used for files read from the internet or other suspicious sources (the check would only need to be done once after downloading).
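Until such an option exists, a similar effect can be approximated outside fst with a sidecar checksum, using base R's `tools::md5sum()`. This is only a sketch under the assumption that the writer and reader can share a small `.md5` file next to each `.fst` file; the helper names are made up here, not part of the fst API:

```r
library(fst)

# Record a checksum right after a (presumably clean) write...
write_fst_checked <- function(df, path) {
  write_fst(df, path)
  writeLines(unname(tools::md5sum(path)), paste0(path, ".md5"))
}

# ...and verify it before handing the file to read_fst(), so a
# file corrupted on disk fails with an R error instead of a crash.
read_fst_checked <- function(path) {
  expected <- readLines(paste0(path, ".md5"))
  actual <- unname(tools::md5sum(path))
  if (!identical(expected, actual))
    stop("checksum mismatch: ", path, " appears corrupted")
  read_fst(path)
}
```

This catches on-disk corruption, though not corruption that happens during the write itself (the checksum would then match the already-bad bytes).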

sanjmeh commented 1 year ago

Thank you @MarcusKlik, and welcome back to your own repository. That was indeed a long break, and I was wondering whether you would be back soon. Now, on your suggested path:

Alternatively, fst could provide an option to hash the data blocks as well (something like write_fst(x, path, hash_data = TRUE)). For these hashed files, reads could be done using read_fst(path, check_hashes = TRUE), for example.

I do not see a hash_data argument in write.fst(), so I presume you are proposing this functionality and the data-hashing feature does not exist in the current version.

Meanwhile, I will test the first alternative:

To remedy this, we could use the safe versions of the lz4 and zstd decompression functions, but these would destroy performance.

Could you please specify how to try the safe options? That would be helpful, as I have not been able to locate the arguments so far.

By the way, may I request that you have a look at the fst file I attached and not treat it as just any random binary file from the internet? I can vouch that it originates from my system, not from "the internet" :-)

eddelbuettel commented 1 year ago

@sanjmeh As another open-source volunteer I am a little surprised by your tone. We give you our labor for free.

sanjmeh commented 1 year ago

Oh my! My intention was not at all to offend you. You are doing a fantastic job in the R open-source community, and I would never want to turn you away. I hope I am making the fst package more popular by asking for it to be made more robust. Let me know what was hurtful. Thanks.

MarcusKlik commented 1 year ago

Yes, unfortunately time is a scarce resource that can only be spent once (except for @eddelbuettel, my theory is that Dirk is somehow able to clone himself into identical copies that can do work in parallel, proof pending...) :-)

About your file, @sanjmeh: I will scan the metadata from a container and take a look at where things go wrong.

sanjmeh commented 1 year ago

Hi Marcus, any progress on the bug?

jfdesomzee commented 12 months ago

Hello,

I'm suffering from this bug too. I never had an issue before; it appeared when multiple machines started writing files to a shared drive. Is there a way to test a file before trying to load it? Whenever I read a corrupted file, R crashes; if I could get an error instead, my problem would be solved. fst rocks and I want to keep using it. Please help, and thank you for the good work.
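One stopgap for getting an error instead of a crash is to do the read in a throwaway child process. A sketch, assuming the callr package is available (`safe_read_fst` is a made-up name, not an fst function): if the child segfaults on a corrupt file, `callr::r()` raises a catchable error in the parent session rather than taking it down.

```r
library(fst)

# Read an fst file in a disposable child R process. If the child
# crashes on a corrupt file, callr signals an error in this session
# instead of crashing it.
safe_read_fst <- function(path) {
  tryCatch(
    callr::r(function(p) fst::read_fst(p), args = list(path)),
    error = function(e)
      stop("reading ", path, " failed (file may be corrupt): ",
           conditionMessage(e))
  )
}
```

The cost is one extra R process startup plus copying the result between processes, so this is for untrusted files, not hot paths.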

AntonWijbenga commented 5 months ago

I have previously encountered this error as well, and today again. I suspect the .fst file becomes corrupt during a 'forced' system reboot on a Windows machine (a secondary on-premise setup; primary/production runs in the cloud on Ubuntu).

I can read the metadata of the .fst file fine, but reading the whole file crashes R. It would be great if this somehow resulted in an error instead of crashing R. I'm happy to provide the .fst file if needed for testing.

Otherwise the fst package is great, and so far I haven't encountered a better alternative (except maybe parquet, because of its cross-language (i.e. Python) support).

jfdesomzee commented 5 months ago

I switched from fst to qs. About the same performance, a bit faster even. The only drawback is that you need to read the whole file; you cannot query rows or columns. But you can store any R object, attributes included.

sanjmeh commented 5 months ago

I switched from fst to qs. About the same performance, a bit faster even. The only drawback is that you need to read the whole file; you cannot query rows or columns. But you can store any R object, attributes included.

And what is its advantage over RDS files?

eddelbuettel commented 5 months ago

@sanjmeh Start here: https://github.com/traversc/qs

qs and fst are both very good, and both improve over rds files, which are themselves good and portable across R installations.

AntonWijbenga commented 5 months ago

Thank you for the tip. However, the ability to read only certain rows or columns is one of the main reasons I use the fst package.

I have matrices with measurements for each minute for a certain number of sensors. As a result I have matrices that are 1,440 (the number of minutes in a day) x 18,000 or 80,000 (depending on the sensor type). Using these daily matrices and their pivoted clones, I can very quickly read just one minute of a specific day (the date is the filename, the minute the n-th column) or read the 24-hour series of a sensor (again the date is the filename and the column name the ID of the sensor).

Reading such a column (or a set of them) takes only a few milliseconds. Reading an entire year of data for a couple of sensors (using their IDs) is done in a couple of seconds. It is very quick to create certain aggregates (over time) that way.

The same is true for reading a given minute's data for all sensors. For example, you can very quickly compute a typical (average) value for a Tuesday at 11:00 based on a set of previous Tuesdays (also at 11:00).

The entire dataset is historically available from 2018 and is still updated every minute. It is about 500GB (compressed) and stored on SSD based storage (FSx for Lustre at AWS). Results are presented through a dashboard.

For these purposes it is simply far too slow to read the whole matrix every time. With the solution above, I can read along both the 'sensor' dimension and the 'time' dimension very quickly, no matter whether it is recent or older data (no caching needed). I have also tested databases, but they are either too slow or too costly.
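The access pattern described above maps directly onto `read_fst()`'s `columns`, `from` and `to` arguments. A small self-contained sketch of the same layout (sensor names and sizes here are illustrative, scaled down from the real 18,000+ columns):

```r
library(fst)

# Toy version of the layout described above: one file per day,
# rows = minutes (1440), columns = sensors.
day <- as.data.frame(matrix(rnorm(1440 * 50), nrow = 1440))
names(day) <- sprintf("sensor_%03d", seq_len(50))
day_file <- tempfile(fileext = ".fst")
write_fst(day, day_file)

# 24-hour series for two sensors: read only those columns.
series <- read_fst(day_file, columns = c("sensor_007", "sensor_042"))

# One minute (row 661, i.e. 11:00) across all sensors: read one row.
minute_661 <- read_fst(day_file, from = 661, to = 661)
```

Because fst stores columns in compressed blocks with an index, both reads touch only the requested slice of the file rather than decompressing the whole table.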

sanjmeh commented 5 months ago

I have matrices with measurements for each minute for a certain number of sensors. As a result I have matrices that are 1,440 (the number of minutes in a day) x 18,000 or 80,000 (depending on the sensor type). Using these daily matrices and their pivoted clones, I can very quickly read just one minute of a specific day (the date is the filename, the minute the n-th column) or read the 24-hour series of a sensor (again the date is the filename and the column name the ID of the sensor).

I have exactly the same application, and we also started with the fst package for exactly this reason. But I have now moved to MariaDB due to this occasional corruption of fst files. We use RDS for data up to 100 MB and move larger data to the RDBMS with the timestamp as primary index, so we can quickly query a specific time range.