fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

"Embedded nul in string" error in large files #119

Open hyokangs opened 6 years ago

hyokangs commented 6 years ago

I tried to load the following (zipped) tsv file with fread and then saved using write_fst (specifying uniform_encoding=F).

http://www.patentsview.org/data/20170808/uspatentcitation.zip or http://www.patentsview.org/data/20170808/rawinventor.zip

Tried to load fst file, but get the following error message:

Error in fstretrieve(fileName, columns, from, to, old_format) : embedded nul in string: 'à«z\0\0\0\0ðÇ\005'

It seems that the specific wording of the error changes every time I try. Sometimes it successfully loads the full data without error.

MarcusKlik commented 6 years ago

Hi @hyokangs, thanks for reporting!

I will try to reproduce your error, what OS and R version are you using? Those are very interesting public data-sets by the way!

hyokangs commented 6 years ago

Hi @MarcusKlik,

Thank you for your response. My working environment yesterday was: Windows Server 2012R2 & R 3.3.3. I tried to replicate the error on my laptop, Windows 10 Pro and R 3.3.4, but failed to do so -- meaning, everything worked fine on my laptop.

I will post an update once I encounter the same error again. Thanks!

MarcusKlik commented 6 years ago

Hi @hyokangs,

I managed to download your files and used the following code to test:

library(data.table)
library(fst)

# read and rewrite as fst file
x <- fread("rawinventor.tsv")
write_fst(x, "rawinventor.fst", uniform_encoding = FALSE)

# read from fst file
x_read <- read_fst("rawinventor.fst", as.data.table = TRUE)

# verify equality of both tables
testthat::expect_equal(x, x_read)

# contents
metadata_fst("rawinventor.fst")
#> <fst file>
#> 14959652 rows, 8 columns (rawinventor.fst)
#> 
#> * 'uuid'          : character
#> * 'patent_id'     : character
#> * 'inventor_id'   : character
#> * 'rawlocation_id': character
#> * 'name_first'    : character
#> * 'name_last'     : character
#> * 'sequence'      : integer
#> * 'rule_47'       : integer

Everything seems to work as expected (using the latest CRAN version of fst). Your error looks like a problem with reading the format correctly. Sometimes that happens when you try to read from a fst file that was created with a dev version (< 0.8.0) of the package, does that apply to your case?

Is there a special reason why you are using uniform_encoding = FALSE? Your file doesn't seem to contain rows with a different encoding than the first row.

Thanks for testing!

hyokangs commented 6 years ago

Thank you, @MarcusKlik.

I added "uniform_encoding=FALSE" after getting an error message ("Embedded nul in string"). This did not solve the problem, though.

After multiple tests, I think it is a problem with the network drive. Whenever I write to and read from my local drive, no problem occurs. When I try to write to and read from a network drive, I frequently encounter this problem. I believe this is due to slow AND unstable I/O.

Will update you if any problem occurs with my local drive.

MarcusKlik commented 6 years ago

Hi @hyokangs, thanks for testing! Yes, it looks like the byte stream is interrupted somehow when you load from the network drive. Still, normally loading from a network drive should just work (but slower obviously).

During a read or write, fst fires a lot of short system API calls at the file system. Local (SSD) drives know how to handle (and combine) those effectively, but perhaps your network drive has problems with the high rate of calls. It would be interesting to learn whether the problem occurs during the write or the read operation. Once your file is stored, can you sometimes read the data correctly (in that case the error probably occurs during the write)? Or do you get the same error message every time you load that file (in that case the error probably occurs during the read)?
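A minimal sketch of that diagnostic (not part of fst; all names here are hypothetical): re-read the same file several times and compare against the in-memory copy. The writer and reader are injected, so `fst::write_fst`/`fst::read_fst` or base `saveRDS`/`readRDS` can be used interchangeably.

```r
# Hypothetical helper: distinguish a bad write from flaky reads.
# write_fun(df, path) writes the table; read_fun(path) reads it back.
diagnose_io <- function(df, path, write_fun, read_fun, n_reads = 5) {
  write_fun(df, path)
  ok <- vapply(seq_len(n_reads), function(i) {
    # a failed or corrupted read counts as a mismatch
    tryCatch(identical(read_fun(path), df), error = function(e) FALSE)
  }, logical(1))
  if (all(ok))      "all reads match: write and read both fine"
  else if (any(ok)) "reads vary between attempts: suspect the read path"
  else              "no read ever matched: suspect the write"
}
```

Consistent mismatches for one stored file point at the write; results that vary between attempts point at the read.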

Giqles commented 6 years ago

I'm having what seems to be the same issue -- it appears to be write-related for me. Writing to a network file system sometimes fails silently -- I've found that if I write, then read until I get one successful read, it seems to be fine from then on.

I can't share my data or environment -- I am using fst v 0.8.2.

Some observations though:

  1. it happened for me with a set of tables that contained quite a few character columns -- approx 300k lines and 30 character columns caused all of my problems so far.

  2. other tables that were much larger had no issues -- 200m rows and 60 numeric columns didn't present any problems so far

If this is related to the byte stream getting interrupted would increasing the amount of compression help reduce the number of errors? (by making the amount to actually write to the file system smaller?)

MarcusKlik commented 6 years ago

Hi @Giqles, thanks for reporting. Yes, apparently the stream gets interrupted by the large number of relatively small writes to the network drive. One of the features that is planned to be released soon is a hash calculation of the actual data blocks that are stored. That would give the user a 'clean' error when incorrect data is read from disk (and could pinpoint the problem to the responsible column and approximate row number). Perhaps that will offer some information as to which column types are lost most often.

Also, writing much larger blocks for 'slow' drives would solve the problem I hope, because network drives usually have larger cache size than local drives.

Perhaps offering the user a 'network' mode with strong compression (as you say) and bigger chunks for writing would be a good option, although I would like to understand what's causing the problem so that we can fix it for any setting.
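A fragment illustrating the strong-compression idea with the existing API (the UNC path is a placeholder): stronger compression shrinks the byte stream that crosses the network, at the cost of CPU time.

```r
# compress ranges from 0 to 100 in write_fst(); high values select the
# stronger ZSTD settings and minimize the bytes sent to the (network) drive.
fst::write_fst(x, "//server/share/data.fst", compress = 100)
```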

Giqles commented 6 years ago

I had what might be a related issue -- a file I've worked with a lot got a small subset of dates set (incorrectly) to 1970-01-01 in the stored file on one write. This didn't happen consistently; running exactly the same code again when I spotted the issue didn't result in any noticeable problems with the data.

shrektan commented 5 years ago

@MarcusKlik (I'm using fst version 0.8.8.) I was bitten by the same error at times... Sorry that I can't reproduce it reliably, but the error occurs every few weeks... What's weird is, if I delete the file and re-write it, it works fine...

The target table should be around 1000 KB. However, when the error happens, I find that the size of the file has increased to 3 GB...

The table I write contains some non-ASCII characters. I suspect that relates to the bug, although I have other tables containing non-ASCII strings that have never had a problem.

The error message is something like below:

   [2018-12-22 03:05:52.219][INFO] synchoronizing tbl `mf_chargeratenew`...
    Error in fstretrieve(fileName, columns, from, to, old_format) : 
      embedded nul in string: '\0\0\0\0'
    Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> <Anonymous> -> fstretrieve -> .Call
    Execution halted
MarcusKlik commented 5 years ago

Hi @shrektan, thanks for reporting!

Do you also see these occasional errors when reading from a network drive? It's difficult to pinpoint the exact problem with the errors reported by @Giqles and @hyokangs. I have been trying to reproduce it by reading large fst files with character columns using various encodings from a network drive, but didn't get any errors so far.

But your problem seems to be different with the file blowing up to a large size. Do I understand you correctly that you don't get an error while writing but you see a result file that's much larger than can be expected? And when you read that (large) file, you get the error that you reported above?

thanks for testing! If you have a (generated) sample file that triggers the (occasional) error, that would be great for debugging!

Giqles commented 5 years ago

In case it helps, the network file system I was using that had this error was a gluster system on AWS (which was the storage for a pool of Rstudio pro servers -- which I assume is a reasonably common set up). It might be a lot of effort to set up that to debug this issue! But perhaps with @hyokangs files in the original issue and that extra info you might be closer to a reproducible version of the bug.

It sounds like @hyokangs network file system was slow and volatile. The gluster system I was using was as well -- I'm not sure how easy it will be to reproduce those issues!

MarcusKlik commented 5 years ago

thanks @Giqles, yes, I think that when, at some point in the data transfer from a network drive, a few bytes are missed or mis-arranged, fst does not handle that very well.

Because reproducing these issues is very difficult, as you already mention, the best way to tackle them is probably to make fst calculate a hash for each data block written to disk. By verifying the hash before decompressing the data, we can (almost) make sure that the retrieved bytes are identical to those written to disk earlier. The same is already done for all meta-data stored in the fst format. Expanding hashes to column data would certainly add to the stability of fst.

Hashing can be done at high speed using the xxHash algorithm that is part of the ZSTD and LZ4 compression libraries contained within fst. Nevertheless, it will still take a small bite out of the performance, so hashing column data should probably be opt-in, I think (see also this comment).
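The per-block scheme described above can be sketched in plain base R. This is illustrative only: fst would use xxHash from its bundled compression sources, so a trivial byte-sum checksum stands in for it here, just to show the "hash each block on write, verify before trusting it on read" idea.

```r
# Stand-in for xxHash: a trivial byte-sum checksum (illustration only).
checksum <- function(bytes) sum(as.integer(bytes)) %% 2^31

# Write one data block as: length, stored hash, payload.
write_block <- function(con, block) {
  writeBin(length(block), con)
  writeBin(as.integer(checksum(block)), con)
  writeBin(block, con)
}

# Read one block back and verify its hash before using the payload.
read_block <- function(con) {
  len   <- readBin(con, integer(), 1)
  hash  <- readBin(con, integer(), 1)
  block <- readBin(con, raw(), len)
  if (checksum(block) != hash)
    stop("block hash mismatch: data corrupted on disk")
  block
}
```

A mismatch turns silent corruption into a 'clean' error that names the failing block, which is exactly the diagnostic value discussed here.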

thanks for sharing the details of your setup!

ryankennedyio commented 5 years ago

@MarcusKlik That seems a useful fix. We do occasionally get similar problems.

A possibly related issue we're seeing with a ~30GB file written to network disk is incorrect data. Most columns we write are numeric, and when we read them back out we reasonably often get large slabs of '0' values back (not NULL or NA or errors, just '0').

I've debugged every step of our process quite thoroughly, and immediately prior to writing to disk the data is always correct. If I write the data to disk and read it back out, occasionally we get the '0' problem (on every subsequent read of that problem file it'll happen too, implying it's an issue with the write).

Very inconsistent and extremely hard to reproduce though!

I'll try to write the file temporarily on local disk before moving onto the network drive, instead of directly to the network.

Giqles commented 5 years ago

@ryankennedyio -- that issue of many 0s sounds the same as my 1970-01-01 problem above:

I had what might be a related issue -- a file I've worked with a lot got a small subset of dates set (incorrectly) to 1970-01-01 in the stored file on one write. This didn't happen consistently; running exactly the same code again when I spotted the issue didn't result in any noticeable problems with the data.

I guess a temporary solution for this is to write a function that repeatedly tries to write a file until it gets a successful read that is identical to the in-memory data. Given how much faster fst is than the alternatives it might well still be quicker!?
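A minimal sketch of that retry helper (hypothetical names; writer and reader are injected so base R functions can stand in for fst during testing):

```r
# Rewrite the file until a fresh read matches the in-memory table,
# up to a retry limit; error out if verification never succeeds.
write_until_verified <- function(df, path, write_fun, read_fun, max_tries = 5) {
  for (i in seq_len(max_tries)) {
    write_fun(df, path)
    ok <- tryCatch(identical(read_fun(path), df), error = function(e) FALSE)
    if (ok) return(invisible(TRUE))
  }
  stop("could not verify '", path, "' after ", max_tries, " attempts")
}
```

With fst, note that `read_fst()` returns a plain data.frame by default, and `identical()` is strict about classes, so compare like with like (e.g. pass `as.data.table = TRUE` when the in-memory table is a data.table).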

MarcusKlik commented 5 years ago

Hi @ryankennedyio and @Giqles, thanks for sharing your results!

Indeed, the two problems seem related. Apparently, some writes fail and in those cases, zeros instead of actual data end up in the resulting dataset.

But in your cases, the writes use standard compression (right?), so the zeros could very well be the result of partially or incorrectly decompressed data blocks. Even a single-byte error in the compressed data block could result in many zeros when decompression is incomplete.

So hashing the (compressed) data block would be a good solution to detect those kinds of errors. Also, a comparison of the length of the decompression output against the expected length is needed, I think. I expected the compressors (LZ4 and ZSTD) to throw an error when input errors are encountered, but perhaps that's not the case in the current setup; some more investigation is needed there...

Thanks for testing, I will make sure more checks are done on the consistency of the data blocks. Getting occasional errors because of write failures is one thing, but getting zeros in the result without warning is much more worrying!

ryankennedyio commented 5 years ago

Hi @MarcusKlik

Yes, this was showing up using default compression parameters.

Like you say, I was also expecting the compression libraries to be doing hash checks on the data as it was being written to disk, and throwing an error or warning if any issues were encountered. I took it for granted that the underlying I/O libraries would have been performing that.

We're now writing to local disk before performing a mv onto the network drive, so hopefully that workaround is fine for a while.
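That workaround can be wrapped in a small helper (hypothetical names; assumes the local temp directory and the network destination may be on different filesystems, where `file.rename()` can fail and a copy is the fallback):

```r
# Write locally first, then move the finished file to its destination
# (e.g. a network drive), so the slow/unstable target only sees one
# large sequential transfer instead of many small writes.
write_via_local <- function(df, dest, write_fun) {
  tmp <- tempfile(fileext = ".fst")
  write_fun(df, tmp)
  if (!file.rename(tmp, dest)) {   # rename fails across filesystems
    file.copy(tmp, dest, overwrite = TRUE)
    unlink(tmp)
  }
  invisible(dest)
}
```

For example, `write_via_local(x, "//server/share/data.fst", fst::write_fst)` would mirror the local-write-then-`mv` approach described above.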

MarcusKlik commented 5 years ago

Hi @ryankennedyio, thanks, yes, apparently we need more checks to ensure that data is decompressed correctly. A full hash of the data should also be an option, per request of the user (hash_data = TRUE). That would slightly lower performance, but add much to stability.

At the same time, it's still not clear why these write errors occur in some cases, and it would be very valuable to learn the underlying cause!

thanks

ryankennedyio commented 5 years ago

I'd consider that to be the role of the underlying I/O libraries, not this package (i.e. not your bug!). We haven't seen any issues since writing directly to local disk, but it hasn't been too long yet.

MarcusKlik commented 5 years ago

thanks, please let me know if you encounter similar issues on local writes. I'll make sure the additional checks are available with the next release of fst!

ryankennedyio commented 5 years ago

Hi @MarcusKlik

Just a notification that we've now noticed this when writing a ~30 GB file to local disk. This is the first time I've noticed it in 2 months, rather than it happening every 2 days when we were writing to the network.

But it does seem as though the problem is independent of the storage medium, and is something to do with how the underlying I/O library is handling errors to that storage.

MarcusKlik commented 5 years ago

Hi @ryankennedyio, thanks for reporting!

Yes, so on local disk this problem occurs with a much lower probability, but it can still happen. That's consistent with the idea that we have a failure in writing one or more bytes, as that should be much less probable for local writes.

No errors are detected during the write phase, because those would result in an error being thrown at the time of writing (see this code). So we can only detect the problem at the time of reading (by checking hashes), or at the time of writing by verification of the bytes on disk.

The latter is very slow, reducing the speed by a factor of 2 or more because of the large number of seek operations needed. But write_fst() could provide an argument (e.g. verify = TRUE) for critical writes that need verification.

For cases where it's enough to at least detect that there are incorrect elements in the byte stream, hashing the data blocks is a better solution I think. It's much faster and provides a lot of robustness to the fst file format.

And for long-term storage, you'd probably want to employ both.

I'll try to implement #49 in the next version!

ryankennedyio commented 5 years ago

Agree that hash values would be an ideal balance of performance/safety for this scenario. Thanks again for the great work!

pprado123 commented 2 years ago

hello,

This issue just happened to me -- the same problem when writing a 3 GB file to a network drive. Was this ever solved? I just had to go back and redo the whole file. Is there anything I should consider to prevent this issue in the future? Should I talk to our IT about inconsistent I/O calls?