fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0
619 stars 42 forks source link

Read Error on Downloaded File Using Windows #253

Closed andrew-vancil closed 3 years ago

andrew-vancil commented 3 years ago

Hello,

We (@cole-brokamp) are working to develop a package that involves downloading and reading an .fst file from an AWS S3 location. When testing on a Windows PC, we have found that read.fst() produces an error when the input is a file downloaded using download.file() and the method argument is not specified. However, when the method in download.file() is set to curl, read.fst() will read in the file with no issues. When testing with MacOS, there are no errors when curl is not specified. Is it possible to use download.file() and then read.fst() on Windows without specifying the method as curl? I am including a reprex for reference, with a publicly available dataset.

url<-"https://geomarker.s3.us-east-2.amazonaws.com/testing_downloads/mtcars.fst"
destfile<-"mtcars.fst"
download.file(url, destfile, method = "curl")
works<-fst::read.fst("mtcars.fst")

#When "curl" is not specified, read.fst produces an error
download.file(url, destfile)
doesntwork<-fst::read.fst("mtcars.fst")

The error that is returned is as follows:

> doesntwork<-fst::read.fst("mtcars.fst")
Error in read_fst(path, columns, from, to, as.data.table, old_format) : 
  It seems the file header was damaged or incomplete

Thank you so much for your help!

cole-brokamp commented 3 years ago

Thanks, Andrew. An additional note:

Using download.file on Windows should default to method = "auto", which should result in the wininet method being used. It's unexpected that the file contents or file header would be changed when using wininet versus curl to download an fst file, but this appears to be what is happening.

MarcusKlik commented 3 years ago

Hi @andrew-vancil and @cole-brokamp, thanks for reporting the issue and for providing the code snippet!

Indeed, the two download methods seem to yield different results:

url <- "https://geomarker.s3.us-east-2.amazonaws.com/testing_downloads/mtcars.fst"

path_curl <- "mtcars_curl.fst"
path_wininet <- "mtcars_wininet.fst"

# download with two distinct methods
download.file(url, path_curl,    method = "curl")
download.file(url, path_wininet, method = "wininet")

# different file sizes
file.size(path_curl)
#> [1] 2619
file.size(path_wininet)
#> [1] 2643

Where, as demonstrated in your code, the curl method works as expected, but wininet downloads extra bytes. If we take a look at the raw file stream we can see that extra bytes are added after byte 128:

# get raw file contents
raw_curl <- readBin(path_curl, "raw", file.size(path_curl) + 1)
raw_wininet <- readBin(path_wininet, "raw", file.size(path_wininet) + 1)

# first error at pos 129
min(which(raw_curl != raw_wininet[1:length(raw_curl)]))
#> [1] 129

# show
raw_curl[126:170]
#>  [1] 00 00 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00
#> [26] 09 00 09 00 09 00 09 00 09 00 09 00 09 00 09 00 09 00 09 00
raw_wininet[126:170]
#>  [1] 00 00 00 0d 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 0d
#> [26] 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 09 00 09 00 09 00 09 00 09

So, extra 0d bytes are added to the stream. The same happens at position 265 and at a few more positions after that.

I'm not sure why these extra bytes are added, it must be a problem with the wininet implementation. Does this problem only occur when the file is in a AWS S3 location?

thanks

MarcusKlik commented 3 years ago

Internally, fst uses hashes on all the metadata in the file. Because the extra 0d bytes where added to the stream, these hashes don't match the original hashes anymore, and fst throws an error (as it should, as blindly using those bytes would most probably crash the session :-)).

dipterix commented 3 years ago

I got a similar issue before. If guessing correctly, wininet treats 0a as "new line" (see ASCII 0A). However, on windows, new line is \r\n (place cursor at the beginning, and then new line), which is exactly 0d 0a, but on POSIX (linux and osx), return key is just \n. I think wininet was trying to change \n to \r\n here, which is stupid.

This could be a bug of R implementation? Maybe you can file a bug report to the R core team. (If you do, please CC me at zhengjia.wang AT rice.edu. I'm also interested in their replies, thanks :D

(Disclaimer: I'm not the author of this package, just accidentally saw the issue)

cole-brokamp commented 3 years ago

Thanks! This problem does persist with files hosted places other than AWS S3. The problem does appear to lie upstream with the wininet library. We appreciate the help! @andrew-vancil -- let's close this issue since it isn't relevant here anymore.

In the future, we will avoid this problem, by forcing Windows users to use download.file with curl.