Closed andrew-vancil closed 3 years ago
Thanks, Andrew. An additional note:
Using download.file
on Windows should default to method = "auto"
, which should result in the wininet
method being used. It's unexpected that the file contents or file header would be changed when using wininet
versus curl
to download an fst file, but this appears to be what is happening.
Hi @andrew-vancil and @cole-brokamp, thanks for reporting the issue and for providing the code snippet!
Indeed, the two download methods seem to yield different results:
url <- "https://geomarker.s3.us-east-2.amazonaws.com/testing_downloads/mtcars.fst"
path_curl <- "mtcars_curl.fst"
path_wininet <- "mtcars_wininet.fst"
# download with two distinct methods
download.file(url, path_curl, method = "curl")
download.file(url, path_wininet, method = "wininet")
# different file sizes
file.size(path_curl)
#> [1] 2619
file.size(path_wininet)
#> [1] 2643
Where, as demonstrated in your code, the curl
method works as expected, but wininet
downloads extra bytes.
If we take a look at the raw file stream we can see that extra bytes are added after byte 128:
# get raw file contents
raw_curl <- readBin(path_curl, "raw", file.size(path_curl) + 1)
raw_wininet <- readBin(path_wininet, "raw", file.size(path_wininet) + 1)
# first error at pos 129
min(which(raw_curl != raw_wininet[1:length(raw_curl)]))
#> [1] 129
# show
raw_curl[126:170]
#> [1] 00 00 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00 0a 00
#> [26] 09 00 09 00 09 00 09 00 09 00 09 00 09 00 09 00 09 00 09 00
raw_wininet[126:170]
#> [1] 00 00 00 0d 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 0d
#> [26] 0a 00 0d 0a 00 0d 0a 00 0d 0a 00 09 00 09 00 09 00 09 00 09
So, extra 0d
bytes are added to the stream. The same happens at position 265 and at a few more positions after that.
I'm not sure why these extra bytes are added, it must be a problem with the wininet
implementation. Does this problem only occur when the file is in a AWS S3 location?
thanks
Internally, fst
uses hashes on all the metadata in the file. Because the extra 0d
bytes where added to the stream, these hashes don't match the original hashes anymore, and fst
throws an error (as it should, as blindly using those bytes would most probably crash the session :-)).
I got a similar issue before. If guessing correctly, wininet treats 0a
as "new line" (see ASCII 0A). However, on windows, new line is \r\n
(place cursor at the beginning, and then new line), which is exactly 0d 0a
, but on POSIX (linux and osx), return key is just \n
. I think wininet was trying to change \n
to \r\n
here, which is stupid.
This could be a bug of R implementation? Maybe you can file a bug report to the R core team. (If you do, please CC me at zhengjia.wang AT rice.edu. I'm also interested in their replies, thanks :D
(Disclaimer: I'm not the author of this package, just accidentally saw the issue)
Thanks! This problem does persist with files hosted places other than AWS S3. The problem does appear to lie upstream with the wininet
library. We appreciate the help! @andrew-vancil -- let's close this issue since it isn't relevant here anymore.
In the future, we will avoid this problem, by forcing Windows users to use download.file with curl
.
Hello,
We (@cole-brokamp) are working to develop a package that involves downloading and reading an .fst file from an AWS S3 location. When testing on a Windows PC, we have found that read.fst() produces an error when the input is a file downloaded using download.file() and the method argument is not specified. However, when the method in download.file() is set to curl, read.fst() will read in the file with no issues. When testing with MacOS, there are no errors when curl is not specified. Is it possible to use download.file() and then read.fst() on Windows without specifying the method as curl? I am including a reprex for reference, with a publicly available dataset.
The error that is returned is as follows:
Thank you so much for your help!