Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fread has no option to handle unzipping problems with zip files that have ill-formatted names on Windows #5237

Open fabiocs8 opened 3 years ago

fabiocs8 commented 3 years ago

As per my post in SO, fread cannot import and unzip the following URL:

dt <- fread("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001")

The workaround was to download the URL with mode = "wb" and then unzip the file:

download.file("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001", destfile = "test_file.zip", mode = "wb")

unzip("test_file.zip", exdir = ".")

It would be nice if fread provided an option to deal with cases like this.
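
A minimal sketch of how the workaround above could be wrapped into a reusable helper. The names fread_zip and fread_zip_url are hypothetical and not part of data.table; the sketch assumes the archive holds at least one delimited text file and that reading the first member is acceptable.

```r
library(data.table)

# Hypothetical helper (not part of data.table): extract one member of a
# local zip archive into a temporary directory and read it with fread().
fread_zip <- function(zip_path, member = NULL, ...) {
  listing <- unzip(zip_path, list = TRUE)  # data.frame with a Name column
  if (is.null(member)) member <- listing$Name[1L]  # assumption: first member
  out_dir <- tempfile("unzipped_")
  dir.create(out_dir)
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)
  extracted <- unzip(zip_path, files = member, exdir = out_dir)
  fread(extracted[1L], ...)
}

# Hypothetical helper: download the archive in binary mode first, so that
# Windows does not rewrite line-ending bytes, then delegate to fread_zip().
fread_zip_url <- function(url, member = NULL, ...) {
  zip_path <- tempfile(fileext = ".zip")
  on.exit(unlink(zip_path), add = TRUE)
  download.file(url, destfile = zip_path, mode = "wb", quiet = TRUE)
  fread_zip(zip_path, member = member, ...)
}
```

Usage would look like fread_zip_url("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001"). Any extra arguments, such as sep or encoding, are passed through to fread.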

ben-schwen commented 3 years ago

There are several issues going on here.

  1. curl is not able to download "https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001"; switching from HTTPS to HTTP solves this, and I would rather see it as an issue with curl.

Once the download problem is solved by switching to HTTP with fread("http://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001"), two more issues pop up:

  1. You expect fread to detect the file type automatically even without a file extension, but that is not something fread does.

  2. Your file is a .zip, which is not supported by fread yet; see also #3834.
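
For illustration only, a file type can be guessed without relying on the file extension by looking at its first bytes. The sketch below uses a hypothetical helper, is_zip_file, which is not part of data.table. It checks only for the common ZIP local file header signature (bytes 50 4B 03 04), so it would miss, for example, an empty archive.

```r
# Hypothetical helper: TRUE when the file starts with the ZIP local file
# header signature "PK\003\004" (bytes 50 4B 03 04), whatever its name is.
is_zip_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 4L)  # may return fewer than 4 bytes
  identical(magic, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}
```

A caller could use such a check to decide whether a downloaded temporary file should be unzipped before it is parsed.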

fabiocs8 commented 3 years ago

Thank you Ben.

When I run fread with verbose = TRUE (output in the SO post linked above), I understand that fread does download the file with no problem. However, the problem happens when decompressing it: because Windows does not treat the download as a binary file unless told to, it changes '\n' line endings to '\r\n' (aka 'CRLF'), which corrupts the archive; see the excellent answer provided by r2evans on SO. Using download.file(..., mode = "wb") is enough to solve this issue, and unzip then works properly.
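
To make the corruption concrete, here is a small simulation of what a text-mode write would do to binary content, assuming every LF byte is expanded to CR LF. The function lf_to_crlf and the sample bytes are made up for illustration.

```r
# Simulate a text-mode write: every LF byte (0x0a) becomes CR LF (0x0d 0x0a).
lf_to_crlf <- function(bytes) {
  stopifnot(is.raw(bytes))
  expanded <- lapply(as.integer(bytes), function(b) {
    if (b == 10L) c(13L, 10L) else b
  })
  as.raw(unlist(expanded))
}

# Made-up payload: a ZIP signature followed by a byte that happens to be 0x0a.
payload <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x0a, 0xff))
mangled <- lf_to_crlf(payload)

length(payload)  # number of bytes before the simulated translation
length(mangled)  # one byte longer: an extra 0x0d was inserted
```

Any 0x0a byte inside compressed data would be altered this way, so the sizes and offsets recorded in the archive no longer match its contents.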

Amazingly, fread's code at line 87 already calls curl with the option mode = "wb": curl::curl_download(input, tmpFile, mode = "wb", quiet = !showProgress)

So it seems that this mode option has no effect here....

Regards, Fabio.