KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0

Encoding issue in munge script #203

Closed NanisTe closed 6 years ago

NanisTe commented 7 years ago

Hi,

I have the following problem.

When I use "äöü" in my R scripts, they run perfectly fine in RStudio, but in the munge script, for example, ProjectTemplate does not accept them and throws an error. That makes it pretty much unusable for me. For compatibility and transferability reasons, I use UTF-8 for all data and R scripts.

    Autoloading data
    Loading data set: Data.DE
    Loading data set: Data.FR
    Loading data set: Data.IT
    Munging data
    Running preprocessing script: 01-Prepare Zusagen Data.R
    Error in source(file.path("munge", preprocessing.script)) :
      munge/01-Prepare Zusagen Data.R:15:35: unexpected input
    14: Anhang2=Anhang.2,
    15: Bestä
            ^

I have the same issue with the encoding of my data files, so I have to read them in again in the munge script.

Please tackle that issue with encoding.

Best regards Sinan

KentonWhite commented 7 years ago

@NanisTe This is useful to know. It is confusing that the script runs fine for you within R but fails when you run it through ProjectTemplate. ProjectTemplate does no processing beyond "sourcing" the munge script.

Can you help me figure out what is happening, please? After loading the data with load.project() and getting the error above, can you try running source('munge/01-Prepare Zusagen Data.R') and also source('munge/01-Prepare Zusagen Data.R', envir = ProjectTemplate:::.TargetEnv)?

Please let me know if either of these runs without an error. If they do produce an error, can you please post it?

A little background on what we are testing. ProjectTemplate uses its copy of the R_GlobalEnv to call source on your files. This should be the same as running source directly from R. If the first call above works and the second doesn't, that suggests ProjectTemplate is not using your R_GlobalEnv or is making some changes to it. If the first call also produces an error, then there is an issue with your script that is independent of ProjectTemplate, since that source() call does not use the ProjectTemplate framework.
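It may also be worth ruling out the encoding of the script file itself. As a minimal sketch (the temporary file and the variable name label are made up for illustration and are not from your project), source() accepts an encoding argument that tells R how to interpret the bytes of the file:

```r
# Hypothetical stand-in for a munge script saved as UTF-8.
script <- tempfile(fileext = ".R")
writeLines(enc2utf8('label <- "Best\u00e4tigung"'), script, useBytes = TRUE)

# Source it into a scratch environment, declaring the file's encoding
# explicitly instead of relying on the session's native encoding.
env <- new.env()
source(script, local = env, encoding = "UTF-8")

# Check whether the umlaut survived the round trip.
print(identical(env$label, "Best\u00e4tigung"))
```

If the script only parses when the encoding is given explicitly, that would point at a mismatch between the file's encoding and the session's native encoding rather than at the environment being used.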

NanisTe commented 7 years ago

I will try to test that. But I think I figured out what the problem was. It all started with the read statements in ProjectTemplate apparently not using the right encoding, so all the äöü characters and also the French characters from my csv or Excel files were parsed incorrectly.

So I tried to change the encoding within RStudio, since there was no option to change the encoding in ProjectTemplate. The problem occurred right after that. After setting the global encoding back from UTF-8 to the system default, the script worked fine again in ProjectTemplate. At the time I thought it would be best practice to also save the script files in an encoding that is easier to share. I don't know if this makes sense, or whether R and the operating systems can handle the encodings correctly when special characters are used in the scripts. Do you know of any best practice for this?

There is still the problem that my data gets read in with character errors, as in one of the other issues posted here. I am now using my own data read functions with UTF-8 encoding in the munge script to get the data right, but that is not a very satisfying workaround.

I would suggest allowing the user to define the encoding for each data file somehow, because the data loading process is central to the use case of the ProjectTemplate package.

Thanks for your hard work, by the way. These data loading and munging steps were always a pain for us. No one likes to have too many repetitive tasks.

KentonWhite commented 7 years ago

Ah! So the issue is preserving the encoding when reading from a data file. I run into this problem myself (quite a bit of French language processing). At this time it doesn't look like a problem that can be solved with a global configuration. What must happen is that the read.csv() command is passed fileEncoding = "latin1". If no encoding is passed, read.csv() will not pick up an encoding that you have set globally. Instead, it assumes the platform's native encoding, so a file saved in any other encoding is read incorrectly. Most .csv files are saved as UTF-8, which doesn't help when the native encoding is something else.
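To illustrate what passing the encoding looks like, here is a minimal sketch using a throwaway file. The file contents and the column names are invented for the example:

```r
# Create a small Latin-1 encoded CSV (hypothetical data, for illustration only).
csv_path <- tempfile(fileext = ".csv")
lines <- enc2utf8(c("ville,pays", "Besan\u00e7on,France", "Gen\u00e8ve,Suisse"))
writeLines(iconv(lines, from = "UTF-8", to = "latin1"), csv_path, useBytes = TRUE)

# Declare the file's encoding so read.csv() re-encodes it on the way in.
cities <- read.csv(csv_path, fileEncoding = "latin1", stringsAsFactors = FALSE)

print(cities$ville)
```

Without the fileEncoding argument, the same call would interpret the Latin-1 bytes in the native encoding, which is where the garbled characters come from.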

We've tried to tackle the encoding problem in previous threads. The issue is that some files we would want to read as UTF-8 and others as Latin1. We could set a global parameter for encoding, say encoding: latin1 in the global.dcf, but then files encoded with UTF-8 would be read incorrectly.

Ideas for handling encoding have ranged from trying to guess what the file encoding is (hard!); using custom suffixes to specify the encoding, like .csv.latin1, but this isn't portable; or providing an encoding manifest (file1.csv: UTF-8, file2.csv: latin1), which creates a bit of work for the user. None of these solutions fit the elegance of ProjectTemplate. Would love your opinion!
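To show why guessing is fragile, here is a rough sketch of one possible heuristic. The function guess_encoding() is hypothetical and not part of ProjectTemplate; it only checks whether the bytes happen to be valid UTF-8 and otherwise returns whatever fallback the caller supplies:

```r
# Heuristic guess: if every line is valid UTF-8, assume UTF-8; otherwise
# return a caller-supplied fallback. It cannot tell Latin-1 apart from
# other single-byte encodings, which is part of why guessing is hard.
guess_encoding <- function(path, fallback = "latin1") {
  lines <- readLines(path, warn = FALSE)
  if (all(validUTF8(lines))) "UTF-8" else fallback
}

# Two throwaway files with the same text in different encodings.
utf8_file <- tempfile(fileext = ".csv")
latin1_file <- tempfile(fileext = ".csv")
text <- enc2utf8(c("name", "Z\u00fcrich"))
writeLines(text, utf8_file, useBytes = TRUE)
writeLines(iconv(text, from = "UTF-8", to = "latin1"), latin1_file, useBytes = TRUE)

print(guess_encoding(utf8_file))
print(guess_encoding(latin1_file))
```

A file that fails the UTF-8 check could just as well be ISO 8859-7 or Windows-1252, so the fallback is only ever an assumption.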

For a short term solution, there are 2 ways around the encoding:

  1. Use a file format that stores the encoding. .csv doesn't store the encoding, but .xlsx does. You can write a short script that converts your latin1 csv files to xlsx files with the correct encoding. ProjectTemplate should then respect the encoding. If it doesn't, let me know! (I've run into trouble opening latin1 encoded csv files in Excel and saving them: Excel really screws up the latin1 encoding.)

  2. Create a latin1 folder and then a custom .R script in the data folder. This is what I end up doing. The R script loops through the latin1 folder and reads each file with read.csv(..., fileEncoding='latin1').
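A rough sketch of what the script in option 2 could look like. The folder location, the sample file exemple.csv and the latin1_data list are placeholders for illustration; a temporary directory stands in for the real project layout:

```r
# Hypothetical layout: a "latin1" folder holding Latin-1 encoded CSV files.
# A temporary directory stands in for the project's folder here.
latin1_dir <- file.path(tempdir(), "latin1")
dir.create(latin1_dir, showWarnings = FALSE)
sample_lines <- enc2utf8(c("mot", "\u00e9t\u00e9"))
writeLines(iconv(sample_lines, from = "UTF-8", to = "latin1"),
           file.path(latin1_dir, "exemple.csv"), useBytes = TRUE)

# Read every CSV in the folder as Latin-1 and keep the results in a list
# named after the files (without their extension).
csv_files <- list.files(latin1_dir, pattern = "\\.csv$", full.names = TRUE)
latin1_data <- lapply(csv_files, read.csv,
                      fileEncoding = "latin1", stringsAsFactors = FALSE)
names(latin1_data) <- tools::file_path_sans_ext(basename(csv_files))

print(names(latin1_data))
```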

Now that I think about it, maybe creating a latin1 data folder might be the solution. What are your thoughts on this?

Hugovdberg commented 7 years ago

I don't think a latin1 directory is a very clean solution. Before we know it we will have UTF-8, UTF-16, ISO 8859-7, etc. I think the encoding flag in global.dcf is the best way to go, much like what was proposed in issue #189, provided we allow an override through a .file file with specific settings for a single file. This might create some more work for a user with files in 20 different encodings, but those people have a lot of work anyway ;-) If you can specify the encoding for the majority of files and override the exceptions, it's the cleanest solution in my opinion.
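To make the idea concrete, here is a minimal sketch of a default-plus-override lookup. The names default_encoding, encoding_overrides and resolve_encoding are hypothetical; none of them is an existing ProjectTemplate setting or function, and the file names are only examples:

```r
# Hypothetical settings: one project-wide default plus per-file exceptions.
# Neither name is an existing ProjectTemplate option.
default_encoding <- "UTF-8"
encoding_overrides <- c("file2.csv" = "latin1")

# Use the override for a file if one exists, otherwise the default.
resolve_encoding <- function(path, default = default_encoding,
                             overrides = encoding_overrides) {
  key <- basename(path)
  if (key %in% names(overrides)) overrides[[key]] else default
}

print(resolve_encoding("data/file1.csv"))  # "UTF-8"
print(resolve_encoding("data/file2.csv"))  # "latin1"
```

The resolved value could then be handed to read.csv() as fileEncoding, so only the exceptions need to be listed.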

KentonWhite commented 6 years ago

This looks like it was fixed by #187. Closing. If this wasn't fixed there, please re-open!

anu87 commented 3 years ago

I still have this issue. One of my munge scripts contains a UTF-8 character, and load.project() fails.

KentonWhite commented 3 years ago

@anu87 Can you please post the error you are getting and the munge file that is causing the error? Is it that the munge file is not being read properly, or that data from a previously loaded csv file is not read properly?