StefanoAllesina / BSD-QBio4

Fourth BSD Quantitative Biology Bootcamp @ MBL
GNU General Public License v3.0

File encoding doesn't work for some computers #5

Open grace-hansen opened 6 years ago

grace-hansen commented 6 years ago

Describe the bug: On computers with a non-English keyboard setting, e.g. for Mandarin-speaking users, files will be encoded with ASCII, which throws errors in some R commands unless the encoding is stated explicitly.

To Reproduce:

    papers <- read.csv("~/BSD-QBio4/tutorials/basic_computing_2/data/citations/nature_neuroscience.csv", stringsAsFactors = FALSE)
    papers$TitleLength <- nchar(papers$Title)

    Error at [something]: invalid multibyte string at [something]

FIX: read the CSV with an explicit file encoding, e.g.

    papers <- read.csv("~/BSD-QBio4/tutorials/basic_computing_2/data/citations/nature_neuroscience.csv", stringsAsFactors = FALSE, fileEncoding = 'ASCII')
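For anyone who wants to try the explicit-encoding idea without the course data, here is a minimal sketch on a throwaway file. The file contents and column names are made up for illustration, and the sketch assumes the input is UTF-8; the right encoding name depends on how the real file was saved. It uses the `encoding` argument of `read.csv`, which marks strings as UTF-8, while `fileEncoding` re-encodes them on input.

```r
# Hypothetical stand-in for the citations CSV: two rows, one non-ASCII title.
lines <- c("Title,Year",
           "Neural coding,2010",
           "Caf\u00e9 networks,2012")
tmp <- tempfile(fileext = ".csv")

# enc2utf8() + useBytes = TRUE writes UTF-8 bytes as-is,
# whatever the session locale happens to be.
writeLines(enc2utf8(lines), tmp, useBytes = TRUE)

# State the encoding explicitly instead of relying on the locale default.
# `encoding` marks the strings as UTF-8; `fileEncoding = "UTF-8"` would
# instead re-encode them to the native encoding while reading.
papers <- read.csv(tmp, stringsAsFactors = FALSE, encoding = "UTF-8")

# nchar() counts characters, not bytes, once the encoding is known.
papers$TitleLength <- nchar(papers$Title)
print(papers)

unlink(tmp)
```

If the encoding is left unstated, the bytes are interpreted in the session's native encoding, which is presumably where the "invalid multibyte string" error comes from on the affected machines.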

pcarbo commented 6 years ago

@gracilis Thanks for sharing. I see that the file is encoded in UTF-8 because it has some non-ASCII characters. Perhaps in the future it would be simpler to remove (or replace) the non-ASCII characters from this file so that it is less likely to cause issues.
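One possible way to find and replace the non-ASCII characters is sketched below. The titles are invented examples, and `is_ascii` is a helper defined here for illustration, not a base R function; `iconv` with `sub` is the base R call doing the replacement.

```r
# Hypothetical titles; the \u escapes stand in for non-ASCII characters in the CSV.
titles <- enc2utf8(c("Neural coding", "Caf\u00e9 networks", "Na\u00efve Bayes"))

# Illustrative helper: a string is pure ASCII when no byte exceeds 127.
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

# Flag the entries that would need attention.
has_non_ascii <- !is_ascii(titles)

# Replace bytes that cannot be represented in ASCII with "?".
cleaned <- iconv(titles, from = "UTF-8", to = "ASCII", sub = "?")

print(data.frame(original = titles, flagged = has_non_ascii, cleaned = cleaned))
```

Because the substitution works on bytes, a single accented character may turn into more than one "?". Replacing characters by hand (for example "é" with "e") would give a nicer result for a small file.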

Can you please send us output from running sessionInfo()? In particular, I would be interested to find out what your locale setting is.
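If the full sessionInfo() output is hard to collect, a shorter check of just the locale might be enough. This is only a sketch using base R calls; the variable names are arbitrary.

```r
# Character-type locale of the current session (the setting that decides
# how bytes in text files are interpreted by default).
ctype <- Sys.getlocale("LC_CTYPE")

# Localization details: whether the locale is multibyte and whether it is UTF-8.
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("Multibyte locale:", info[["MBCS"]], "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")
```

A machine reporting a non-UTF-8 multibyte locale would be a likely candidate for the error described above.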

grace-hansen commented 6 years ago

Hi Peter,

Below is my sessionInfo when running the code, but the issue only occurs on a subset of computers whose defaults are set to ASCII to easily type in Mandarin (I think). Should I have one of those students send the sessionInfo for comparison?

sessionInfo()

    R version 3.4.4 (2018-03-15)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.5 LTS

    Matrix products: default
    BLAS: /usr/lib/libblas/libblas.so.3.6.0
    LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] cowplot_0.9.3 readr_1.1.1   ggplot2_3.0.0

    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.18     digest_0.6.15    withr_2.1.2      dplyr_0.7.4      assertthat_0.2.0 grid_3.4.4       plyr_1.8.4
     [8] R6_2.2.2         gtable_0.2.0     magrittr_1.5     scales_0.5.0     pillar_1.2.2     rlang_0.2.0      lazyeval_0.2.1
    [15] bindrcpp_0.2.2   labeling_0.3     tools_3.4.4      glue_1.2.0       munsell_0.4.3    hms_0.4.2        compiler_3.4.4
    [22] pkgconfig_2.0.1  colorspace_1.3-2 bindr_0.1.1      knitr_1.20       tibble_1.4.2

pcarbo commented 6 years ago

@gracilis I will need the sessionInfo() output from one of the students experiencing the error.