Unusual characters created when reading Excel files

jbeaulie commented 10 years ago

I have noticed that some taxa names contain unusual characters that don't appear in the original records. Below are a few examples where R has written in an inverted breve above a capital A where no accent marks appear in the original records. There are also examples where accented letters (e, u) are read in strangely. Could you review the list of unique(algae$taxa) and investigate instances of strange accents or symbols?

This isn't a huge issue, but it complicates the phycologists job of tracking down the proper taxonomic classification for all these different taxa.

mjpdenver commented 10 years ago

I will continue to investigate this. I think the problem is driven by the accent marks and foreign punctuation, complicated by the fact that sometimes they get interpreted correctly by the program - like Excel - on other times they cause issues.

I might just do a brute force substitution - because ultimately - I don't think there are a ton of them.

Date: Mon, 12 May 2014 05:43:31 -0700 From: notifications@github.com To: Phytoplankton-Data-Analysis@noreply.github.com Subject: [Phytoplankton-Data-Analysis] Unusual characters created when reading Excel files (#34)

I have noticed that some taxa names contain unusual characters that don't appear in the original records. Below are a few examples where R has written in an inverted breve above a capital A where no accent marks appear in the original records. There are also examples where accented letters (e, u) are read in strangely. Could you review the list of unique(algae$taxa) and investigate instances of strange accents or symbols?

This isn't a huge issue, but it complicates the phycologists job of tracking down the proper taxonomic classification for all these different taxa.

— Reply to this email directly or view it on GitHub.

jbeaulie commented 10 years ago

I ran into this on another project. Using read.table() to pull in a .txt file seemed to handle accent marks and foreign punctuation better than using read.xls() from the gdata package to pull in an Excel file. In our project this isn't a big enough issue to merit switching over to read.table().

mjpdenver commented 10 years ago

Oddly enough, on another project yesterday I found this problem addressed by specifying encoding = "latin1" in read.table. Aside from the fact that it worked, I don't exactly know why.

Date: Tue, 13 May 2014 11:32:46 -0700 From: notifications@github.com To: Phytoplankton-Data-Analysis@noreply.github.com CC: Matt_Pocernich@hotmail.com Subject: Re: [Phytoplankton-Data-Analysis] Unusual characters created when reading Excel files (#34)

I ran into this on another project. Using read.table() to pull in a .txt file seemed to handle accent marks and foreign punctuation better than using read.xls() from the gdata package to pull in an Excel file. In our project this isn't a big enough issue to merit switching over to read.table().

— Reply to this email directly or view it on GitHub.

mjpdenver commented 10 years ago

Oddly enough the command unique(algae$taxa) - visually at least - returns correct text with accent marks.

[21] "Achnanthidium minutissimum (Kütz.) Czarn."

When I write.table or write.csv, and open the document in Excel - I get a bit of gibberish. I tried playing with the encoding option- but that did not seem to help.

was able to use XLConnect to write and excel xlsx file which read correctly. Are you able to use this package?

I committed a file in output/taxa.xlsx. Can you tell me if this is sufficient to share with the micro biologist.

I would like to get a real answer to this question. Perhaps Jeff would know.

jbeaulie commented 10 years ago

output/taxa.xlsx doesn't exist. Perhaps you forgot to push it into the repository. It may not have been that helpful anyway. Turns out XLConnect has conflict with another program installed on EPA computers.

jbeaulie commented 10 years ago

Strange, I read in the file using: algae <- read.delim("processed_data/cleaned_algae_20140509.txt", as.is=TRUE, header = TRUE)

Then looked at taxa names using: unique(algae$taxa)[order(unique(algae$taxa))]

and got

[20] "Achnanthidium minutissimum (KÃ¼tz.) Czarn"
[21] "Achnanthidium minutissimumÂ (KÃ¼tz.) Czarn."

What are we doing differently?

mjpdenver commented 10 years ago

Likely I forgot to push it to github. I added it to processed_data/taxa.xlsx

mjpdenver commented 10 years ago

Can you type these commands in R? By default, I believe the encoding is OS specific and this might be a Mac vs Windows issue.

options()$encoding [1] "native.enc" Sys.getlocale("LC_CTYPE") [1] "en_US.UTF-8"

Date: Wed, 4 Jun 2014 12:42:37 -0700 From: notifications@github.com To: Phytoplankton-Data-Analysis@noreply.github.com CC: Matt_Pocernich@hotmail.com Subject: Re: [Phytoplankton-Data-Analysis] Unusual characters created when reading Excel files (#34)

Strange, I read in the file using:

algae <- read.delim("processed_data/cleaned_algae_20140509.txt", as.is=TRUE, header = TRUE)

Then looked at taxa names using:

unique(algae$taxa)[order(unique(algae$taxa))]

and got

[20] "Achnanthidium minutissimum (KÃ¼tz.) Czarn"

[21] "Achnanthidium minutissimumÂ (KÃ¼tz.) Czarn."

What are we doing differently?

— Reply to this email directly or view it on GitHub.

jbeaulie commented 10 years ago

Here is my result:

Sys.getlocale("LC_CTYPE") [1] "English_United States.1252"

I tried changing the setting to match your computer:

Sys.setlocale(category = "LC_CTYPE", locale = "en_US.UTF-8") [1] "" Warning message: In Sys.setlocale(category = "LC_CTYPE", locale = "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored

I suppose I don't have "en_US.UTF-8" on my computer. Could you try switching yours to "English_United States.1252"?

jbeaulie commented 10 years ago

During 7/16/14 phone call we decided to address before sending the taxa list to Lisa. If a simple solution cannot be found, Matt will implement a brute force approach.

willbarnett commented 9 years ago

Catching up on some items here that haven't been documented.

The list of unique algae taxa names in February 2015 came from reading in 'processed_data/cleaned_algae_20150202.csv', grabbing the unique taxa names, and exporting a .csv file with those 1834 results. Although no special characters appears in this list of taxa names when viewing the strings in R, the process of writing this file introduced at least one special character (a capital A with a chevron hat on top) -- which can be seen in several records.

This output file is 'processed_data/unique_algae_taxa_names_20150202.csv' .

willbarnett commented 9 years ago

Recording an e-mail exchange in May and June 2015 about the unique taxa name issue:

(Will): Hey Jake,

I committed two 'new' algae files to github this morning. Several strange things are happening when these files get written. Here's my best understanding right now:

The umlauts and all the accents on lowercase letters that can be seen in R should be deleted, and should have been for any cleaned files produced since January 2015.
Writing .csv and .txt files with the algae data does cause different special characters to be created, which are not able to be seen in R when viewing the character strings (e.g., capital A with accent over it). I have tried a few different encodings, without success.
Writing the cleaned algae file with write.table() using a tab delimiter gives a different number of rows than using write.csv() with a comma separator. This is particularly baffling, as both of these file types are supposed to be platform-independent and extremely common.
Writing the cleaned algae file as a .xlsx using XLConnect fixes the row count issue and the newly-introduced-special-character issue. So I think this is the best path forward, albeit one with it's own problems. XLConnect can be finicky about the size of files you can read/write (Java memory errors).

(Jake): Hi Will,

Unfortunately, EPA computers don’t support XLConnect. It is a java issue that no one has been able to resolve. Could you try writing the excel file with the gdata package? It uses strawberry perl and works well on EPA systems.

Writing to Excel does seem to have fixed the issue. Oddly, our algae file doesn’t completely match up with BSA’s list. The algae file contains 5 taxa not in BSA’s Original.Taxa.Name column. Any ideas on how that could have occurred? Fortunately, it is pretty easy to match them to the appropriate taxonomic information. The case of “Crucigenia quadrata C. Morren” is a bit puzzling, however. As best I can tell, it is included in our dataframe AND the BSA list, but R doesn’t agree. There must be some subtle difference between the entries that I can’t identify. Could you take a look? The relevant code can be found in lines 143-158 of PreviewProcessedDataJake.R.

Also, I would like to capture these exchanges on the established github issue for this problem. Could you update the issue, including the e-mails we exchanged when you were preparing the list of taxa names to send to BSA? Lets continue the conversation by posting new comments to the github issue.

willbarnett commented 9 years ago

The previous post has a couple important points worth addressing.

1) To the extent there are taxa names not included in the February 2015 list of unique names, this is puzzling. It seems that write.csv() is doing weird things, so perhaps special characters are the problem here. I'm finding that even a space in R might be an encoded space, and two seemingly identical character strings might not be identical.

2) I forgot about the non-compatibility between XLConnect and EPA computers. It doesn't look like gdata has a function to write Excel files, only read them. Am I mistaken? There is also the xlsx package -- is that a tenable solution? I don't know if it depends on Java like XLConnect; it very well might.

jbeaulie commented 9 years ago

1) I think the February 2015 list of unique names has a few formatting bugs. For example, when read into R the list contains duplicate taxa names. Obviously R felt these names were distinct when you created the list. This caused headaches when I merged BSA's product with our cleaned algae file. I'm not sure why the February 2015 list is missing taxa; the issue is likely related to subtle formatting differences. It was easy to match the missing taxa to BSA's list, so I suppose we can overlook the issue. Hopefully working with the Excel version of the cleaned algae will take care of the issue for good.

2) You are right, gdata doesn't write Excel files. xlsx is also java dependent, so that won't work. We have tried about every other option, so I think we are stuck with XLConnect. If I really need to recreate the file at some future date, I wonder if I could work around the problem by writing a .csv, opening the .csv in Excel, then saving the file as a .xls?

3) Do you know if the Excel file contains any of the problematic formats, either when viewed in Excel or in R?

USEPA / Phytoplankton-Data-Analysis

Unusual characters created when reading Excel files #34