Closed jbeaulie closed 10 years ago
Hi Jake,
I will follow up shortly with an algae.csv and a water_quality.csv file for you to look at. The files in step1/ are not fully formatted.
Thanks,
Matt
Date: Tue, 18 Feb 2014 07:44:23 -0800 From: notifications@github.com To: Phytoplankton-Data-Analysis@noreply.github.com Subject: [Phytoplankton-Data-Analysis] Preview processed data (#12)
I looked over the files in 'output/step1'. It contains two water chemistry files in long format (batch0, batch2), two files containing recent phytoplankton data (batch6, batch7), and one file that only contains identifier information (batch4). The only measurement in batch3 is chlorophyll and all values are reported as NA. Are these the files you wanted me to review?
The algae data file looks great. I'm really looking forward to seeing the remaining observations.
I'm having trouble reading the water quality file. R reports "line 1065 did not have 26 elements" and an inspection of the .csv file in Excel revealed that the columns shift at line 1065. Could you take a look?
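One way to locate shifted rows like this without opening the file in Excel is `count.fields()` from base R, which reports the number of fields found on each line. The following is only a minimal sketch: the demo file is a made-up stand-in for the water quality file, and a comma separator is assumed.

```r
# Hypothetical stand-in for the water quality file: line 3 has an extra field.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("lake,station,value",
             "EFR,20001,1.5",
             "CRR,20198,2.5,extra",
             "GRR,20014,3.5"),
           demo_path)

# Number of fields found on every line of the file.
n_fields <- count.fields(demo_path, sep = ",", quote = "\"")

# Treat the header's field count as the expected width.
expected <- n_fields[1]

# Line numbers whose field count differs from the header.
bad_lines <- which(n_fields != expected)
print(bad_lines)
```

Note that `count.fields()` skips blank lines by default, so the reported positions can drift from the true line numbers if the file contains empty lines.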
Water chem data
cleaned_algae_20140318.csv
There appears to be a variety of different formats used in this file. The first 158,392 observations in this file appear to be space separated. For example:
"2EFR20001199905109999005" "EFR" "20001" 5 "19990510" "Anabaena #1" 36086.4265927978 77116693.6288089 NA "FALSE" 874 NA 62565 67585
However, a bunch of other observations look to be comma separated, or perhaps a combination of tab and comma separated. For example:
\"2CRR20198199505039999000\",\"CRR\",\"20198\",\"000\",\"19950503\",\"Chrysophyte Flag #2\",6222.65625,746718.75,NA,FALSE,633
Could you please format all of the observations consistently?
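To see how many lines use each layout, one option is to count candidate separators per line after stripping quoted strings. This is a rough sketch, not a description of how the file was actually produced: the two input lines are shortened, made-up imitations of the examples quoted in this thread, and `count_char` is a helper defined here for illustration.

```r
# Two made-up lines imitating the two layouts quoted in this thread.
lines <- c(
  '"2EFR20001199905109999005" "EFR" "20001" 5 "19990510" "Anabaena #1" 36086.4 NA',
  '"2CRR20198199505039999000","CRR","20198","000","19950503","Chrysophyte Flag #2",6222.65625,NA'
)

# Remove quoted strings first so separators inside taxa names are not counted.
unquoted <- gsub('"[^"]*"', "", lines)

# Helper (defined here): number of occurrences of a literal character per line.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

# Count each candidate separator on every line.
counts <- cbind(comma = count_char(unquoted, ","),
                space = count_char(unquoted, " "),
                tab   = count_char(unquoted, "\t"))

# Guess the delimiter as the most frequent candidate on each line.
guess <- colnames(counts)[max.col(counts, ties.method = "first")]
print(table(guess))
```

In a real file, `lines` would come from `readLines()`, and `which(guess != "comma")` would give the line numbers that still need converting.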
An escape character "\" in one of the file names caused errors in the data frame. I opened and closed the original processed file in Excel, which seemed to address the "\" issue.
A new file named cleaned_algae_20140324.csv is committed to the repo.
Date: Mon, 24 Mar 2014 13:02:34 -0700 From: notifications@github.com To: Phytoplankton-Data-Analysis@noreply.github.com CC: Matt_Pocernich@hotmail.com Subject: Re: [Phytoplankton-Data-Analysis] Preview processed data (#12)
Misc It appears that the header is repeated once in the dataframe.
date Some dates appear to be formatted improperly. For example, '8242011' should probably read '20110824'. Other dates are more difficult to decipher: '9040628'.
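A strict parse can separate the valid dates from the ones that need a decision. The sketch below makes two assumptions that are not confirmed anywhere in this thread: valid dates are eight-digit yyyymmdd strings, and seven-digit values might be mdyyyy with the leading zero dropped. The plausible sampling window is also an assumption, taken loosely from the date range discussed in this issue.

```r
# One valid yyyymmdd value plus the two problem values quoted above.
raw_date <- c("19990510", "8242011", "9040628")

# Strict parse: only 8-digit strings are attempted as yyyymmdd.
is_8_digit <- grepl("^[0-9]{8}$", raw_date)
parsed <- rep(as.Date(NA), length(raw_date))
parsed[is_8_digit] <- as.Date(raw_date[is_8_digit], format = "%Y%m%d")

# Candidate repair (assumption): 7-digit strings are mdyyyy missing a leading zero.
is_7_digit <- grepl("^[0-9]{7}$", raw_date)
candidate <- rep(as.Date(NA), length(raw_date))
candidate[is_7_digit] <- as.Date(paste0("0", raw_date[is_7_digit]),
                                 format = "%m%d%Y")

# Keep a candidate only if it falls inside an assumed plausible sampling window.
plausible <- !is.na(candidate) &
  candidate >= as.Date("1973-01-01") &
  candidate <= as.Date("2014-12-31")
candidate[!plausible] <- NA

print(data.frame(raw_date, parsed, candidate))
```

Values that end up `NA` in both columns, such as '9040628' here, are the ones that cannot be recovered mechanically and need a manual decision.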
lake I was surprised to see lakes from District 3 in the data set. 'grr' should read 'GRR'.
taxa *I found a few instances where it appears the data weren't imported properly. Possibly related to the escape characters referenced above? Below are examples pulled from:
unique(algae$taxa)
[1258] "Chrysophyta \A'\",14795.66564,63917275.54,NA,FALSE\""
[1259] "Chrysophyta \A'\",6733.695652,29089565.22,NA,FALSE\""
[1269] "Chrysophyta \A'\",2074.21875,8960625,NA,FALSE\""
[1271] "Chrysophyta \A'\",15996.04743,69102924.9,NA,FALSE\""
[1274] "Chrysophyta \A'\",3097.5,13381200,NA,FALSE\""
[1275] "Chrysophyta \A'\",2358.915441,10190514.71,NA,FALSE\""
class *should be entirely populated with NA. It currently contains some numbers:
unique(algae$class)
[1] NA "class" "25853" "31447" "35217" "48740" "48768" "53121" "53292" "107348" "107403" "113736"
hab *Should be TRUE or FALSE. Currently contains other values:
unique(algae$hab) # Some numbers in "hab" field
[1] NA "FALSE" "TRUE" "hab" "26773" "32825" "37086" "51466" "51494" "55967" "56157" "114223"
[13] "114295" "120894"
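The checks on 'class' and 'hab' can be turned into row filters so the offending records can be pulled out and traced back to their source files. This is a minimal sketch on a made-up data frame that imitates the problems listed above. The column names follow the ones used in this thread, but the values are illustrative only.

```r
# Made-up data frame imitating the problems listed above.
algae <- data.frame(
  lake  = c("EFR", "lake", "grr", "CRR"),
  class = c(NA, "class", "25853", NA),
  hab   = c("FALSE", "hab", "26773", "TRUE"),
  stringsAsFactors = FALSE
)

# Rows where a column holds its own name are repeated header rows.
is_header <- !is.na(algae$hab) & algae$hab == "hab"

# Rows whose hab value is neither NA, "TRUE", nor "FALSE".
bad_hab <- !is.na(algae$hab) & !(algae$hab %in% c("TRUE", "FALSE"))

# Rows where class is populated even though it should be NA throughout.
bad_class <- !is.na(algae$class)

# Rows that are suspect for a reason other than being a repeated header.
suspect <- algae[(bad_hab | bad_class) & !is_header, ]
print(suspect)
```

Separating the repeated header from the other suspect rows keeps the two problems apart: the header row can simply be dropped, while the numeric values in 'class' and 'hab' probably point to shifted columns.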
raw data I removed the questionable dates (see above) and inspected the data. A few things jumped out at me.
MSR is the only lake with data for 2009; however, Jade sent us the complete 2009 data set. I believe you put it in 'originalData/algae/Jade'.
The earliest observation in the data set is from 1988; however, the file 'All Lakes Phyto Data pre-1993.xlsx' contains observations dating back to 1973. Were the data from this file imported? This file appears in multiple places including 'Drew data/b'.
The latest observation in the data set is 2012-11-06; however, we have HAB data from EFR for numerous dates in 2013. The most recent data should be from 2014-01-14.
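Coverage gaps like these are easy to see in a lake-by-year table. The sketch below uses a made-up data frame. The lake codes and a few of the dates echo values mentioned in this thread, but the rows themselves are illustrative and the column names are assumptions.

```r
# Made-up observations; real data would come from the cleaned algae file.
algae <- data.frame(
  lake = c("MSR", "EFR", "EFR", "CRR"),
  date = as.Date(c("2009-07-01", "1988-06-15", "2012-11-06", "1995-05-03")),
  stringsAsFactors = FALSE
)

# Sampling year extracted from the date.
algae$year <- as.integer(format(algae$date, "%Y"))

# Earliest and latest observation in the data set.
date_range <- range(algae$date, na.rm = TRUE)
print(date_range)

# Observations per lake and year; zero cells reveal missing lake-years.
coverage <- table(algae$lake, algae$year)
print(coverage)

# Lakes with at least one observation in 2009.
lakes_2009 <- rownames(coverage)[coverage[, "2009"] > 0]
print(lakes_2009)
```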
Re: Date
In the files Drew data/f/2004.xls and Drew data/f/2003data.xls, there are some values in the DATE columns which are not valid dates. You can see them by filtering or sorting. I could manually make the first 4 digits in these records 2004 and 2003, but this would be a guess based on the fact that this substitution would make a valid date. Alternatively, these records could be deleted.
RE: dates All but one of the strange dates in the 2004 file are for station 20014 and depth 0 in lake CCK. If we assumed the samples were collected in 2004, the sample date would be 2004-06-28. The file already contains a robust data set for this lake-station-depth from 2004-06-29, so I'm not convinced we can guess the actual sampling date; let's ignore the questionable records. I'm reasonably confident that we can assume a 2003 sampling year for the questionable date in Drew data/f/2003data.xls. The data would fill in missing depths from the associated sampling station, lake, and date.
Regarding data in 'All Lakes Phyto Data pre-1993.xlsx', this file does not contain taxa. Can you provide guidance about how to code it?
All Lakes Phyto Data pre-1993.xlsx -well shoot, I had forgotten that. I don't think these data are useful without taxa info.
*'grr' should read 'GRR'. Corrected in the April 2 cleaned dataset.
Dropped id1 and id3 from the dataset. These are internal fields used to identify duplicate records and replicated data.
Fixed the script reading Jade's 2009 data.
I am closing this issue. Please comment on the April 2 cleaned algae data in a new issue. I will check the WQ data issues within this task and move them to a new issue.