DOI-USGS / lake-temperature-model-prep

`data_queue` folder clean up and review #320

Closed padilla410 closed 2 years ago

padilla410 commented 2 years ago

I went through the `data_queue` folder and cleaned out files/folders that had been processed but remained in the queue. Below is a summary of the remaining files. Most have temperature and some depth information.

For most of these data, it looks like other subsets have already been processed. For example, the following data sets are all in the `in` folder:

However, as you can see in the table there is at least one more file available from the same group (e.g., MiCorps) or at the same location (e.g., Bull Shoals).

| File | Type | Files | Loc | Parse | Yrs | Notes |
| -- | -- | -- | -- | -- | -- | -- |
| Data_from_LAGOS | folder | 54 | IL, IN, ND, SD | | est early 90s-2019 | Has temperature, no parser? Not found in 'in'. There is a lot in here (54 files) and some have temperature (see explainer file). Dates are estimated based on a spot check of a few spreadsheets. |
| Lake_clarity_PCA | folder | 24 | MN | N | NA | No temp found in data. Looks like secchi only (based on spot checks) |
| Long29016100 | folder | 7 | MN | | 2012-2018 | Has temp, reasonably organized. Long Lake, MN |
| temp_DO_PCA | folder | 24 | MN | | est pre-1990s-2010s | DO and temp, many locs. Looks like it came from a nice DB. Dates are estimated based on a spot check of a few spreadsheets. |
| Ten_mile_lake_ BRUCE_ DOW_11041300 | folder | 3 | MN | | pre 1990s-2015 | Has temp. 2-3 files spread across some subfolders. Very messy (some in pdfs, others in xls, many variable formats). Other Tenmile Lake files have been processed. |
| NorfolkReservoir_* _ AR_monthlyTempDO_ 2016-2020.xlsx | file | 2 | AR | | 2016-2020 | Has temp, no parser. Wasn't part of initial norfork request. |
| IN_lakedata_IDEM.xls | file | 1 | IN | N | 2000-2007 | WQ only. No lake temperature. |
| IN_lakedata_ IUSPEA_alldata.xls | file | 1 | IN | | 1989-2010 | Temp data with depth and loc |
| Indiana_GlacialLakes_ TempDOprofiles_5.6.13.xlsx | file | 1 | IN | | 1965-2009 | Multiple IN lakes with temp. Waterbodies are named, but haven't run down locations |
| MI_TEMPDO_AUG_ MGLPID.xls | file | 1 | MI | | 2002-2009 | Has temp/depth. Locs are named. |
| MiCorps_download_ 11_13_2019.xlsx | file | 1 | MI | N | 2010-2013 | No temp found in data. |
| Leech_logger_ temps_06_16 | file | 1 | MN | | 2006-2016 | Assuming Leech Lake MN. Other Leech data has been parsed. |
| MASTER_mnlakedata_ historicalfiles_manualentry_ template.xlsx | file | 1 | MN | | pre 1990s-2008 ish | Has temp/depth. Locs are named, includes DOW. Dates are an estimate (they're messy) |
| Tenmile_2017_ PCA _Temperatures* | file | 2 | MN | | 2017 | MN, has temp and depth. Other Tenmile Lake data has been processed. |
| 20190409 DATA with all depths.csv | file | 1 | MO/AR | | 2019 | Bull Shoals. Has temp. Includes "Point" locations which may not have spatial reference. |
| Bull Shoals and LOZ profile data LMVP | file | 1 | MO/AR | | 2018-2020 | Good data - has temp. Has locations. |
| Temp_DO BSL * | file | 14 | MO/AR | | 2019-2020 | Bull Shoals, AR. Walleye Club data. Has temp, spatial info? |

lindsayplatt commented 2 years ago

@padilla410 thanks for this comprehensive assessment of files hanging out in data_queue. I appreciate your notes about the complexity of the file formats to indicate whether it would be easy or hard to parse! I'll add to the project board so that Jordan and I can discuss during our big issue convo.

I am curious how difficult you think it would be to include a column with some info about the year(s) covered by the data? Seems like that might be nice to see in this table to help the discussion without digging into each.

jordansread commented 2 years ago

Great! 🎉 ~Does this also include the other subfolders in that queue, such as Data_from_LAGOS? ~ nevermind, first entry

jordansread commented 2 years ago

Do we need to link this to #45 and perhaps close that one out?

padilla410 commented 2 years ago

@jread-usgs Taking a closer look at #45, yes I think so. I'll check it out when I address #267

jordansread commented 2 years ago

PS, I added #261 profile data to the queue, along with an explainer. As part of adding UNDERC depths for #261, there will be a crosswalk in place for these lakes and the data are formatted in a pretty straightforward way.

padilla410 commented 2 years ago

Below is a detailed summary for the datasets listed in the initial issue. In total, I think 8 of the files are candidates for parsers. Details on "why" are in the Follow-up column of the table below. Here are some of the criteria I considered:

To do:

| File | Follow-up | Loc, Date, Depth, Temp | Status | Data summary |
| -- | -- | -- | -- | -- |
| Data_from_LAGOS | Reviewed `explain` file (Data_from_LAGOS). Filtered for datasets that have values in the following fields: `Lat column`, `Long column`, `Depth identifier`, `Temperature identifier`. No datasets in this folder have data for all the columns listed, therefore no further action will be taken. | N | No further action needed | |
| Lake_clarity_PCA | No temperature data found | N | No further action needed | |
| Long29016100 | Data appears to be in reasonable shape (similar formats from year-to-year and sheet-to-sheet). Contains depths and temps. | Y | Candidate for parsing | One lake with a DOW, 7 years of data |
| temp_DO_PCA | Data is in a reasonable format, contains dates, times, depths, and temps for 24 lakes (with DOW numbers) | Y | Candidate for parsing | 24 lakes with DOWs, multiple years |
| Ten_mile_lake_  BRUCE_  DOW_11041300 | Data for Tenmile Lake (in MN, has DOW), collected by volunteers. Contains depths and temps but formats are all over the place. Tenmile Lake is already well characterized with ~300+ profiles. Parsing this data set is not worth the time investment. | Y | No further action needed | |
| NorfolkReservoir_* _ AR_monthlyTempDO _ 2016-2020.xlsx | Contains depth, temp for two known locations in Norfork reservoir. The format is not straightforward, but it is consistent. Lake is currently well characterized (> 400 profiles) | Y | Candidate for parsing | One lake, two locations, 5 years of data |
| IN_lakedata_ IDEM.xls | No temp data | N | No further action needed | |
| IN_lakedata_ IUSPEA_alldata.xls | Contains LakeID, lat/long, and temp for multiple lakes. Data are not profiles. Upon closer review only "max depth" is provided | N | No further action needed | |
| Indiana_ GlacialLakes_ TempDOprofiles_ 5.6.13.xlsx | Contains depth and temp for multiple lakes. Doesn't contain any spatial info (lat/long, LakeID, etc) so there is no way to match locations and be sure they are right. | N | No further action needed | |
| ~MI_TEMPDO_ AUG_  MGLPID.xls~ | ~Contains temp, depth, and LAKE_ID. `explain` file notes that LAKE_ID == MGLP ID (which we have an xwalk for). HOWEVER, there appears to be an issue with this field - multiple lakes have the same ID. For example: c(Kelly Lake, Caro Impoundment, Devils Lake) == "MINo WB" <-- these will fall away~ | N | No further action. Data has year and month, but no day | ~393 lakes, 17 years of data~ |
| MiCorps_download_ 11_13_2019.xlsx | No temp data | N | No further action needed | |
| Leech_logger_ temps_06_16 | Not really profiles. One depth at two locations in Leech Lake, which is already well characterized (> 100 profiles) | Y | No further action needed | |
| ~MASTER_mnlakedata_ historicalfiles_manualentry_ template.xlsx~ | ~Has loc, temp, depth, DOW. Not as messy as first thought.~ DOW numbers are inconsistently missing leading and trailing zeros. Would require manual verification for 204 lakes. | Y | No further action. DOW numbers need manual verification. | ~204 lakes, data ranges from 1905-1991~ |
| Tenmile_2017_  PCA _Temperatures* | Has loc, temp, depth. But it is only one profile for one year. | N | No further action needed | |
| 20190409 DATA with all depths.csv | Has temp, depth, and location is known for crosswalk. | Y | Candidate for parsing | One lake, 2019 |
| Bull Shoals and LOZ profile data LMVP | | Y | Candidate for parsing | One lake, 2018-2020 |
| Temp_DO BSL * | Has temp and depth. No explicit spatial info for points, but BS location is known. | Y | Candidate for parsing | One lake, 1 year (2020) |
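
For anyone wanting to repeat the `explain` file check described in the Data_from_LAGOS row, here is a rough Python sketch of that filter. The four field names come from the table; the helper names, the treatment of blanks and `NA` as missing, and the example rows are assumptions made up for illustration (this is not the script that was actually run, and the rows are not the real explainer contents).

```python
# Field names are taken from the follow-up table above.
REQUIRED_FIELDS = (
    "Lat column",
    "Long column",
    "Depth identifier",
    "Temperature identifier",
)


def has_value(cell):
    """Assumption: None, blank strings, and 'NA' all count as missing."""
    return cell is not None and str(cell).strip() not in ("", "NA")


def parse_candidates(explain_rows):
    """Keep only the explainer rows that fill in every required field."""
    return [
        row
        for row in explain_rows
        if all(has_value(row.get(field)) for field in REQUIRED_FIELDS)
    ]


# Hypothetical rows, for illustration only.
example_rows = [
    {
        "File": "dataset_a.xlsx",
        "Lat column": "LAT",
        "Long column": "LON",
        "Depth identifier": "DEPTH_M",
        "Temperature identifier": "TEMP_C",
    },
    {
        "File": "dataset_b.xlsx",
        "Lat column": "",
        "Long column": "",
        "Depth identifier": "Depth",
        "Temperature identifier": "Temp",
    },
]

# Only dataset_a.xlsx has all four fields filled in.
print([row["File"] for row in parse_candidates(example_rows)])
```

A dataset that has depth and temperature but no lat/long columns (like the second example row) drops out under this rule, which is the strictest reading of the criteria.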

lindsayplatt commented 2 years ago

Can I ask that we add one more step to this to-do list? For any data that we are not planning to pursue, can we make a folder within `data_queue` called `not_pursuing`, move those files there, and then include a README doc that links to this very comment?
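
One possible way to do that step, sketched in Python under some assumptions: the function name, the `README.md` file name, and the README wording are all hypothetical, and the comment link is passed in as a parameter rather than hard-coded.

```python
from pathlib import Path
import shutil


def shelve_not_pursuing(queue_dir, names, comment_url):
    """Move the named files/folders into <queue_dir>/not_pursuing.

    Also writes a README pointing back to the review comment.
    Returns the names that were actually found and moved.
    """
    queue_dir = Path(queue_dir)
    target = queue_dir / "not_pursuing"
    target.mkdir(exist_ok=True)

    moved = []
    for name in names:
        src = queue_dir / name
        if src.exists():  # skip anything already moved or misspelled
            shutil.move(str(src), str(target / name))
            moved.append(name)

    # Hypothetical README wording; adjust to taste.
    readme = target / "README.md"
    readme.write_text(
        "Files in this folder were reviewed and will not be parsed.\n"
        f"See the review summary for the reasons: {comment_url}\n"
    )
    return moved


# Example call (paths and link are placeholders, not real values):
# shelve_not_pursuing("data_queue", ["Lake_clarity_PCA"], "<link to the summary comment>")
```

The function skips names it cannot find instead of raising an error, so it can be rerun safely after a partial move.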

lindsayplatt commented 2 years ago

Thanks for this AMAZING summary! There's one that I would like to counter as a "No further action needed" ... IN_lakedata_ IUSPEA_alldata.xls. You wrote that data are not profiles and only "max depth" is provided. I think that is actually all that GLM needs, so I'd like @jread-usgs to weigh in on whether or not that may still be useful.

jordansread commented 2 years ago

Great! Yes, related to Lindsay's comment - some of these may be a "no" for creating a temperature parser but might need to be mentioned elsewhere (e.g., IN_lakedata_ IUSPEA_alldata.xls is depth info, which I'd guess is redundant with the new LAGOS US depths; Lake_clarity_PCA doesn't have temperatures but it probably has secchi data, which is relevant for our other data needs).

It is disappointing that the LAGOS bag-o-data doesn't contain retrievable temperature data. I had been expecting a clunky set of files but also assuming that there would be temperature data in at least some of them.

padilla410 commented 2 years ago

@jread-usgs, now that I am thinking about it, I'm going to walk back the LAGOS assessment.

First, a question: does data in Data_from_LAGOS automatically mean that there is spatial data associated with it? If that is the case, then we do have usable data from this source. There are 22 files that have temp and depth data - however, the temp/depth data are not always profiles (i.e., some data are one temp measurement per event). I would probably parse this data set last. The 22 files are all in good shape, but I think there might be a good bit of variability from file to file (there might be ~4-5 file format "families").
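
To make the "profile vs. one temp measurement per event" distinction concrete, here is a small Python sketch. It assumes each record has already been reduced to `site`, `date`, and `depth` keys (hypothetical names), and it assumes an event counts as a profile when it has at least two distinct depths; both are illustration choices, not anything defined in the LAGOS files.

```python
from collections import defaultdict


def classify_events(records, min_depths=2):
    """Label each (site, date) sampling event as 'profile' or 'single'.

    Assumption: an event is a profile when it has at least
    `min_depths` distinct depths.
    """
    depths_by_event = defaultdict(set)
    for rec in records:
        depths_by_event[(rec["site"], rec["date"])].add(rec["depth"])

    return {
        event: "profile" if len(depths) >= min_depths else "single"
        for event, depths in depths_by_event.items()
    }


# Made-up records for illustration only.
example_records = [
    {"site": "lake_1", "date": "2015-07-01", "depth": 0.0},
    {"site": "lake_1", "date": "2015-07-01", "depth": 1.0},
    {"site": "lake_1", "date": "2015-07-01", "depth": 2.0},
    {"site": "lake_2", "date": "2015-07-02", "depth": 0.5},
]

for event, label in classify_events(example_records).items():
    print(event, label)
```

Running something like this per file would give a quick count of how many of the 22 files are mostly profiles and how many are mostly single measurements.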