Open mjpdenver opened 9 years ago
Not sure how a pH of 22 or -1.5 would be there, as our system doesn't allow that, so maybe it came in through some data manipulation? We have some automatic validation steps.
For our data, a BLANK means no data exists and a 0 means a measurement was taken and the value was below the method's detection limit. Some programs will put 0s in blanks; dBase used to do that, so we used -9s to represent "no data present" versus a 0 value. The primary purpose of doing our watershed reports (WSRs) is to find the outliers that we let through when we validate "batches" of data, since we don't get to see a batch in context with previous data from that site. We are moving around the state finishing these WSRs for the first time, just finishing the CO and SP basins (huge basins), so not all data corrections have been completed on the actual database. Once we get around the state one time, we can add new data and find outliers more quickly, if you will. This has been a decade-long goal of mine, slowly but finally happening. I do these WSRs with a 6-month temp, an intern, and my available time, so it will likely take us about another 2 years to get through the entire state once and onto a rotating update schedule. Bottom line: the outlier problem is being dealt with, just slowly. If some outliers are identified in this process before we get to them, we can mark them for that analysis.
Remember too that some analytes came on later in the program. We started Ca/Mg when we switched from AA to ICP around 2000, so prior to that there will be no Ca values.
Barb Horn, Statewide Water Quality Specialist, Water Unit
P 970.382.6667 | F 970.247.4785 151 East 16th Avenue, Durango, CO 81301 barb.horn@state.co.us | cwp.state.co.us
"Humankind, be both"
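The BLANK / 0 / -9 conventions described above can be sketched as a small classifier. This is an illustrative sketch only, assuming results arrive as raw strings; the function name and return shape are not part of the RiverWatch code.

```python
def classify_result(raw):
    """Map a raw result string to a (status, value) pair under the
    conventions described above: blank = no data, 0 = below the method
    detection limit, -9 = legacy dBase sentinel for missing data."""
    text = (raw or "").strip()
    if text == "":             # BLANK: no data exists
        return ("missing", None)
    value = float(text)
    if value == -9:            # legacy dBase placeholder for "no data"
        return ("missing", None)
    if value == 0:             # measurement taken, below detection limit
        return ("non_detect", 0.0)
    return ("detected", value)

print(classify_result(""))      # ('missing', None)
print(classify_result("-9"))    # ('missing', None)
print(classify_result("0"))     # ('non_detect', 0.0)
print(classify_result("41.2"))  # ('detected', 41.2)
```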
Hi Barb,
I double-checked the original data queried from the CDSN, and pH values of 0.85 and 21.7 are in the original query. There are no negative concentrations in the database.
If it is a possibility, I would suspect some of the zero values might be the result of groups using zero instead of blanks. Calcium jumped out at me because I understand it to be common in natural water, often around 20-50 mg/L, and the detection limit is so low (0.2 mg/L). The other anecdotal evidence is that in locations where there does not appear to be a trend in detected values, the use of zero seems to stop abruptly around 2005. Metals could be tougher, because non-detects are expected there.
A tool here might be to look at cases where zero is recorded as the result for every analyte in a sample. For example, at the Emma site there are six zero values recorded for calcium and for iron, and from all the other samples it seems unlikely that either would go undetected. Samples where all results are recorded as zero could be identified and coded to indicate that the zero is likely not a non-detect.
We can talk more about it, but I think we could create a flag field to indicate where zeros might be misused, and also to identify extreme values such as the really high and low pH.
Thanks, Matt
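The flagging idea above (mark sampling events where every analyte is zero, and mark physically implausible values) could look roughly like this. The record fields, the pH bounds, and the flag labels are illustrative assumptions, not the project's actual schema.

```python
from collections import defaultdict

def flag_suspect_records(records):
    """Return {record uid: reason} for suspect records. Assumes each
    record is a dict with site, date, analyte, value, and uid fields."""
    flags = {}
    # Group results by sampling event (site + date).
    by_event = defaultdict(list)
    for r in records:
        by_event[(r["site"], r["date"])].append(r)
    for rows in by_event.values():
        # If every analyte in a sample is zero, the zeros are probably
        # placeholders rather than genuine non-detects.
        if all(r["value"] == 0 for r in rows):
            for r in rows:
                flags[r["uid"]] = "all_zero_event"
    for r in records:
        # Physically implausible pH values (e.g. 21.7 or 0.85);
        # the 2-12 range here is an illustrative choice.
        if r["analyte"] == "pH" and not (2 <= r["value"] <= 12):
            flags.setdefault(r["uid"], "extreme_value")
    return flags
```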
I think you should flag any suspect data; no validation process is perfect. I went back into the database and could not find any records with a pH of 22, but I found some with 10+ that were likely readings of the buffer rather than the river. I also found about 5 records with a pH of 0 and corrected those; most were from migrated data. Same with the Ca issue: I found about 10 records with Ca 0 instead of blank, and 0s were in many metals, so I corrected those. It doesn't seem to be a system issue.
I have made the following observations while censoring and analyzing the small_dat set of common analytes and sites.
1. Small_dat has 60,000 records. For 1,447 of them, Result.Value is less than Detection.Quantitation.Limit.Value1 and greater than zero. This is not a problem if detected values below the detection limit are reported. It could be an issue if values below the detection limit are only sometimes reported as measured but other times reported as 0. No action taken.
2. For 3,142 events, more than one record exists for the same date/site/analyte. In fact, the records are identical across every field except for the result (sometimes) and the Result.UID. Looking closely at these records, it is plausible that some of these events collected duplicates and a blank. For example, one might see two values of calcium around 82 mg/L and one reported as zero.
For these events, the maximum value of the two or three records was kept and a flag = 777 was applied. This reduced the small dataset from about 59,000 rows to approximately 54,017.
I should be clear that I don't believe these records are invalid. The issue is that they may lack an identifier labeling them as a duplicate or a blank.
For review, the removed records were stored in an Excel spreadsheet, data/censored/multipleSamples.xls.
This censoring removed the zero values questioned above, and the new binary files have been replaced in the repository.
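The collapse step described above (keep the maximum of the two or three duplicate records and flag it 777) can be sketched as follows. The record layout is an illustrative assumption; only the grouping key (date/site/analyte) and the 777 flag come from the comment itself.

```python
from collections import defaultdict

def collapse_duplicates(records):
    """Keep the maximum result per date/site/analyte group; mark the
    kept record with flag=777 and return (kept, removed) lists."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["date"], r["site"], r["analyte"])].append(r)
    kept, removed = [], []
    for rows in groups.values():
        if len(rows) == 1:
            kept.append(rows[0])
            continue
        # Apparent duplicate/blank event: keep the maximum value (blanks
        # tend to be reported as zero) and flag the kept record 777.
        rows = sorted(rows, key=lambda r: r["value"])
        kept.append(dict(rows[-1], flag=777))
        removed.extend(rows[:-1])
    return kept, removed
```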
Is this from the CDSN download dataset or the Excel file? I assume the CDSN download. Our system does not allow duplicates at all, and no QA/QC data is uploaded to CDSN, so blanks/duplicates are not what you are seeing. I have a feeling it is a duplicate based on how AWQMS/STORET/CDSN treats replacement data versus a duplicate in the database, but I am investigating. We found this to be an issue last year and supposedly did a complete delete (they did it for us) and a new upload. Barb
This is CDSN data. If it would be helpful, I could send a note and an example to the folks at CDSN. I could easily be wrong, but if quality-control data like duplicates, splits, or blanks are included in the database, there should be a field that clearly identifies the record as such.
For our task right now - I don't think this issue will block us.
I would like to see the data you are talking about. CDSN does allow folks to label or ID QA data, but again, RW doesn't upload any QA data, so it wouldn't apply to us. My temp is doing an upload now, and now is the time I have a person to look at this. If it is at all possible, just a sample of what you found. Thanks! Barb
From the subset of the 20 most commonly monitored stations and top 40 analytes, the records I found are saved here: https://github.com/CoWy-ASA/RiverWatch/blob/master/data/censored/multipleSamples.xls. Generally, the CDSN files have about 180 fields. These records were flagged because all fields except Result.UID and (sometimes) the result are identical. I could check your new data or show your temp how to use our code if that would help. Looking at the flagged records, these events seem to occur only from 2001 to 2004, so I suspect your upload data is fine with respect to this issue. Thanks.
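The duplicate-detection rule Matt describes (records identical in every field except Result.UID and sometimes the result) amounts to grouping on all remaining fields. A minimal sketch, assuming records are dicts; apart from Result.UID and Result.Value, the field names are illustrative:

```python
from collections import defaultdict

def find_apparent_duplicates(records, ignore=("Result.UID", "Result.Value")):
    """Return groups of records that agree on every field not in `ignore`."""
    groups = defaultdict(list)
    for rec in records:
        # Build a hashable key from all fields except the ignored ones.
        key = tuple(sorted((k, str(v)) for k, v in rec.items() if k not in ignore))
        groups[key].append(rec)
    return [rows for rows in groups.values() if len(rows) > 1]
```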
After further investigation of the apparent duplicates, I am confused and believe it is a DSN issue. We cannot find any results for the records you pulled out in the spreadsheet. We reproduced an upload of those same stations/events and the data is not duplicated; I assume the data we see is the record you kept.
We did do a complete renewal upload last year to STORET via AWQMS/CDSN. At EPA's national STORET warehouse, they deleted all our records before we uploaded. AWQMS did not, and it appears that old data was not written over and/or deleted (still not sure where those values came from).
Many of those event numbers, but not all, had duplicate and blank data associated with them and are from the era when we migrated data from dBase to a SQL database.
Here is my current theory, which may not matter to you all: last year we did a complete new upload; STORET deleted everything and AWQMS did not. I am testing that theory right now. If so, STORET should not produce the same output as CDSN.
Barb
Hi Barb,
Would it be helpful to talk about this for a few minutes next week? If there is a way we can help CDSN check their data, I would like to do so. As you saw in my recent note, I have been preoccupied the last week or so; next week is calmer.
Thanks,
Matt
Happy to chat. I will be in the annual Clean Water Act rulemaking hearing Monday and Tuesday (the basin of focus for this hearing is the S. Platte, and RW data is used in all of those), but Wednesday through Friday I am available most times. CDSN itself will not "care" about the data, if you will; their model follows EPA STORET (I was a founder in creating CDSN), and the model puts responsibility for data quality on the owner of the data. That said, their system can cause issues.
I spent last week with my temp exploring what the issue(s) are. We have a theory we are testing, but it depends on help from Gold Systems, where CDSN is hosted. The theory is this: we did an entire new upload last year to CDSN, which then went on to STORET (two storage places, if you will). STORET staff deleted all legacy RW data before this; CDSN did not. That means data not overwritten by an activity ID would still be in CDSN but not in STORET, since we didn't go in and delete it too. This would explain a bunch of the errors, duplicates, and such.
I have already corrected the outliers and bad data you found, and we will be uploading that "replacement" data with this year's upload. We are waiting to see whether we will do an entire overwrite again this year (based on GS's help and my temp's time; he is done mid-July) or wait until next year. In either case we will be cleaning this up; it might take 9 months from A to Z, but we will. Next year (fall) we will also start work on a new, updated RW database. That work (building a new database up to standards, hosting it in the cloud, and migrating all existing data over) will greatly improve the frequency of updates from RW to CDSN and allow the data to feed into other applications. I have been waiting 10 years for this.
Let me know a good time, and thanks for all you all do.
Hi,
I have made a few time series plots for the more commonly measured analytes at the more popular sites. For reasons discussed earlier, I am also only looking at data collected after 1999.
I made a simple Shiny plot and deployed it at shinyapps.io. The left panel has a drop-down menu to choose an analyte.
https://mjpdenver.shinyapps.io/shinyPlot/ (Note: the first time you open the page, you may have to wait 5 or 10 seconds for the data to load.)
Looking through the plots, I see what I assume are bad (or really interesting) results. For example, for calcium at the first site we see a very high value. For pH, there is a value of 22 and one of ~1.5.
Flipping through the analytes, I also notice that prior to 2005, zero values are measured more often. Again, looking at calcium: is it possible that river samples could have no measurable calcium? My aqueous chemistry is a bit rusty, but isn't there some calcium in nearly all river water?
We could try to identify outliers, but I don't think this is the place; that would be a project unto itself and shouldn't be done by statisticians who don't have knowledge of river water chemistry. Moreover, this seems to be more an issue to address with the Colorado Data Sharing Network.
Any ideas?