Conte-Ecology / conteStreamTemperature_web

Description and scripts for running the temperature model through the SHEDS web application
MIT License

Handling Error Flags #18

Open djhocking opened 9 years ago

djhocking commented 9 years ago

I am writing functions to flag potential errors in the data, and then functions to filter or process certain ones for use in the analysis. What is the best way to do this if we want to get the flags into the database (or do we)? I remember discussing this and deciding that we wouldn't alter what was in the database (just store the original data), but that we would create flags for potential problems. If I create flags associated with records, what is the best way to output them for the database? Do we even want to bother putting flags in the database, or should we just keep all that processing separate?

walkerjeffd commented 9 years ago

Well, if we want to flag individual values, then you need to associate each flag with the id column of the values table. So make sure you include that column when you fetch the values from the database. Then have each flag associated with an id (not just the series/location/datetime of the value).

Are these functions all going to be automated or do they require some manual checking to confirm which values get flagged?

Ideally, there would be a set of functions that would check every data point and automatically flag values that should be excluded from the model run. Then for each model run, we would just run all of the functions on all of the input data. Even though this would mean checking the same data multiple times, this seems necessary to me since the users will be able to add/remove data. For example, what happens if someone deletes a dataset and re-uploads it (maybe they found a mistake and changed some of the values, or they accidentally uploaded the same dataset in separate files)? The bottom line is that since users can change their own datasets, we have no guarantee that the ids associated with each value will remain constant.

But in the short term, I think what I would do is just save a dataframe with columns [value_id, flag] to an RData file. Then you can join that to the dataset fetched from the database each time you want to run the model, and update that RData file of flagged values whenever you run checks on new data.
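The join described above (the actual workflow would be in R with an RData file) can be sketched in Python/pandas; the column names here are illustrative:

```python
import pandas as pd

# Values as fetched from the database. The id column must be included
# so flags can be joined back to individual records.
values = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "temperature": [12.1, 35.0, 12.4, 12.2],
})

# Flags produced separately by the QAQC checks (the R workflow would
# persist this as an RData file with columns [value_id, flag]).
flags = pd.DataFrame({"value_id": [2], "flag": ["out_of_range"]})

# Left-join the flags onto the values, then drop flagged rows
# before using the data as model input.
merged = values.merge(flags, left_on="id", right_on="value_id", how="left")
clean = merged[merged["flag"].isna()].drop(columns=["value_id", "flag"])
print(clean["id"].tolist())  # [1, 3, 4]
```

Keeping the flags in a side table keyed on value id means the values table itself never has to be modified.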

We could store this all in the database, but you'd need to write UPDATE SQL statements against the values table, and doing much updating of the data tables in the db would make me nervous. It seems safer to me to only let the users modify their own data.

djhocking commented 9 years ago

Thanks. It sounds like there might not be much value in storing flags in the database at the moment. I guess it could be useful for managers finding errors and potentially correcting them or preventing them in the future. I'll keep the functions general for the time being but will plan to run them all on all the input data during the analysis.

bletcher commented 9 years ago

We need to think, though, about how we show the flagged points on the graphs, both the user-flagged and the algorithm-flagged points. My sense is that it would be good to have a table in the DB for flagged points and to run the algorithm before each model update.


walkerjeffd commented 9 years ago

Hmm, yeah, I wasn't thinking we would show the algorithm flags to the users, but I guess that would be useful and make sense. So then we would store the [id, flagged] table in the database (separate from the values table) and overwrite it every time the flagging algorithm is run. We'll also keep the flagged column in the values table to hold the user flags. Then we can just differentiate the two types of flags on the plot using different colors. This will get a little tricky when computing daily means for the plot, though (e.g., if only some of the values within a day are flagged, should the whole day be flagged?). And what if a particular day of values has both types of flags?

Maybe we do this:

1) Users can flag values through the interface (these flags are stored in the values.flag column)
2) When creating an input dataset for the model, we would only fetch the values that do not have a user flag
3) Run the QAQC algorithm on the non-user-flagged data, and save the algorithm flags to a separate table in the database (maybe we'll call it values_qaqc, with columns id and flag, where id is a foreign key to values.id)
4) The resulting data that passed the QAQC algorithm (and thus has neither user flags nor algorithm flags) would then be used as input to the model
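The schema and fetch query implied by those four steps might look like the following sketch (shown with SQLite for a self-contained example; the real database and exact column types may differ):

```python
import sqlite3

# Minimal sketch of the proposed schema: values.flag holds user flags,
# and values_qaqc holds algorithm flags keyed on values.id.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "values" (
    id INTEGER PRIMARY KEY,
    temperature REAL,
    flag TEXT                          -- user flag, NULL if unflagged
);
CREATE TABLE values_qaqc (
    id INTEGER REFERENCES "values"(id),
    flag TEXT                          -- algorithm (QAQC) flag
);
""")
con.executemany('INSERT INTO "values" VALUES (?, ?, ?)', [
    (1, 12.1, None),
    (2, 35.0, None),          # will be caught by the QAQC algorithm
    (3, 12.4, "bad_sensor"),  # flagged by a user
    (4, 12.2, None),
])
con.execute("INSERT INTO values_qaqc VALUES (2, 'spike')")

# Step 4: model input = rows with neither a user flag nor an algorithm flag.
rows = con.execute("""
    SELECT v.id, v.temperature
    FROM "values" v
    LEFT JOIN values_qaqc q ON q.id = v.id
    WHERE v.flag IS NULL AND q.id IS NULL
    ORDER BY v.id
""").fetchall()
print(rows)  # [(1, 12.1), (4, 12.2)]
```

Because values_qaqc is a separate table, it can be truncated and rewritten on every algorithm run without ever issuing UPDATEs against the values table.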

djhocking commented 9 years ago

I think that’s a good plan. You’re right that a challenge will be dealing with daily vs. sub-daily data and their associated errors. I suppose the user flags in the values table can currently only be at the daily timestep, since that is all users can see. All of the QAQC I’m doing is intended to be automated (I can’t be looking at individual time series). However, some checks operate at the sub-daily level (e.g., a > 3 C change in stream temperature within an hour) and some at the daily level after I aggregate.
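The sub-daily spike check mentioned above could be sketched like this (Python/pandas rather than the project's R; the function and column names are hypothetical):

```python
import pandas as pd

def flag_spikes(df, max_change_per_hour=3.0):
    """Flag readings whose rate of change from the previous reading
    exceeds max_change_per_hour (degrees C per hour)."""
    df = df.sort_values("datetime").copy()
    dt_hours = df["datetime"].diff().dt.total_seconds() / 3600.0
    rate = df["temperature"].diff().abs() / dt_hours
    df["flag_spike"] = rate > max_change_per_hour  # NaN first row -> False
    return df

# Illustrative hourly readings: the jump from 10.5 to 15.0 C in one hour
# exceeds the 3 C/hour threshold and gets flagged.
readings = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2015-03-10 00:00", "2015-03-10 01:00",
         "2015-03-10 02:00", "2015-03-10 03:00"]),
    "temperature": [10.0, 10.5, 15.0, 14.8],
})
flagged = flag_spikes(readings)
print(flagged["flag_spike"].tolist())  # [False, False, True, False]
```

Normalizing by the elapsed time rather than comparing raw consecutive differences keeps the check meaningful when loggers record at different intervals.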

In an ideal world, my vision would be to run the QAQC with every check getting its own column; then, when aggregated to daily for display, any day with any flags would get a flag and be displayed in a different color (the table could be in long format in theory, but that’s less straightforward to my way of thinking). Then, if the user clicked on that point, it would display the list of flags (flag columns) for that day, and if they double-clicked, the time series at the finest resolution would be displayed for that day.
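The wide-format idea above (one column per check, with any flag propagating to the daily display) can be sketched as follows; the column names are illustrative:

```python
import pandas as pd

# Sub-daily data in wide format: each QAQC check has its own boolean column.
subdaily = pd.DataFrame({
    "date": ["2015-03-10", "2015-03-10", "2015-03-11", "2015-03-11"],
    "temperature": [10.0, 15.0, 11.0, 11.2],
    "flag_spike": [False, True, False, False],
    "flag_out_of_range": [False, False, False, False],
})

# Aggregate to daily: mean temperature, and "any" per flag column so a
# single flagged reading marks the whole day for that check.
flag_cols = [c for c in subdaily.columns if c.startswith("flag_")]
daily = subdaily.groupby("date").agg(
    {"temperature": "mean", **{c: "any" for c in flag_cols}}
)

# A day is displayed in the "flagged" color if any check fired on it;
# the per-check columns are retained so a click can list which ones.
daily["flagged"] = daily[flag_cols].any(axis=1)
print(daily["flagged"].tolist())  # [True, False]
```

Keeping the per-check columns through the aggregation is what makes the click-to-see-which-flags interaction possible without a second query.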

I assume that is all possible; the question is how much of it is worthwhile, since it would involve work on the R scripts, the database, and the front end.



walkerjeffd commented 9 years ago

Yeah, that makes sense to me. Maybe the first step would be for you to create some example plots using ggplot for some of the data you've already flagged, and maybe a definition list of the different types of flags/checks you're running. That'll make it easier for me to then figure out how to structure it in the DB and create the plots in d3.


djhocking commented 9 years ago

I am not having great success with the automated QAQC. I get too many false negatives or false positives depending on the thresholds I set, so I am going to take a step back and work on some other things for now. I think the easiest way forward for getting a robust dataset for modeling would be:

  1. Start with a flag set for every series_id in the public.series table.
  2. Allow users to flag specific values and store those flags in the public.values table (we could provide a specific set of things to look for, at a minimum).
  3. Once the user has checked the time series and flagged any specific values, the series-level flag will be removed (i.e., they approve the time series as valid for use, with the exception of the specifically flagged values).

This is similar to what's described in Jeff's original plan, but with the distinction between series-level and value-level flags for the user.

We know that everything Kyle had was already checked. I can talk to Gerry about the MA DEP data, but I expect that it has already been checked thoroughly. These series could be marked with a series flag = "FALSE" from the start. Then the above would apply to new data as it is uploaded.

The question is what to do with all the other CT and ME data that we've received more recently. Do those get flagged until someone checks them all (the original users and/or us)?

For any automated flagging that I get working reasonably well in R, those flags could then proceed as described above: basically, adding different flag columns and relating them to the data by values.id.
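The two-level scheme in the numbered steps above (a series-level "unreviewed" flag plus value-level user flags) can be sketched like this; the table and column names are illustrative, not the actual public schema:

```python
import pandas as pd

# Series-level flags: a series starts flagged (True = not yet reviewed)
# and the flag is cleared once a user approves the whole time series.
series = pd.DataFrame({
    "series_id": [101, 102],
    "series_flag": [False, True],  # 102 has not been reviewed yet
})

# Value-level flags: individual bad readings flagged by the user.
values = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "series_id": [101, 101, 102, 102],
    "temperature": [12.0, 30.0, 11.5, 11.7],
    "user_flag": [False, True, False, False],
})

# Model input: values from approved series, minus user-flagged values.
usable = values.merge(series, on="series_id")
usable = usable[~usable["series_flag"] & ~usable["user_flag"]]
print(usable["id"].tolist())  # [1]
```

Here the whole unreviewed series (102) is held back, and within the approved series (101) only the user-flagged value is dropped, which matches the approve-then-exclude workflow described above.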

djhocking commented 9 years ago

Jeff - I will get to the definitions list, plots, and examples after focusing on some other stuff.