USGS-R / sensorQC

sensorQC is a flexible framework for QAQCing high-frequency data for a continuously evolving catalogue of sensors

Clean Data Export #36

Open · lukeloken opened 8 years ago

lukeloken commented 8 years ago

Is it possible to create a 'clean' function that removes the flagged observations?

Suggestions: add an argument "which" that can be either "ALL" or a list of rules, with the default being ALL. Add an argument that simply deletes the whole row.

clean(flaggedobject, which=ALL, replace=NA)
clean(flaggedobject, which=c(1,2,4), replace=99999)
clean(flaggedobject, which=ALL, deleterow=TRUE)

jordansread commented 8 years ago

Cool, thanks @lukeloken. Great ideas.

jordansread commented 8 years ago

the one change I might suggest is:

clean(flaggedobject, which='ALL', replace=NA) # to replace w/ NA
clean(flaggedobject, which='ALL', replace=NULL) # to delete row

and the ability to do this:

clean(object, which, ..., replace) #as the function arguments, which would support this:
clean(object, 'x > 9999', 'persist(x) > 10', 'MAD(x) > 3', replace=NA)

Thoughts @lukeloken and @aappling-usgs ?

lukeloken commented 8 years ago

I like the replace=NULL option.

Do you want the ability to 'clean' an unflagged object?

clean(object, 'x > 9999', 'persist(x) > 10', replace=NA)

Or is it more straightforward to run two commands?

flagged <- flag(object, 'x > 9999', 'persist(x) > 10')
cleaned <- clean(flagged, which=ALL, replace=NA) 
jordansread commented 8 years ago

I like that option personally, because then I can use sensorQC to really rapidly clean a vector or data.frame with one line. The key is having it be a good thing for power users without getting in the way of a more normal workflow.

lukeloken commented 8 years ago

I like having both options, but I don't know how difficult this is to code.

jordansread commented 8 years ago

oh yeah, for sure, I was thinking both.

lukeloken commented 8 years ago

I have another question for you, Jordan: how would you recommend removing a single observation that is clearly an outlier if there are also NA values? I think the MAD function does not work properly if there are also NA values in the series.

  dates <- seq(as.POSIXct('1999-01-01'),by=1,length.out=16)
  values1 <- c(runif(10,2,4),10000, runif(5,2,4))
  values2 <- c(runif(10,2,4),10000, NA, runif(4,2,4))
  data.in1 <- data.frame("DateTime"=dates,"sensor.obs"=values1)
  data.in2 <- data.frame("DateTime"=dates,"sensor.obs"=values2)
  flag(data.in1, "is.na(x)", 'MAD(x) > 3')
  flag(data.in2, "is.na(x)", 'MAD(x) > 3')

The two {flag} commands do not treat the 10000 observation the same.

jordansread commented 8 years ago

whoops, that is a bug. Good catch! https://github.com/USGS-R/sensorQC/issues/37

Also note that you can do:

values1 <- c(runif(10,2,4),10000, runif(5,2,4))
values2 <- c(runif(10,2,4),10000, NA, runif(4,2,4))
flag(values1, "is.na(x)", 'MAD(x) > 3')
flag(values2, "is.na(x)", 'MAD(x) > 3')

for really simple, quick QAQC

jordansread commented 8 years ago

@lukeloken you should be able to try this after doing another devtools::install_github('usgs-r/sensorQC')

check out the readme for the pattern: https://github.com/USGS-R/sensorQC/blob/master/README.md

Can you let me know if that is what you expected? (Note that I changed 'ALL' to 'all' as an argument just to be consistent with other packages it is used in.)

lukeloken commented 8 years ago

Jordan, I think this works great. The only comment I have is in regard to the output of {clean}. Is there any reason to keep the structure of this as a list rather than unlisting it to a data.frame class? This would save a line of code after processing.

jordansread commented 8 years ago

I think I need to think on that... In the meantime, clean(bla bla)$sensor should get you a data.frame, right?

lukeloken commented 8 years ago

No rush, I was just curious if there was a reason to keep the object class. ...$sensor and [[1]] get to the data.frame.

One reason to keep it as is: it will only print the first 15 observations, which is very handy.

jordansread commented 8 years ago

yeah, another little trick: you can use

print(sensor, 20)

to print any number of rows. 15 is the default.

In the future, I want the sensor w/o any flags (so the thing you get back from clean) to behave just like a data.frame, other than the print method. Cool?

lukeloken commented 8 years ago

Sounds good. That will make it easiest to merge multiple 'cleaned' objects stemming from the same original data file.

lukeloken commented 8 years ago

Thanks Jordan. The next step for me will be building individual rules for each parameter. Essentially, I'd like to build a table of parameters and rules

variable  rule1      rule2       rule3
pH        'x > 10'   'x < 4'     'MAD(x) > 3'
CO2       'x < 0'    'x > 4000'  'MAD(x) > 2'
var3      etc.

and then run a loop to apply each set of rules to the appropriate parameter

      file <- read(...)
      for (col in 2:ncol(file)){
        data <- file[,c(1,col)]
        # look up this variable's rules by matching names(data)[2] against the table above
        rules <- unlist(ruletable[which(ruletable[,1] == names(data)[2]), -1])
        newdata <- clean(data, rules, replace=NA)
        if (col == 2){
          finaldata <- newdata}
        if (col > 2){
          finaldata <- merge(finaldata, newdata, by="times", all=TRUE)}
      }

If you have any suggestions, I'm all ears.

jordansread commented 8 years ago

looks fine to me. Probably need

newdata <- clean(data, rules, replace=NA)$sensor %>%
  setNames(c('times', variable))

though to get the merge to work. Otherwise all of the data are going to have the same variable names ('x').

This is probably a more common use case than I imagined it would be originally, so let me know what the pain-points are if you have any.

lukeloken commented 8 years ago

Hey Jordan,

I think I'm getting somewhere with the automatic code. I may have found another bug in the {clean} function. It does not appear to accept data.frame objects.

dates <- seq(as.POSIXct('1999-01-01'),by=1,length.out=500)
values1 <- c(runif(100,-5,12),10000, runif(397,2,100), NA, -9)
file <- data.frame("DateTime"=dates,"sensor.obs1"=values1)
clean(file, 'x > 9999', 'persist(x) > 10', replace=NA)

Is there a quick way to convert a data.frame into a sensor object? Otherwise I may be better off running vectors through {clean} and simply binding them together. However, this makes me somewhat nervous in case the vectors end up different lengths.

#Make Dataset
dates <- seq(as.POSIXct('1999-01-01'),by=1,length.out=500)
values1 <- c(runif(100,-5,12),10000, runif(397,2,100), NA, -9)
values2 <- c(runif(397,2,18), NA, -9, runif(100,-2,36),10000)
file <- data.frame("DateTime"=dates,"sensor.obs1"=values1, "sensor.obs2"=values2)

#Set rules
Variables=names(file)[-1]
Rule1<-c('x < 0', 'x < 0')
Rule2<-c('MAD(x) > 3', 'MAD(x) > 2')
RuleTable<-data.frame(Variables, Rule1, Rule2, stringsAsFactors=FALSE)

col=2
for (col in 2:ncol(file)){
  data<-file[,c(1,col)]
  name<-names(data)[2]
  rules<- unlist(RuleTable[which(RuleTable[,1]==name), 2:3])
  flagged<-flag(data, rules)
  newdata<-clean(flagged, replace=NA)$sensor
  # newdata<-clean(data, rules, replace=NA)$sensor
  names(newdata)<-c('times',name)

  if (col == 2){
    finaldata<-newdata}
  if (col>2){
    finaldata<-merge(finaldata, newdata, by="times", all=TRUE)}
}
jordansread commented 8 years ago

whoops, @lukeloken I hadn't created the method for clean that accepts a data.frame yet. Expect that in the near(ish) future.

For now, the simple way to go from data.frame to sensor is:

sensor(data.frame(c(1),c(1)))
jordansread commented 8 years ago

As soon as this https://github.com/USGS-R/sensorQC/pull/40 is merged, you can do clean(data.frame...). I would suggest keeping them as data.frames so you don't lose the times, as that is what you are going to join by.

lukeloken commented 8 years ago

Hey Jordan,

Thanks for plugging away at sensorQC. I think we are getting somewhere and I can see the future.

Quick question about MAD. How exactly does the code work? I know what median absolute deviation is, but how is each observation compared to this value?

Hope you have a good weekend.

Luke

lukeloken commented 8 years ago

Nevermind, I think I found the correct code.

abs(x - median(x)) / mad(x)
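
For reference, that statistic is just each point's distance from the median in MAD units. A minimal base-R illustration (note that mad() scales by 1.4826 by default, for consistency with the standard deviation of normal data):

x <- c(2.1, 2.4, 2.2, 1000, 2.3, 2.5)
score <- abs(x - median(x)) / mad(x)
which(score > 3) # flags only the 1000
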
jordansread commented 8 years ago

Leys et al. (2013), "Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median" (attached PDF).

lukeloken commented 8 years ago

[image: pH time series from Lake Mendota]

Above figure: pH from Lake Mendota. Red values = raw data; black = 'cleaned' data after running MAD(x) > 3.

So I have been running the MAD and some basic min/max expressions to remove outliers. In this example, the MAD function may not be appropriate as it excludes values when we sampled for extended periods of time in different areas of the lake (see above). This occurs because we drove our boat into Cherokee Marsh where the pH was drastically different than the rest of the lake.

I'm thinking of additional rules to apply to the {flag} and {clean} functions. One method that comes to mind is applying the MAD function or other statistical tests over a rolling window, something similar to {runmean} in the 'caTools' package. Any thoughts on approaches to using a local outlier test rather than a global one? I realize this will increase the processing time, but it may be more appropriate for data strings that oscillate among multiple values. I could also see this being valid for stationary time series (e.g., a rain event triggers high NO3 and discharge for several days).
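
Something along these lines might work for the rolling version (an untested sketch using caTools' built-in running statistics, runquantile and runmad; x is the raw data vector):

library(caTools)
k <- 41 # window width (odd, so the window is centered)
center <- runquantile(x, k, probs = 0.5) # running median
score <- abs(x - center) / runmad(x, k, center = center)
flagged <- which(score > 3)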

jordansread commented 8 years ago

yeah, MAD should be used with a window if the data aren't expected to be stationary, so pretty much always.

You can use MAD with a window as MAD(x,w) > 3, but you have to window the data first. There is an example of that in the readme, but I don't think I have implemented the manual windowing yet, which is probably what you want. It is another column in the sensor called 'w'.

lukeloken commented 8 years ago

Hey Jordan,

I've been able to get the window function to work, but only with type='auto' and after generating a sensor object. It would be great to be able to 1) Window a data.frame object and 2) specify the window width (type=10~50 measurements or seconds). Hopefully we can pass this windowed object through {clean}/{flag} with 'MAD(w)>3'.

I'm curious how your window procedure works. Specifically, does it look in both directions, and how does it handle the first/last observations?

# col = column number of data values (2:ncol(datatable))
data<-datatable[,c(1,col)]
sensored<-sensor(data)
windowed<-window(sensored, type=10)
cleanedframe[,col]<-clean(windowed, 'MAD(w)>3', replace=NA)$sensor[,2]
jordansread commented 8 years ago

@lukeloken window currently just windows data based on breaks in time, which is the auto argument. Manual is supposed to take a time argument for window spacing (this would be more appropriate for continuous data).

Another option should be to pass in a vector of window indices. I don't know if you looked at the sensor$sensor$w vector, but it is just a vector of indices that are used to group observations. Passing in your own w vector should be supported; that way you can easily customize your own windows (maybe you w/ FLAMe would do them with space rather than time...).
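
For example, something like this (a hypothetical sketch that assigns the w column directly; data.in is a stand-in data.frame):

# group observations into blocks of 60 via a custom w vector
sensored <- sensor(data.in)
n <- nrow(sensored$sensor)
sensored$sensor$w <- rep(seq_len(ceiling(n/60)), each = 60)[seq_len(n)]
flagged <- flag(sensored, 'MAD(x,w) > 3')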

Which is the priority to implement first: manual windowing based on time, or adding your own w vector? Or other ideas here?

lukeloken commented 8 years ago

@jread-usgs I see what your current window function does. It is designed for burst data in which you capture 30 secs of 1 Hz data every 30 minutes. This obviously does not work for FLAMe data, which is collected continuously at 1 Hz.

I would vote for building customized exclusion vectors. I've started a function below to compute a rolling MAD, but you probably have a faster/better way to do it. This is computationally heavy, but it seems like a good place to start (at least for my type of data). If the output of this function became the w slot, we could use 'w > 3' to flag.

One problem I've already encountered is with lots of observations of the same value. For example, in many of the windows for pH, more than half the observations are identical, making mad=0 and MAD=Inf. We certainly don't want to exclude observations for this reason.
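
To see the problem in miniature (plain base R):

x <- c(rep(7.20, 15), 7.25, 7.30, 7.20)
mad(x) # 0, because more than half of the values are identical
abs(x - median(x)) / mad(x) # Inf for anything not exactly at the median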

x <- data[,2] # vector of x values
window <- 20 # size of window in both directions to compute MAD

#Rolling MAD function. Windows subsequent observations and computes MAD statistic from local observations

rollingMAD <- function(x, window){

  # Empty vector for computed MAD values
  WindowedMAD <- rep(NA, length(x))

  for (obs in 1:length(x)){
    # Set interval of observations to compute MAD
    # For observations near start and end, only include real observations
    seq <- (obs-window):(obs+window)
    seq <- seq[which(seq <= length(x) & seq >= 1)]
    interval <- x[seq]

    # Compute MAD
    localmedian <- median(interval, na.rm=TRUE)
    localmad <- mad(interval, na.rm=TRUE)
    WindowedMAD[obs] <- abs(x[obs]-localmedian)/localmad
  }

  return(WindowedMAD)
}

We might also think about including some of the running sd code from caTools (page 35 of the manual): https://cran.r-project.org/web/packages/caTools/caTools.pdf

lukeloken commented 8 years ago

Made some progress on the rollingMAD function and cleaning a FLAMe data.frame. Calling {clean} with 'w>3' will work after loading the local MAD values into the 'w' slot.

  sensored$sensor$w <- rollingMAD(sensored$sensor$x, window)
  cleaned <- clean(sensored, rules, replace=NA)$sensor[,2] 

No rush on looking at these, as it seems to be working at the moment. In the future, I'd be interested in how to increase processing speed.

# ==========================================================
# Rolling MAD function. 
# Windows subsequent observations and computes local MAD statistic
# ==========================================================

rollingMAD<-function (x, window){

  # Create empty vector for computed MAD values
  WindowedMAD<-as.numeric(rep(NA, length(x)))

  for (obs in 1:length(x)){
    # Set interval of observations
    # truncate interval for observations near start and end
    seq<-c((obs-window):(obs+window))
    seq<-seq[which(seq<=length(x) & seq>=1)]
    interval<-x[seq]

    # Compute MAD
    localmedian<-median(interval, na.rm=TRUE)
    localmad<-mad(interval, na.rm=TRUE)
    WindowedMAD[obs]<-abs(x[obs]-localmedian)/localmad
  }

  # Replace infinite values (mad==0) with NA
  WindowedMAD[which(is.infinite(WindowedMAD))]<-NA

  # Return vector of windowed MAD values.
  # Equal length to x
  return (WindowedMAD)
}

# ==========================================================
# Function to clean multiple parameters of a dataframe
# Col1 = times; col2 and up == variables
# Requires a data.frame of rules that match datatable column names
# window = number of observations to include in both directions for rollingMAD
# ==========================================================

sensorclean<-function(datatable, ruletable, window){

# Create empty datatable to fill with cleaned values. 
# Keep only column 1 (times)
cleanedframe<- datatable
cleanedframe[,-1]<-NA

#Clean all variables in datatable starting with column 2
for (col in 2:ncol(datatable)){

  # Create sensor object and fill 'w' slot with rollingMAD
  data<-datatable[,c(1,col)]
  sensored<-sensor(data)
  sensored$sensor$w<-rollingMAD(sensored$sensor$x, window)

  # Set rules manually or from ruletable and clean sensor object
  # rules<-as.character(c('w>3', 'is.na(x)'))
  rules<- unlist(ruletable[which(ruletable[,1]==names(data)[2]), -1])
  if (length(rules)==0){
    cleanedframe[,col]<-data[,2]}
  if (length(rules)>0){  
    cleanedframe[,col]<-clean(sensored, rules, replace=NA)$sensor[,2] 

    # Plot flagged (red) and retained (black) observations 
    flags<-data[,2]
    flags[which(flags==cleanedframe[,col])]<-NA
    plot(cleanedframe[,col], col="black", ylab=names(data)[2], xlab="time", ylim=range(data[,2], na.rm=TRUE))
    points(flags, col="red", pch=16)
  }
}

return(cleanedframe)
}

# =====================================================
# Workflow to clean FLAMe data after loading above functions
# =====================================================

# setwd("C:/Users/Luke/Dropbox/FLAME/Data/2015-08-24_SparklingLake")
datatable<-read.csv("SparklingLake2015-08-24.csv", header=TRUE, stringsAsFactors=FALSE)
datatable$ltime<-as.POSIXct(datatable$ltime, format="%Y-%m-%d %H:%M:%S", tz="UTC")
ruletable<-read.csv("C:/Users/Luke/Dropbox/FLAME/SensorQC/SensorQCRules_2015.csv", header=TRUE, stringsAsFactors=FALSE)
window <- 15 # size of window in both directions to compute MAD

CleanFLAME<-sensorclean(datatable, ruletable, window)

Example output from {sensorclean} below (red = removed values). Not perfect, but it's a start.

[image: example sensorclean output]

jordansread commented 8 years ago

Nice! Looking good. Thanks for taking a stab at this. Happy to see if I can help the speed-up.

jordansread commented 8 years ago

@lukeloken how slow is this for you? Can you give me an indication of the processing time it takes for a reasonably sized dataset so I can see where the starting point is?

lukeloken commented 8 years ago

On my Core i5 with 6 GB RAM: the dimensions of datatable are 30210 by 47, which is one of our larger data files. Is there somewhere online I can drop my ruletable and datatable so you can play with my code?

It takes about 14s to perform {rollingMAD} on one column of data.

sensored$sensor$w<-rollingMAD(sensored$sensor$x, Medianwindow, MADwindow)

To loop through all 46 columns and replace with NA using {sensorclean} takes about 5 minutes. This can obviously be shortened by not plotting, but for a first pass it seems useful to see what is removed.

CleanFLAME<-sensorclean(datatable, ruletable)

Hilary just suggested looking at the {tsoutliers} package.

library(tsoutliers)
fit <- arima(data[,2], order = c(1, 1, 0))
resid <- residuals(fit)
pars <- coefs2poly(fit)
outliers <- locate.outliers(resid, pars, types = c('AO'), cval = 25)

plot(data[,1],data[,2],cex=0.5,pch=19,col='red', ylab=names(data)[2], xlab="")
points(data[outliers$ind,1],data[outliers$ind,2],cex=0.5,pch=19,col='black')

Also a little slow, but might be another method for assigning flags.

jordansread commented 8 years ago

cool, I hadn't seen tsoutliers, looks like a good package. I am going to take a closer look at that.

I am trying out your code/data now.

jordansread commented 8 years ago

harder than I thought to make this fast. Was trying dplyr::lead() or lag() before going into MAD, but not having much luck with that (although it would be really fast if it worked).
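
Roughly what I was attempting, for the record (a hypothetical, unfinished sketch; shift the raw vector x each way, then take row-wise stats):

library(dplyr)
k <- 5 # half-window
m <- sapply(-k:k, function(s) if (s < 0) lag(x, -s) else if (s > 0) lead(x, s) else x)
center <- apply(m, 1, median, na.rm = TRUE)
score <- abs(x - center) / apply(m, 1, mad, na.rm = TRUE)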

jordansread commented 8 years ago

RcppRoll::roll_median(datatable$ChlARFU, 30) is a fast potential solution. Looking into how flexible that is.

jordansread commented 8 years ago

RcppRoll::rollit_raw lets you roll your own.
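
e.g., something like this might work (untested sketch; assumes RcppRoll's fill= and align= arguments, with x the raw data vector):

library(RcppRoll)
center <- roll_median(x, n = 31, fill = NA, align = "center")
dev <- abs(x - center)
# approximates a rolling MAD: deviations are from the rolling center rather than each window's own median
score <- dev / (1.4826 * roll_median(dev, n = 31, fill = NA, align = "center"))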

lukeloken commented 8 years ago

And I love rolling it raw!

jordansread commented 8 years ago

@lukeloken I will have something for you to test (if you are up for it) shortly.

jordansread commented 8 years ago

1.5 hrs...

lukeloken commented 8 years ago

Happy to be a guinea pig

jordansread commented 8 years ago

Give it a shot? https://github.com/USGS-R/sensorQC/pull/42

you can devtools::install_github from my fork (jread-usgs) and get this feature. See the readme here for an example of using the rolling window. It is centered right now by default.

lukeloken commented 8 years ago

Hey Jordan,

The code works and is much faster than my previous version. Is there any way I can see the code used to calculate {MAD.roller}?

A few issues: I can tell that if MAD==Inf, the value becomes flagged. Also, I'm finding that the threshold of 3 does not work well for a lot of the data. I've increased the threshold to 10, and it does a reasonable job for some parameters but a poor one for others (see figures below).

[images: flagging results for two example parameters]

I'm now starting to wonder how useful the MAD function is for moving data. Perhaps other outlier detection methods are needed.

Code to plot and assess rolling window outlier detection

for (col in 2:ncol(datatable)) {

sensor<-sensor(datatable[,c(1,col)])

sensor2<-window(sensor, n=40, type='rolling')
flagged<-flag(sensor2, 'x == 999999', 'MAD(x,w) > 10', 'MAD(x) > 3')

outliers<-flagged$flags[[2]]$flag.i
flagname<-flagged$flags[[2]]$expression

# plot(sensor2)
par(mar=c(4,4,1,1))
plot(sensor2$sensor$times, sensor2$sensor$x, col="black", type="p", pch=1, cex=1,main=names(datatable)[col], ylab=names(datatable)[col], xlab="")
points(sensor2$sensor[outliers, ], col="red", pch=16, cex=1)
legend("topleft", c(flagname), col="red", pch=16, cex=1, bty="n")

}
jordansread commented 8 years ago

wow those are some pretty figs :wink:

Code for MAD.roller is here.

I put that warning in there because I haven't really tested it w/ different things (like a bunch of NAs or Inf).

Do you think that rolling window is big enough? I was having good luck with much longer periods - like 100 or 300, but that is for station data.

There is also an issue here when data have trends. When doing windows in the past, I have detrended and then operated on the result. It is an additional computational cost, but it helps if you really just have a trend in the window that creates a wider spread in the data.
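
Something like this per window, for example (a rough sketch with a simple linear detrend):

# detrend a window with a linear fit, then score the residuals
detrended_score <- function(x) {
  t <- seq_along(x)
  r <- residuals(lm(x ~ t, na.action = na.exclude))
  abs(r - median(r, na.rm = TRUE)) / mad(r, na.rm = TRUE)
}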

lukeloken commented 8 years ago

I've been using a handful of window sizes and thought smaller would be better. This allows quick peaks to remain unflagged. Otherwise the tops of the mountains get cut off because the median remains on the valley floor.

I think that detrending could be useful. The first step in the tsoutliers package is building an ARIMA model then using the residuals as an indicator of outliers.

library(tsoutliers)
fit <- arima(data[,2], order = c(1, 1, 0))
resid <- residuals(fit)
pars <- coefs2poly(fit)
outliers <- locate.outliers(resid, pars, types = c('AO'), cval = 25)

plot(data[,1],data[,2],cex=0.5,pch=19,col='red', ylab=names(data)[2], xlab="")
points(data[outliers$ind,1],data[outliers$ind,2],cex=0.5,pch=19,col='black')

I haven't used these methods before, but am interested to dig into them if they seem like a good idea.