Recommended first step: describe what your filter would do. What type of inputs would it need? How would results be reported?
SciKit learn may be useful: http://scikit-learn.org/stable/modules/outlier_detection.html.
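For reference, the scikit-learn outlier-detection API from that page works roughly like this. This is only a minimal sketch; the data, shape, and contamination value are placeholders, not anything from our datasets:

```python
# Minimal sketch of scikit-learn outlier detection on placeholder data.
# EllipticEnvelope fits a robust Gaussian to the data and flags points in
# the low-density tail; predict() returns +1 for inliers, -1 for outliers.
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = rng.normal(loc=290.0, scale=2.0, size=(1000, 1))  # fake SST-like values (K)
X[:5] = 320.0                                         # inject a few obvious outliers

detector = EllipticEnvelope(contamination=0.01)       # guessed outlier fraction
labels = detector.fit(X).predict(X)                   # +1 = inlier, -1 = outlier
print("outlier indices:", np.where(labels == -1)[0])
```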
When @gheber was working on Spark with NCEP, I wanted to detect [El Nino](http://www.elnino.noaa.gov) events. I'd like to implement a filter that can detect an unusual rise in sea surface temperature.
We can compare the results with NOAA's findings.
My hours estimate for this task is 65.2.
I was thinking more about finding cases where the data is clearly incorrect rather than natural fluctuations of environmental data. Though I guess this could at some point overlap with naturally occurring but unusual cases such as El Nino events, given a tight enough confidence band.
We have an example of bad data in the NCEP dataset, why not start with that?
NCEP's bad data are already corrected. Did you download the same set that Gerd had?
I still have the uncorrected data on disk.
Alternatively we could copy the current dataset and inject some synthetic outliers into it.
Anything synthetic is not interesting. That's for software testing.
I think we should spend time on something that Earth scientists would like to achieve with earth data.
This is the topic for the Data Model Phase II SOW:
What would be the requirements to preserve the scientific credibility of the data and model output if a scientist were to use this site instead of the stewardship site for analysis?
Detecting El Nino events wouldn't seem to address this. How can we validate that the data we have transformed (through different chunk layouts, compression, aggregation, etc.) is correct?
OK. We've already validated through min/max/stddev computation. Why do you need an anomaly detection filter?
Is that sufficient?
Say I start with [1, 2, 3, 4, 5] and my transform results in [5, 1, 4, 3, 2]. Same min/max/stddev but not equivalent.
Also, how can we assert that the source data is correct? (e.g. the NCEP case) Or that it was downloaded correctly?
@gheber - do you have any thoughts on this?
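To make the point above concrete, here is a tiny illustration (plain NumPy, not part of any existing filter): the summary statistics match, yet an element-wise comparison shows the arrays are different.

```python
# Two arrays with identical min/max/stddev that are clearly not the same data.
import numpy as np

a = np.array([1, 2, 3, 4, 5], dtype=float)
b = np.array([5, 1, 4, 3, 2], dtype=float)  # a permutation of a

# Summary statistics match...
assert a.min() == b.min() and a.max() == b.max()
assert np.isclose(a.std(), b.std())

# ...but an element-wise comparison shows they are not equivalent.
print(np.array_equal(a, b))  # False
```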
@jreadey Can you explain what "data and model output" means in the Phase II SOW? I don't think you're understanding it correctly.
Comparing [1, 2, 3, 4, 5] and [5, 1, 4, 3, 2] doesn't require the noble concept of "anomaly detection." That's a job for a simple h5diff.
I'm thinking of a transform like:
Where the intent of the transform is to improve accessibility of the data (via better performance, reduced storage, better programming model). Given this, how do we ensure the data is correct?
Tell me what your thoughts are on the meaning of the SOW II...
h5diff will not work for cases where we've munged multiple files into one (aggregation), or for lossy compression filters (like MAFISC).
diff can do it if you dump as ASCII.
My interpretation of SOW II:
Model: https://en.wikipedia.org/wiki/Climate_model
Data: actual satellite observations that can tell how well the climate model predicted.
That seems more like climate research than a data model study.
Of course, earthdata is about climate research, and NASA scientists care about it most.
@jreadey I installed scikit-learn and learned how to use it. However, machine learning is generally about building a model from training data and testing the learned model against new data. @gheber's plateau detection (no change in temperature along the time dimension) doesn't require scikit-learn. My questions are:
1) Do you want me to work on a filter for plateau detection?
2) If so, do you want me to use the scikit-learn outlier detectors or not? I checked Gerd's graph on his blog and the values are within the normal range, so they are not really outliers. If we transform the data by computing deltas, we could classify zero-delta points as outliers, but I don't see any benefit from machine learning there.
@hyoklee - For plateau detection, wouldn't it be sufficient to look at the std dev of different sub-regions of the dataset? If a sub-region falls on a plateau, its std dev will be 0, which will be an outlier relative to the std dev found in other regions of the dataset.
For outlier detection, is scikit-learn the best approach, or are there simpler methods sufficient for our purposes?
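A minimal sketch of the sub-region idea, with made-up tile size, grid shape, and zero threshold:

```python
# Sketch: split a 2-D (lat, lon) grid into tiles and flag tiles whose
# std dev is (near) zero -- a possible sign of a plateau / stuck values.
import numpy as np

def flag_flat_tiles(grid, tile=(16, 16), eps=1e-6):
    """Return (row, col) tile indices whose std dev is below eps."""
    flagged = []
    nrows, ncols = grid.shape
    for i in range(0, nrows, tile[0]):
        for j in range(0, ncols, tile[1]):
            sub = grid[i:i + tile[0], j:j + tile[1]]
            if sub.std() < eps:
                flagged.append((i // tile[0], j // tile[1]))
    return flagged

# Illustration only: random field with one tile forced to a constant value.
field = np.random.normal(288.0, 3.0, size=(94, 192))  # NCEP-like grid shape
field[32:48, 64:80] = 288.0                           # artificial plateau
print(flag_flat_tiles(field))                         # -> [(2, 4)]
```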
@jreadey Yes, a std dev calculation would do the job, too. Since we already have a std dev filter, I don't think I need to create another one.
For the second question, I don't know if scikit-learn is the best. Do you want me to investigate other ML packages? Outlier detection is a very broad term, so I hope you can narrow down "our purposes" if you are looking for a simpler method.
@hyoklee - The existing filter wouldn't work that well because it computes the stddev of the entire earth grid. If a small region has repeated zeros, it wouldn't trigger anything abnormal in the overall std dev.
I'd think the filter would need to do overlapping sub-samples of the grid and then signal any outliers in this set.
For outlier detection, see if you can find out what research has been done in this area (I'm sure this is a well-investigated topic). Don't forget to think about detecting outliers in the time direction as well as geographic variations.
@jreadey I think the existing filter will report stddev = 0 for @gheber's collection when it's run on the 7850 non-aggregated HDF5 files.
I agree that the existing code won't work for small regions, but it's a matter of creating a small window and running it against subsets. I can write code with a window size of 2 and scan it along the time direction over the entire dataset (see the sketch below). Should I proceed with this first?
Thinking about the geographic direction, a window size of 2 may report too many stddev = 0 cases along the geolocation dimensions, especially along longitude. How big a window do you want me to start with?
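A rough sketch of the window-size-2 scan along the time direction. The array shape, file name, dataset path, and tolerance are assumptions, not the actual NCEP layout:

```python
# Sketch: scan a (time, lat, lon) array with a window of 2 along the time
# axis and flag time indices where the spatial std dev does not change
# from one step to the next (a candidate plateau or duplicated slice).
import numpy as np

def scan_time_stddev(data, tol=0.0):
    """data: array of shape (time, lat, lon). Returns suspicious time indices."""
    stddevs = data.reshape(data.shape[0], -1).std(axis=1)  # one std dev per time step
    diffs = np.abs(np.diff(stddevs))
    return np.where(diffs <= tol)[0]  # index t means steps t and t+1 match

# Hypothetical usage with h5py (file name and dataset path are made up):
# import h5py
# with h5py.File("ncep_aggregated.h5", "r") as f:
#     sst = f["/SST"][:]              # (time, lat, lon)
#     print(scan_time_stddev(sst))
```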
An outlier under learned model A may be normal under learned model B. It all depends on how you train and test with the data. Each learning algorithm also has many parameters to tune. I'll research what climate modelers use to detect anomalies. I found https://pypi.python.org/pypi/scikits.eartho but there's nothing useful there.
I found an interesting paper that uses a one-class SVM (the authors include Kamalika Das):
There was a case where the std dev is the same for two consecutive dates in Tair_2m.
No change in std dev at time index = 6480
However, the min/max values are different, so I think it's a false alarm. I also examined the file contents with HDFView. The script can be improved to consider min/max value differences as well (see the sketch below).
Also, the script can utilize the output from summary.py.
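A possible refinement along those lines, assuming summary.py writes one row of min/max/stddev per time step (the CSV format here is hypothetical): only report a plateau when min, max, and std dev all match between consecutive time steps.

```python
# Sketch: reduce false alarms by requiring that min, max, AND std dev all
# match between consecutive time steps before reporting a plateau.
import csv

def flag_plateaus(summary_csv):
    """summary_csv rows (hypothetical format): time_index, min, max, stddev."""
    rows = []
    with open(summary_csv) as f:
        for rec in csv.reader(f):
            rows.append((int(rec[0]), float(rec[1]), float(rec[2]), float(rec[3])))
    flagged = []
    for (t0, mn0, mx0, sd0), (t1, mn1, mx1, sd1) in zip(rows, rows[1:]):
        if mn0 == mn1 and mx0 == mx1 and sd0 == sd1:
            flagged.append((t0, t1))
    return flagged

# print(flag_plateaus("summary.csv"))  # time index 6480 would drop out if
#                                      # only its std dev (not min/max) matched
```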
I'll follow the same strategy described in the paper for outlier detection via scikit-learn.
> 4.2. Detailed description. The overall distributed anomaly detection algorithm consists of two stages. The pseudo code for the first step is shown in Alg. 1. In this step, each node computes the local outliers independently. The input to this local step are the dataset at each node Di, the size of training set Ts, a seed s of the random number generator, and the parameter ν. The algorithm first sets the seed of the random number generator to s. Then it selects a sample of size Ts from Di and uses it as the training set (Ti). The rest is used for the testing phase Hi. It then builds an SVM model Mi using Ti and ν. Once the model has been built, all points in Hi are tested using the set of support vectors defined by Mi. All those elements in Hi whose test score is negative is returned as the set of outliers Oi.
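A minimal sketch of that local step with scikit-learn's OneClassSVM; the sample size, ν value, and input features are placeholders, not values from the paper or our data:

```python
# Sketch of the paper's local step: train a one-class SVM on a random sample
# Ti of the local data Di, score the remaining points Hi, and report the
# points with negative scores as the outlier set Oi.
import numpy as np
from sklearn.svm import OneClassSVM

def local_outliers(Di, Ts=500, s=0, nu=0.05):
    """Di: (n_samples, n_features) local data block."""
    rng = np.random.RandomState(s)             # seed s
    idx = rng.permutation(len(Di))
    Ti, Hi = Di[idx[:Ts]], Di[idx[Ts:]]        # training set Ti, testing set Hi
    Mi = OneClassSVM(nu=nu, kernel="rbf").fit(Ti)
    scores = Mi.decision_function(Hi).ravel()  # negative score => outlier
    return Hi[scores < 0]                      # the set Oi

# Illustration only: one feature per point (e.g. an SST value); lat/lon/time
# could be appended as extra feature columns.
X = np.random.normal(300.0, 1.5, size=(2000, 1))
print(len(local_outliers(X)))
```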
I could detect some outliers (shown as red dots) in SST, but I don't understand why there are red dots in the middle that are classified as outliers by the one-class SVM.
What was the filter that produced the graph?
Here's the final result that detected the historic 1998 El Nino.
Create a filter that would list any anomalous data.