HDFGroup / datacontainer

Data Container Study

Create anomaly detection filter #31

Closed jreadey closed 8 years ago

jreadey commented 8 years ago

Create a filter that would list any anomalous data.

jreadey commented 8 years ago

Recommended first step: describe what your filter would do. What type of inputs would it need? How are results listed?

scikit-learn may be useful: http://scikit-learn.org/stable/modules/outlier_detection.html.
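For example, a minimal sketch of the fit/predict pattern those estimators share (toy data and parameters, just to show the shape of the API):

```python
# EllipticEnvelope is one of the detectors on the linked page; fit() on
# the data, then predict() returns +1 for inliers and -1 for outliers.
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
data = rng.normal(loc=20.0, scale=0.5, size=(100, 1))  # synthetic values
data[0] = 99.0                                         # one obvious outlier

clf = EllipticEnvelope(contamination=0.01).fit(data)
print(np.nonzero(clf.predict(data) == -1)[0])          # flagged indices
```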

hyoklee commented 8 years ago

When @gheber was working on Spark with NCEP, I wanted to detect [El Niño](http://www.elnino.noaa.gov) events. I'd like to implement a filter that can detect an unusual rise in sea surface temperature to confirm NOAA's finding.

  1. Input: SST dataset near the equatorial Pacific.
  2. Output: The year in which the strongest El Niño occurred, with results sorted by year.
  3. Filter: Subset the data to the Pacific region only, then find SST outliers using scikit-learn.

We can compare the results with NOAA's findings.

My hours estimate for this task is 65.2.

jreadey commented 8 years ago

I was thinking more about finding cases where the data is clearly incorrect rather than natural fluctuations of environmental data. Though I guess at some point this can overlap with naturally occurring but unusual cases, such as El Niño events, given a tight enough confidence band.

We have an example of bad data in the NCEP dataset; why not start with that?

hyoklee commented 8 years ago

NCEP's bad data have already been corrected. Did you download the same dataset that Gerd had?

gheber commented 8 years ago

I still have the uncorrected data on disk.

jreadey commented 8 years ago

Alternatively we could copy the current dataset and inject some synthetic outliers into it.

hyoklee commented 8 years ago

Anything synthetic is uninteresting; that's for software testing.

I think we should spend our time on something that Earth scientists would actually want to achieve with Earth data.

jreadey commented 8 years ago

This is the topic for the Data Model Phase II SOW:

> What would be the requirements to preserve the scientific credibility of the data and model output if a scientist were to use this site instead of the stewardship site for analysis?

Detecting El Niño events wouldn't seem to address this. How can we validate that the data we have transformed (through different chunk layouts, compression, aggregation, etc.) is correct?

hyoklee commented 8 years ago

OK. We've already validated through min/max/stddev computation. Why do you need an anomaly detection filter?

jreadey commented 8 years ago

Is that sufficient?

Say I start with [1, 2, 3, 4, 5] and my transform results in [5, 1, 4, 3, 2]. Same min/max/stddev, but not equivalent.

Also, how can we assert that the source data is correct (e.g., the NCEP case)? Or that it was downloaded correctly?
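A quick numpy illustration of why the summary statistics alone can't catch this:

```python
# Both arrays share the same min/max/stddev (summary stats are
# permutation-invariant), but only an element-wise check sees the change.
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 1, 4, 3, 2])   # a permutation of a

print(a.min() == b.min(), a.max() == b.max(), a.std() == b.std())  # True True True
print(np.array_equal(a, b))                                        # False
```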

@gheber - do you have any thoughts on this?

hyoklee commented 8 years ago

@jreadey Can you explain what "data" and "model output" mean in the Phase II SOW? You don't seem to understand them.

hyoklee commented 8 years ago

Comparing [1, 2, 3, 4, 5] and [5, 1, 4, 3, 2] doesn't require the noble concept of "anomaly detection"; that's a job for a simple h5diff.

jreadey commented 8 years ago

I'm thinking of a transform like:

[diagram: source dataset --> transformed dataset]

Where the intent of the transform is to improve accessibility of the data (via better performance, reduced storage, or a better programming model). Given this, how do we ensure the data is correct?

Tell me your thoughts on the meaning of SOW II...
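For concreteness, a minimal h5py sketch of such a transform plus a brute-force check (the file names, dataset name, and chunk shape are all made up):

```python
# Transform: rewrite a dataset with different chunking and compression.
import h5py
import numpy as np

with h5py.File("source.h5", "r") as src, h5py.File("transformed.h5", "w") as dst:
    dst.create_dataset("sst", data=src["sst"][...],
                       chunks=(16, 16), compression="gzip")

# Validation: element-wise equality of source and transformed values.
with h5py.File("source.h5", "r") as src, h5py.File("transformed.h5", "r") as dst:
    assert np.array_equal(src["sst"][...], dst["sst"][...])
```

(For aggregation or lossy filters an exact equality check wouldn't apply, which is where something like an anomaly filter could come in.)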

jreadey commented 8 years ago

h5diff will not work for cases where we've munged multiple files into one (aggregation), or for lossy compression filters (like MAFISC).

hyoklee commented 8 years ago

diff can do that if you dump the data as ASCII.

hyoklee commented 8 years ago

My interpretation of SOW II:

Model: https://en.wikipedia.org/wiki/Climate_model
Data: actual satellite observations that can tell how well the climate model predicted.

jreadey commented 8 years ago

That seems more like climate research than a data model study.

hyoklee commented 8 years ago

Of course, earthdata is about climate research, and NASA scientists care about it most.

hyoklee commented 8 years ago

@jreadey I installed scikit-learn and learned how to use it. However, machine learning is generally about building a model from training data and then testing the learned model against actual data. @gheber's plateau detection (no change in temperature along the time dimension) doesn't require scikit-learn. My questions are:

1) Do you want me to work on a filter for plateau detection?
2) If so, do you want me to use the scikit-learn outlier detector or not? I checked Gerd's graph on his blog, and the values are within the normal range, so they are not really outliers. If we transform the data by computing deltas, we could classify zero-delta points as outliers (see the sketch below), but I don't see any benefit from machine learning there.
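A minimal sketch of that zero-delta check (the array shape and injected values are made up):

```python
# Compute deltas along the time axis and flag time steps whose entire
# slice is unchanged from the previous step (delta == 0 everywhere).
import numpy as np

def find_plateaus(temps):
    """Indices of time steps identical to the previous step."""
    deltas = np.diff(temps, axis=0)           # differences along time
    flat = np.all(deltas == 0, axis=(1, 2))   # whole slice unchanged?
    return np.nonzero(flat)[0] + 1            # indices into temps

temps = np.random.rand(10, 4, 4)              # (time, lat, lon) stand-in
temps[6] = temps[5]                           # inject a plateau
print(find_plateaus(temps))                   # -> [6]
```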

jreadey commented 8 years ago

@hyoklee - For plateau detection, wouldn't it be sufficient to look at the std dev of different sub-regions of the dataset? If a sub-region falls on a plateau, its std dev will be 0, which will be an outlier relative to the std devs found in other regions of the dataset.

For outlier detection, is scikit-learn the best approach, or are there simpler methods sufficient for our purposes?

hyoklee commented 8 years ago

@jreadey Yes, a std dev calculation will do the job, too. Since we already have a std dev filter, I don't think I have to create another one.

For the second question, I don't know whether scikit-learn is the best. Do you want me to investigate other ML packages? Outlier detection is a very broad term, so I hope you can narrow down "our purposes" if you are seeking a simpler method.

jreadey commented 8 years ago

@hyoklee - The existing filter wouldn't work that well because it computes the stdev of the entire earth grid. If a small region has repeated zeros, it wouldn't register as anything abnormal in the overall std dev.

I'd think the filter would need to take overlapping sub-samples of the grid and then signal any outliers in that set (see the sketch below).

For outlier detection, see if you can find out what research has been done in this area (I'm sure it is a well-investigated topic). Don't forget to think about detecting outliers in the time direction as well as geographic variations.
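Something along these lines, perhaps (a rough numpy sketch; the window size, step, and z-score threshold are placeholders):

```python
# Compute the stddev of each overlapping window over the grid, then flag
# windows whose stddev is an outlier relative to the others (a plateau
# shows up as a window with stddev 0).
import numpy as np

def window_stddevs(grid, win=8, step=4):
    """Std dev of each overlapping win x win window of a 2-D grid."""
    return [((i, j), grid[i:i+win, j:j+win].std())
            for i in range(0, grid.shape[0] - win + 1, step)
            for j in range(0, grid.shape[1] - win + 1, step)]

def flag_outliers(stats, z=3.0):
    """Windows whose stddev is more than z sigmas from the mean stddev."""
    vals = np.array([s for _, s in stats])
    mu, sigma = vals.mean(), vals.std()
    return [(pos, s) for pos, s in stats if abs(s - mu) > z * sigma]

grid = np.random.rand(64, 64)
grid[20:28, 20:28] = 0.5                  # inject a plateau
print(flag_outliers(window_stddevs(grid)))
```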

hyoklee commented 8 years ago

@jreadey I think the existing filter will report stdev = 0 for @gheber's collection when it's run on the 7850 non-aggregated HDF5 files.

I agree that the existing code won't work for small regions, but it's a matter of creating a small window and running it against subsets. I can write code with a window size of 2 and scan the entire dataset along the time direction. Should I proceed with this first?

Thinking about the geographic direction, a window size of 2 may report too many stddev = 0 cases, especially along longitude. How big a window do you want me to start with?

An outlier under learned model A can be normal under learned model B; it all depends on how you train and test with the data, and each learning algorithm has many parameters to control, too. I'll research what climate modelers use to detect anomalies. I found https://pypi.python.org/pypi/scikits.eartho, but there's nothing useful there.

hyoklee commented 8 years ago

I found an interesting paper that uses a one-class SVM (the authors include Kamalika Das):

https://nex.nasa.gov/nex/static/media/publication/DAnom.pdf

hyoklee commented 8 years ago

There was a case where the std dev is the same for two consecutive dates in Tair_2m:

No change in std dev at time index = 6480

However, the min/max values are different, so I think it's a false alarm. I also examined the file contents with HDFView. The script can be improved to consider min/max value differences as well (see the sketch below).

Also, the script can utilize output from summary.py.
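A minimal sketch of that improved check (the file name is an assumption; Tair_2m is the dataset mentioned above):

```python
# Flag consecutive time steps only when stddev AND min AND max are all
# unchanged, so cases like the one above (same stddev, different min/max)
# are no longer reported as plateaus.
import h5py
import numpy as np

with h5py.File("ncep.h5", "r") as f:          # hypothetical aggregated file
    dset = f["Tair_2m"]                       # assumed shape: (time, lat, lon)
    prev = dset[0]
    for t in range(1, dset.shape[0]):
        cur = dset[t]
        if (np.isclose(prev.std(), cur.std()) and
                prev.min() == cur.min() and
                prev.max() == cur.max()):
            print("possible plateau at time index", t)
        prev = cur
```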

hyoklee commented 8 years ago

I'll follow the same strategy described in the paper for outlier detection via scikit-learn.

> 4.2. Detailed description. The overall distributed anomaly detection algorithm consists of two stages. The pseudo code for the first step is shown in Alg. 1. In this step, each node computes the local outliers independently. The input to this local step are the dataset at each node Di, the size of training set Ts, a seed s of the random number generator, and the parameter ν. The algorithm first sets the seed of the random number generator to s. Then it selects a sample of size Ts from Di and uses it as the training set (Ti). The rest is used for the testing phase Hi. It then builds an SVM model Mi using Ti and ν. Once the model has been built, all points in Hi are tested using the set of support vectors defined by Mi. All those elements in Hi whose test score is negative are returned as the set of outliers Oi.
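A minimal sketch of that local step using scikit-learn's OneClassSVM (variable names follow the paper's notation; the data and parameters are illustrative):

```python
# Alg. 1 (local step) as quoted above: seed the RNG, split D_i into a
# training sample T_i of size Ts and a test set H_i, fit a one-class SVM
# with parameter nu, and return test points with negative scores as O_i.
import numpy as np
from sklearn.svm import OneClassSVM

def local_outliers(D_i, Ts, s, nu):
    rng = np.random.RandomState(s)             # set the seed to s
    idx = rng.permutation(len(D_i))
    T_i, H_i = D_i[idx[:Ts]], D_i[idx[Ts:]]    # training sample / test set

    M_i = OneClassSVM(nu=nu).fit(T_i)          # build the SVM model M_i
    scores = M_i.decision_function(H_i)        # negative score => outlier
    return H_i[scores.ravel() < 0]             # the set of outliers O_i

D = np.random.randn(500, 1)                    # stand-in for one node's data
D[0] = 10.0                                    # inject an obvious outlier
print(local_outliers(D, Ts=100, s=42, nu=0.05))
```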

hyoklee commented 8 years ago

I could detect some outliers (red dots) in the SST data, but I don't understand why there are red dots in the middle that the one-class SVM classifies as outliers.

[figure_1: SST values with detected outliers marked as red dots]

jreadey commented 8 years ago

What was the filter that produced the graph?

hyoklee commented 8 years ago

Here's the final result, which detected the historic 1998 El Niño:

https://en.wikipedia.org/wiki/1997–98_El_Niño_Event

[one_svm: plot of SST outliers detecting the 1997-98 El Niño]