Closed MatthiasBaur closed 8 years ago
Attribute Selection: #62 Prediction Format: #61
Hey guys, thanks for the big update @MatthiasBaur ! I was on holidays over the long weekend, so here are a few questions now :)
> Choose the classifier
all those points will be decided in #62, correct?
> Include classifier as java import
So we should use weka in JS? Here is an npm package for that: https://www.npmjs.com/package/node-weka. Is this what you meant?
Then there are several points about the API which are discussed in #44 and #61
> Include functionality to predict for given lat/lon. Provide API for pokemon prediction. Include API requests for getting the amount of data we need.
What about the training of the classifier? do we use one model or do we retrain a model every hour or ... did you decide already anything about that? @goldbergtatyana @MatthiasBaur
Other than that, I'll start cleaning up the repo and so on when the attributes are selected in #62 and add the weka classifier and an internal interface for @Aurel-Roci "to predict for given lat/lon". And Aurel provides this data through the API as discussed in #44. Please correct me if I'm wrong :)
Morning,
> Choose the classifier
Yes.
> Include classifier as java import
Check these: https://weka.wikispaces.com/Programmatic+Use https://weka.wikispaces.com/Use+WEKA+in+your+Java+code
> Include functionality to predict for given lat/lon. Provide API for pokemon prediction. Include API requests for getting the amount of data we need.
We train the classifier at intervals. Which interval will probably depend on the machine the package is running on.
We don't need to train a classifier per se. All we will do is pass a training file and a test file to the weka package (a java call), and this we will do every time a user opens the app or gets to the edge of the 9x9 grid.
Training file == .arff file? But that means we have ~0.1 seconds computation every time a call is made.
Yes, the training file and the test file are two .arff files. 0.1 seconds is not much, i.e. acceptable, no?
I don't think that will work. Or to put it another way: it would be safer to train the classifier at regular intervals. That way we won't have any problems when there are peaks, and we won't lose performance on the classifier.
I agree, but let's see how it works with the node package.
Anyway, what should the training set look like? Do we consider the last 10k entries from the API? Because the feature `appearedDayOfWeek` might not be of much help if we have more than 10k sightings within a week.
Keep us posted on this. We ( @TatyanaGoldberg @gyachdav and I) were also discussing this yesterday quite a lot and this might be trickier than initially thought. I would suggest you put most of your focus on this now
I just ran some tests using a subset of the data gathered during one week. Before, we had a smaller timeframe. As a result, the co-occurrence dropped and the classifier didn't perform as well. So we should probably take data from the last hour and pick 10k evenly distributed over this hour. This also means `appearedDayOfWeek` is out. I also ran tests on this and the performance was comparable (difference < 0.05%).
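The "pick 10k evenly distributed over this hour" step could be sketched like this. This is only an illustration, not the project's code; `evenSample` and the `ts` field are assumed names:

```javascript
// Hypothetical sketch: sort sightings by timestamp and take every n-th
// entry so the k kept samples are spread evenly over the time window.
function evenSample(sightings, k) {
  const sorted = [...sightings].sort((a, b) => a.ts - b.ts);
  if (sorted.length <= k) return sorted;   // fewer entries than requested
  const step = sorted.length / k;          // fractional stride over the window
  const out = [];
  for (let i = 0; i < k; i++) out.push(sorted[Math.floor(i * step)]);
  return out;
}
```

For 10k from one hour of data, `evenSample(lastHour, 10000)` would keep roughly one sighting per 0.36 s of the hour, regardless of traffic spikes.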
@MatthiasBaur I added the info from your post in #62 about the classifier and attributes to the wiki. Let's keep the wiki up-to-date so that we have one place for the current configuration. I removed `appearedDayOfWeek` from the list, but please have a look if it is still correct and edit it if necessary :)
I created a Gmail account and an npmjs account so that we can publish our package. I also figured out how publishing works. It's pretty easy.
@TatyanaGoldberg I created a test arff file for a query (lat, lon, time) and realized that we cannot use the co-occurrence. When I enrich the data with our features, I need the pokemon ID to calculate the co-occurrence, and obviously we don't have it, as it is what we want to predict. What should we do here? @MatthiasBaur any idea for the co-occurrence?
Besides that, for a ~2km area with 81 grid cells, most of the features have been the same. I think we need a bigger scale. Otherwise we end up with pretty similar predictions for the whole grid.
> @TatyanaGoldberg i created a test arff file for a query (lat, lon, time) and realized that we cannot use the co occurrence.
Of course we can! You need to check whether the few pokemons that contribute to the prediction (ID 16, ID 19, and some others) were sighted within the last 24 hours and within a distance of 100m of the current location. This is how @MatthiasBaur collected co-occurrences. Matthias, please correct the numbers if they are wrong.
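The check described above could be sketched as follows. This is a hedged illustration, not the project's code: the function and field names are assumptions, and the contributing IDs are taken from the `cooc13/cooc16/cooc19/cooc129` attributes mentioned later in this thread:

```javascript
// Hypothetical sketch of the co-occurrence features: for each contributing
// pokemon ID, the feature is true if that pokemon was sighted within the
// last 24 h AND within ~100 m of the query point.
const COOC_IDS = [13, 16, 19, 129];
const DAY_MS = 24 * 60 * 60 * 1000;

// Great-circle distance in meters between two lat/lon points.
function haversineMeters(lat1, lon1, lat2, lon2) {
  const R = 6371000, rad = Math.PI / 180;
  const dLat = (lat2 - lat1) * rad, dLon = (lon2 - lon1) * rad;
  const a = Math.sin(dLat / 2) ** 2 +
            Math.cos(lat1 * rad) * Math.cos(lat2 * rad) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// query: { lat, lon, ts }; sightings: [{ pokemonId, lat, lon, ts }, ...]
function coocFeatures(query, sightings) {
  const features = {};
  for (const id of COOC_IDS) {
    features['cooc' + id] = sightings.some(s =>
      s.pokemonId === id &&
      query.ts - s.ts >= 0 && query.ts - s.ts <= DAY_MS &&
      haversineMeters(query.lat, query.lon, s.lat, s.lon) <= 100);
  }
  return features;
}
```

The point is that this only needs past sightings of the contributing IDs, not the ID of the pokemon being predicted, which resolves the concern above.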
> Besides that, for a ~2km area with 81 grids most of the features have been the same.
Yes, most, but not all. So, you are saying that we are then predicting the same pokemon at all 81 points? Is this the case? Is there a difference in the prediction score for them? I discussed with @MatthiasBaur about setting a threshold so that below a certain prediction score, a prediction becomes "no pokemon predicted".
@goldbergtatyana ok thanks, that's good! I was already afraid. In this case we are good :)
ok, we'll come up with some thresholds
Don't mean to be pushy but...
The plan was to release the app this week. Guys, can you describe the current status of this sub-project, i.e. what are you working on now, what are your todos, and what is the time estimate for when they are done?
@goldbergtatyana Basically it's done. The script takes a query as input (lat, lon, time), creates a 9x9 grid and then an arff file with an entry for every cell. This arff file is then passed to weka via a command line call, which returns the predictions. Those are parsed and will be returned as a JSON array. There are a few small tasks to complete, but this will be done by tomorrow morning. However, the server where the script will run needs Java in order to use the weka jar. @sacdallago is Java already available?
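For readers following along, the 9x9 grid step could look something like this. This is only a sketch; the cell size constant is an assumption, not a value from this thread:

```javascript
// Hypothetical sketch: build a 9x9 grid of cell centers around the query
// point. CELL_DEG is an assumed cell size in degrees (~220 m of latitude);
// the real spacing is not specified in this thread.
const CELL_DEG = 0.002;

function makeGrid(lat, lon, n = 9) {
  const half = Math.floor(n / 2);
  const cells = [];
  for (let i = -half; i <= half; i++)
    for (let j = -half; j <= half; j++)
      cells.push({ lat: lat + i * CELL_DEG, lon: lon + j * CELL_DEG });
  return cells;                 // n*n cells; the query point is the center cell
}
```

Each of the 81 cells would then become one row of the test arff file that weka scores.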
The server has not been set up yet, as we have to do everything with extreme caution in order to get finances right with TUM. Should be bought today.
If you need java, the only thing that needs to be done is to change the Dockerfile in @PokemonGoers/pokedata to apt-get the latest java, and then test whether the docker container works :) No need to install java directly on the machine. Coordinate with them, please.
@sacdallago predictions can be generated with the `predict(lat, lon, ts)` method in #68. However, it turned out that there is a problem with retraining the classifier. The idea was to retrain it every 15 min as a background task. I couldn't manage to do that with JavaScript. I tried to use `setInterval`, but the retrain function blocked other tasks until it was done, and it takes a while to retrain the classifier. And with web workers I could not use the functions to retrain the classifier, as `require` and `importScripts` led to a state where the worker did nothing. I'm a newbie in JS and thought it would be straightforward to do this as a background task. Most likely I did something wrong, but I have already spent a lot of time on it and don't know how to fix it.
How should we handle this?
Besides, currently we use only 1000 entries for the dataset, as we do not know what the 10k set should look like, see https://github.com/PokemonGoers/PredictPokemon-2/issues/61#issuecomment-253574736
So.. You are not doing it with JS, what are you doing it with then? :)
? I'm using JS. weka is used via command line though as it's a jar. How would you run a background task asynchronously?
My above comment was relative to:
> I couldn't manage to do that with javascript.
The retraining is also done via weka, right? Sorry, I'm not getting your problem; your description is not helping me right now.
I see :P
You build the arff file via JS to retrain
yep
You retrain by spawning a child_process?
yep, java -classpath ./data/weka.jar ...
You retrain, how?
Currently I don't, as it did not work. I tried:
`setInterval(retrainClassifier, 15min)`
For this approach I used another `setInterval(testPrediction, 1s)` to mock queries. Through console logs I saw that a `testPrediction` was executed every second, but while the `retrainClassifier` was running, no `testPrediction` was executed; only after the `retrainClassifier` finished did the `testPrediction` calls continue. Which means queries would be blocked by the training, if I get it right?
In which language?
All of it is in JS.
What is it that takes up the time?
Creating the arff file for 10k entries takes a long time, mostly because of the co-occurrence calculation. Training the classifier with that arff file takes a while as well.
Okay, did I get it right, @bensLine, that 'retraining' for you means collecting new data points to write a training arff file? According to @MatthiasBaur this task takes about half a minute with JS. This is cool.
Now, please paste here the command line command you use for running predictions with weka.jar.
Again, what is the problem? :)
Ok, I wasn't aware that half a minute is fine for creating the arff. Training will take some additional time, but that might be ok then. However, the problem is that during this half minute (or maybe more) we cannot handle requests. So every 15 min the service would not respond until the classifier is retrained. I wanted to train the classifier in the background to allow requests while training, but I cannot get this running. @goldbergtatyana
Aha, we are mixing terms here!
So, @PokemonGoers/predictpokemon-2 heads up:
the training cmd is

```
java -classpath ./data/weka.jar -Xmx1024m weka.classifiers.meta.Vote -S 1 -B "weka.classifiers.lazy.IBk -K 100 -W 0 -A \"weka.core.neighboursearch.LinearNNSearch -A \\\"weka.core.EuclideanDistance -R first-last\\\"\"" -B "weka.classifiers.bayes.BayesNet -D -Q weka.classifiers.bayes.net.search.local.K2 -- -P 1 -S BAYES -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5" -R PROD -no-cv -v -classifications "weka.classifiers.evaluation.output.prediction.Null" -t data/trainingData.arff -d data/classifier.model
```

this cmd creates a classifier model which is stored, so we save a few milliseconds per prediction. The prediction cmd is

```
java -classpath ./data/weka.jar -Xmx1024m weka.classifiers.meta.Vote -classifications "weka.classifiers.evaluation.output.prediction.CSV" -l data/classifier.model -T data/testData.arff
```
On my machine, which is a bit slow, creating the arff file for 10k entries takes about 48 seconds; most of that time is the co-occurrence calculation (42 s). Creating an arff file and running the prediction for 81 test data points takes around 2.6 seconds, about 1 second of which is the co-occurrence.
> this cmd creates a classifier model which is stored so we save a few milliseconds per prediction.
Since we are talking about milliseconds either way is fine (storing the model or using the train_arff_file directly)
> most of the time is needed for the co-occurrence calculation (42s).
It takes that long to calculate the co-occurrence with four pokemons (cooc13, cooc16, cooc19, cooc129)?
Yes, we're looking into it to optimize this
@bensLine worst case: make the training interval configurable, meaning that the default is 15 min, but if a user passes a setting which states 60 min, it will retrain every 60 min.
About the sync/async problem: I don't think the problem is bound to JS; it's bound to java, if I get it right. So there are only ugly solutions, like storing incremental file names and keeping a reference to the latest and the second-latest model (the one still training).
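The "two model files" idea above could be sketched as follows. The file names are made up for illustration; the `-d`/`-l` references point at the weka commands pasted earlier in the thread:

```javascript
// Hypothetical sketch: predictions always read from a "live" model path
// while training writes to a "staging" path; when training finishes,
// the two paths swap roles, so queries are never served a half-written model.
const MODEL_A = 'data/classifier-a.model';
const MODEL_B = 'data/classifier-b.model';

let liveModel = MODEL_A;           // path the prediction cmd loads with -l

function stagingModel() {          // path the next training cmd writes with -d
  return liveModel === MODEL_A ? MODEL_B : MODEL_A;
}

function promoteStagingModel() {   // call once the training child process exits
  liveModel = stagingModel();
  return liveModel;
}
```

Combined with an async child process for the training call, this would let requests keep using the old model for the half minute (or more) a retrain takes.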
@bensLine what is the threshold you came up with for new predictions and how did you select it?
There is actually no threshold :p I used a timer which triggers every 15min.
Can you please elaborate @bensLine ?
So when the package is used for the first time, the data of the last 24h is downloaded from the API, 10k entries are randomly selected, and a training set is created. After that, a 15 min timer is started. When the timer triggers, the training set is recreated as described above. Predictions are made with the current training set. Does this answer your question? Not really sure what you mean by the threshold. :p
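The "10k randomly selected" step above could be sketched with a partial Fisher-Yates shuffle. Again, a hedged illustration with assumed names, not the package's actual code:

```javascript
// Hypothetical sketch: pick k entries uniformly at random, without
// duplicates, by running the first k steps of a Fisher-Yates shuffle.
function randomSample(entries, k) {
  const a = [...entries];
  const n = Math.min(k, a.length);
  for (let i = 0; i < n; i++) {
    // Swap position i with a random later position (including i itself).
    const j = i + Math.floor(Math.random() * (a.length - i));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a.slice(0, n);
}
```

Only the first k swap steps are needed, so this stays cheap even if the 24 h window contains far more than 10k sightings.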
It is great to hear again how the training is done +1
My question was a different one. Our classifier is configured such that it always returns a prediction for any location point it takes as input. This of course cannot be true. Therefore, we said we define a threshold below which we consider a prediction not to be valid (i.e. no prediction is made). The predictions above a threshold we display on the map.
A threshold is decided by looking at sensitivity/specificity (see pic [1]): while one goes up, the other goes down. Ideally it is good to have them both in balance.
The threshold is applied to the prediction probability returned by Weka.
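Applying such a threshold to the parsed Weka output could look like this. The field names are assumptions about the parsed CSV rows, not the actual package API:

```javascript
// Hypothetical sketch: predictions whose probability falls below the
// threshold are reported as "no pokemon predicted" (pokemonId: null)
// instead of being shown on the map.
function applyThreshold(predictions, threshold) {
  return predictions.map(p =>
    p.probability >= threshold
      ? p
      : { ...p, pokemonId: null }   // below threshold: no valid prediction
  );
}
```

Sweeping `threshold` over held-out data and plotting sensitivity against specificity at each value is the balancing exercise described above.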
Hello Team! In the meeting today we decided to move on to building the npm package. Here is an outline of what has to be done:
Some additional information: The classifier bulletins are almost done; the final test is running overnight. We did the final reduction today, and now the weather feature is excluded alongside a bunch of other stuff. I will reference the final features here shortly. This means that the whole npm stuff can and should be kicked off. @Aurel-Roci and @bensLine, the design will fall to you, as you have the most experience. Create the issues that arise so we can tackle them. We discussed a whole lot during the meeting, so I probably forgot some points. Please ask here so the answers are public. I will reference two other issues here shortly for the output of the predictions and the classifier results.
Happy coding :smile: Matthias