Closed MatthiasBaur closed 8 years ago
Attribute Selection: #62 Prediction Format: #61
Hey guys, thanks for the big update @MatthiasBaur ! I was on holidays over the long weekend, so here are a few questions now :)
> Choose the classifier
all those points will be decided in #62, correct?
> Include classifier as java import
So we should use weka in JS? Here is an npm package for that: https://www.npmjs.com/package/node-weka. Is this what you meant?
Then there are several points about the API which are discussed in #44 and #61
> Include functionality to predict for given lat/lon. Provide API for pokemon prediction. Include API requests for getting the amount of data we need.
What about the training of the classifier? do we use one model or do we retrain a model every hour or ... did you decide already anything about that? @goldbergtatyana @MatthiasBaur
Other than that, I'll start cleaning up the repo and so on when the attributes are selected in #62 and add the weka classifier and an internal interface for @Aurel-Roci "to predict for given lat/lon". And Aurel provides this data through the API as discussed in #44. Please correct me if I'm wrong :)
Morning,
> Choose the classifier
Yes.
> Include classifier as java import
Check these: https://weka.wikispaces.com/Programmatic+Use https://weka.wikispaces.com/Use+WEKA+in+your+Java+code
> Include functionality to predict for given lat/lon. Provide API for pokemon prediction. Include API requests for getting the amount of data we need.
We train the classifier at intervals. Which interval will probably depend on the machine the package is running on.
We don't need to train a classifier per se. All we will do is pass a training file and a test file to the weka package (a java call), and this we will do every time a user opens the app or gets to the edge of the 9x9 grid.
Training file == .arff file? But that means we have ~0.1 seconds computation every time a call is made.
Yes, the training file and the test file are two .arff files. 0.1 seconds is not much, i.e. acceptable, no?
I don't think that will work. Or to put it another way: it would be safer to train the classifier at regular intervals. That way we won't have any problems when there are peaks, and we won't lose performance on the classifier.
I agree, but let's see how it works with the node package.
Anyway, what should the training set look like? Do we consider the last 10k entries from the API? Because the feature `appearedDayOfWeek` might not be of much help if we have more than 10k sightings within a week.
Keep us posted on this. We ( @TatyanaGoldberg @gyachdav and I) were also discussing this yesterday quite a lot and this might be trickier than initially thought. I would suggest you put most of your focus on this now
I just ran some tests using a subset of the data gathered during one week. Before, we had a smaller timeframe. As a result, the co-occurrence dropped and the classifier didn't perform as well. So we should probably take data from the last hour and pick 10k evenly distributed over this hour. This also means `appearedDayOfWeek` is out. I also ran tests on this and the performance was comparable (difference < 0.05%).
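The "pick 10k evenly distributed over this hour" step could be sketched like this. This is only an illustration, not the project's code; `evenSample` and the `ts` field are assumed names:

```javascript
// Hypothetical sketch: sort sightings by timestamp and take every n-th
// entry so the k kept samples are spread evenly over the time window.
function evenSample(sightings, k) {
  const sorted = [...sightings].sort((a, b) => a.ts - b.ts);
  if (sorted.length <= k) return sorted;   // fewer entries than requested
  const step = sorted.length / k;          // fractional stride over the window
  const out = [];
  for (let i = 0; i < k; i++) out.push(sorted[Math.floor(i * step)]);
  return out;
}
```

For 10k from one hour of data, `evenSample(lastHour, 10000)` would keep roughly one sighting per 0.36 s of the hour, regardless of traffic spikes.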
@MatthiasBaur I added the info from your post in #62 about the classifier and attributes to the wiki. Let's keep the wiki up-to-date so that we have one place for the current configuration. I removed `appearedDayOfWeek` from the list, but please have a look if it is still correct and edit it if necessary :)
I created a Gmail account and an npmjs account so that we can publish our package. I also figured out how publishing works. It's pretty easy.
@TatyanaGoldberg I created a test arff file for a query (lat, lon, time) and realized that we cannot use the co-occurrence. When I enrich the data with our features, I need the pokemon ID to calculate the co-occurrence, and obviously we don't have it, as it is what we want to predict. What should we do here? @MatthiasBaur any idea for the co-occurrence?
Besides that, for a ~2km area with 81 grid cells, most of the features have been the same. I think we need a bigger scale. Otherwise we end up with pretty similar predictions for the whole grid.
> @TatyanaGoldberg i created a test arff file for a query (lat, lon, time) and realized that we cannot use the co occurrence.
Of course we can! You need to check whether the few pokemons that contribute to the prediction (ID 16, ID 19, and some others) were sighted within the last 24 hours and within a distance of 100m of the current location. This is how @MatthiasBaur collected co-occurrences. Matthias, please correct the numbers if they are wrong.
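The check described above could be sketched as follows. This is a hedged illustration, not the project's code: the function and field names are assumptions, and the contributing IDs are taken from the `cooc13/cooc16/cooc19/cooc129` attributes mentioned later in this thread:

```javascript
// Hypothetical sketch of the co-occurrence features: for each contributing
// pokemon ID, the feature is true if that pokemon was sighted within the
// last 24 h AND within ~100 m of the query point.
const COOC_IDS = [13, 16, 19, 129];
const DAY_MS = 24 * 60 * 60 * 1000;

// Great-circle distance in meters between two lat/lon points.
function haversineMeters(lat1, lon1, lat2, lon2) {
  const R = 6371000, rad = Math.PI / 180;
  const dLat = (lat2 - lat1) * rad, dLon = (lon2 - lon1) * rad;
  const a = Math.sin(dLat / 2) ** 2 +
            Math.cos(lat1 * rad) * Math.cos(lat2 * rad) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// query: { lat, lon, ts }; sightings: [{ pokemonId, lat, lon, ts }, ...]
function coocFeatures(query, sightings) {
  const features = {};
  for (const id of COOC_IDS) {
    features['cooc' + id] = sightings.some(s =>
      s.pokemonId === id &&
      query.ts - s.ts >= 0 && query.ts - s.ts <= DAY_MS &&
      haversineMeters(query.lat, query.lon, s.lat, s.lon) <= 100);
  }
  return features;
}
```

The point is that this only needs past sightings of the contributing IDs, not the ID of the pokemon being predicted, which resolves the concern above.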
> Besides that, for a ~2km area with 81 grids most of the features have been the same.
Yes, most, but not all. So, you are saying that we are then predicting the same pokemon at all 81 points? Is this the case? Is there a difference in the prediction score for them? I discussed with @MatthiasBaur about setting a threshold so that below a certain prediction score, a prediction becomes "no pokemon predicted".
@goldbergtatyana ok thanks, that's good! I was already afraid. In this case we are good :)
ok, we'll come up with some thresholds
Don't mean to be pushy but...
The plan was to release the app this week. Guys, can you describe the current status of this sub-project, i.e. what are you working on now, what are your todos, and what is the time estimate for when they are done?
@goldbergtatyana Basically it's done. The script takes a query as input (lat, lon, time), creates a 9x9 grid and then an arff file with an entry for every cell. This arff file is then passed to weka via a command line call, which returns the predictions. Those are parsed and will be returned as a JSON array. There are a few small tasks to complete, but this will be done by tomorrow morning. However, the server where the script will run needs Java in order to use the weka jar. @sacdallago is Java already available?
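For readers following along, the 9x9 grid step could look something like this. This is only a sketch; the cell size constant is an assumption, not a value from this thread:

```javascript
// Hypothetical sketch: build a 9x9 grid of cell centers around the query
// point. CELL_DEG is an assumed cell size in degrees (~220 m of latitude);
// the real spacing is not specified in this thread.
const CELL_DEG = 0.002;

function makeGrid(lat, lon, n = 9) {
  const half = Math.floor(n / 2);
  const cells = [];
  for (let i = -half; i <= half; i++)
    for (let j = -half; j <= half; j++)
      cells.push({ lat: lat + i * CELL_DEG, lon: lon + j * CELL_DEG });
  return cells;                 // n*n cells; the query point is the center cell
}
```

Each of the 81 cells would then become one row of the test arff file that weka scores.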
The server has not been set up yet, as we have to do everything with extreme caution in order to get finances right with TUM. Should be bought today.
If you need java, the only thing that needs to be done is to change the Dockerfile in @PokemonGoers/pokedata to apt-get the latest java, and then test whether the docker container works :) No need to install java directly on the machine. Coordinate with them, please.
@sacdallago predictions can be generated with the `predict(lat, lon, ts)` method in #68. However, it turned out that there is a problem with retraining the classifier. The idea was to retrain it every 15 min as a background task. I couldn't manage to do that with JavaScript. I tried to use `setInterval`, but the retrain function blocked other tasks until it was done, and it takes a while to retrain the classifier. And with web workers I could not use the functions to retrain the classifier, as `require` and `importScripts` led to a state where the worker did nothing. I'm a newbie in JS and thought it would be straightforward to do this as a background task. Most likely I did something wrong, but I have already spent a lot of time on it and don't know how to fix it.
How should we handle this?
Besides, currently we use only 1000 entries for the dataset, as we do not know what the 10k set should look like, see https://github.com/PokemonGoers/PredictPokemon-2/issues/61#issuecomment-253574736
So.. You are not doing it with JS, what are you doing it with then? :)
? I'm using JS. weka is used via command line though as it's a jar. How would you run a background task asynchronously?
My above comment was relative to:
> I couldn't manage to do that with javascript.
The retraining is also done via weka, right? Sorry, I'm not getting your problem; your description is not helping me right now.
I see :P
You build the arff file via JS to retrain
yep
You retrain by spawning a child_process?
yep, java -classpath ./data/weka.jar ...
You retrain, how?
Currently I don't, as it did not work. I tried:
`setInterval(retrainClassifier, 15min)`
For this approach I used another `setInterval(testPrediction, 1s)` to mock queries. Through console logs I saw that a `testPrediction` was executed every second, but while the `retrainClassifier` was running, no `testPrediction` was executed; only after the `retrainClassifier` finished did the `testPrediction` calls continue. Which means queries would be blocked by the training, if I get it right?
In which language?
All of it is in JS.
What is it that takes up the time?
Creating the arff file for 10k entries takes a long time, mostly because of the co-occurrence calculation. Training the classifier with that arff file takes a while as well.
Okay, did I get it right, @bensLine, that 'retraining' for you means collecting new data points to write a training arff file? According to @MatthiasBaur this task takes about half a minute with JS. This is cool.
Now, please paste here the command line command you use for running predictions with weka.jar.
Again, what is the problem? :)
Ok, I wasn't aware that half a minute is fine for creating the arff. Training will take some additional time, but that might be ok then. However, the problem is that during this half minute (or maybe more) we cannot handle requests. So every 15 min the service would not respond until the classifier is retrained. I wanted to train the classifier in the background to allow requests while training, but I cannot get this running. @goldbergtatyana
Aha, we are mixing terms here!
So, @PokemonGoers/predictpokemon-2 heads up:
the training cmd is

```
java -classpath ./data/weka.jar -Xmx1024m weka.classifiers.meta.Vote -S 1 -B "weka.classifiers.lazy.IBk -K 100 -W 0 -A \"weka.core.neighboursearch.LinearNNSearch -A \\\"weka.core.EuclideanDistance -R first-last\\\"\"" -B "weka.classifiers.bayes.BayesNet -D -Q weka.classifiers.bayes.net.search.local.K2 -- -P 1 -S BAYES -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5" -R PROD -no-cv -v -classifications "weka.classifiers.evaluation.output.prediction.Null" -t data/trainingData.arff -d data/classifier.model
```

this cmd creates a classifier model which is stored, so we save a few milliseconds per prediction. The prediction cmd is

```
java -classpath ./data/weka.jar -Xmx1024m weka.classifiers.meta.Vote -classifications "weka.classifiers.evaluation.output.prediction.CSV" -l data/classifier.model -T data/testData.arff
```
On my machine, which is a bit slow, creating the arff file for 10k entries takes about 48 seconds; most of that time is the co-occurrence calculation (42 s). Creating an arff file and running the prediction for 81 test data points takes around 2.6 seconds, about 1 second of which is the co-occurrence.
> this cmd creates a classifier model which is stored so we save a few milliseconds per prediction.
Since we are talking about milliseconds either way is fine (storing the model or using the train_arff_file directly)
> most of the time is needed for the co-occurrence calculation (42s).
It takes that long to calculate the co-occurrence with four pokemons (cooc13, cooc16, cooc19, cooc129)?
Yes, we're looking into it to optimize this
@bensLine worst case: make the training interval configurable, meaning that the default is 15 min, but if a user passes a setting which states 60 min, it will retrain every 60 min.
About the sync/async problem: I don't think the problem is bound to JS; it's bound to java, if I get it right. So there are only ugly solutions, like storing incremental file names and keeping a reference to the latest and the second-latest model (the one still training).
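The "two model files" idea above could be sketched as follows. The file names are made up for illustration; the `-d`/`-l` references point at the weka commands pasted earlier in the thread:

```javascript
// Hypothetical sketch: predictions always read from a "live" model path
// while training writes to a "staging" path; when training finishes,
// the two paths swap roles, so queries are never served a half-written model.
const MODEL_A = 'data/classifier-a.model';
const MODEL_B = 'data/classifier-b.model';

let liveModel = MODEL_A;           // path the prediction cmd loads with -l

function stagingModel() {          // path the next training cmd writes with -d
  return liveModel === MODEL_A ? MODEL_B : MODEL_A;
}

function promoteStagingModel() {   // call once the training child process exits
  liveModel = stagingModel();
  return liveModel;
}
```

Combined with an async child process for the training call, this would let requests keep using the old model for the half minute (or more) a retrain takes.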
@bensLine what is the threshold you came up with for new predictions and how did you select it?
There is actually no threshold :p I used a timer which triggers every 15min.
Can you please elaborate @bensLine ?
So when the package is used for the first time, the data of the last 24h is downloaded from the API, 10k entries are randomly selected, and a training set is created. After that, a 15 min timer is started. When the timer triggers, the training set is recreated as described above. Predictions are made with the current training set. Does this answer your question? Not really sure what you mean by the threshold. :p
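The "10k randomly selected" step above could be sketched with a partial Fisher-Yates shuffle. Again, a hedged illustration with assumed names, not the package's actual code:

```javascript
// Hypothetical sketch: pick k entries uniformly at random, without
// duplicates, by running the first k steps of a Fisher-Yates shuffle.
function randomSample(entries, k) {
  const a = [...entries];
  const n = Math.min(k, a.length);
  for (let i = 0; i < n; i++) {
    // Swap position i with a random later position (including i itself).
    const j = i + Math.floor(Math.random() * (a.length - i));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a.slice(0, n);
}
```

Only the first k swap steps are needed, so this stays cheap even if the 24 h window contains far more than 10k sightings.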
It is great to hear again how the training is done +1
My question was a different one. Our classifier is configured such that it always returns a prediction for any location point it takes as input. This of course cannot be true. Therefore, we said we define a threshold below which we consider a prediction not to be valid (i.e. no prediction is made). The predictions above a threshold we display on the map.
A threshold is decided by looking at sensitivity/specificity (see pic [1]): while one goes up, the other goes down. Ideally it is good to have them both in balance.
The threshold is applied to the prediction probability returned by Weka.
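Applying such a threshold to the parsed Weka output could look like this. The field names are assumptions about the parsed CSV rows, not the actual package API:

```javascript
// Hypothetical sketch: predictions whose probability falls below the
// threshold are reported as "no pokemon predicted" (pokemonId: null)
// instead of being shown on the map.
function applyThreshold(predictions, threshold) {
  return predictions.map(p =>
    p.probability >= threshold
      ? p
      : { ...p, pokemonId: null }   // below threshold: no valid prediction
  );
}
```

Sweeping `threshold` over held-out data and plotting sensitivity against specificity at each value is the balancing exercise described above.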
Hello Team! In the meeting today we decided to move on to building the npm package. Here is an outline of what has to be done:
Some additional information: The classifier bulletins are almost done; the final test is running overnight. We did the final reduction today, and now the weather feature is excluded alongside a bunch of other stuff. I will reference the final features here shortly. This means that the whole npm stuff can and should be kicked off. @Aurel-Roci and @bensLine, the design will fall to you, as you have the most experience. Create the issues that arise so we can tackle them. We discussed a whole lot during the meeting, so I probably forgot some points. Please ask here so the answers are public. I will reference two other issues here shortly for the output of the predictions and the classifier results.
Happy coding :smile: Matthias