e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Discard GPS outliers #117

Open shankari opened 9 years ago

shankari commented 9 years ago

As we saw, the raw data from the phones is very noisy. We can smooth the data using the path filter, but we can also just perform a first pass using the RANSAC algorithm. There is an existing implementation of RANSAC in scikit-learn. http://scikit-learn.org/stable/modules/linear_model.html#robustness-to-outliers-ransac

Test out RANSAC on our existing data and see how well it works. If it works well, integrate it as a cleaning step when we receive the data (CFC_WebApp/main/tripManager.py).
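
A minimal sketch of what that first pass could look like, using scikit-learn's RANSACRegressor to fit each coordinate against time within a single trip. The point-dictionary keys (mTime, mLatitude, mLongitude) are assumptions for illustration, not taken from the existing schema, and a single linear model per trip is a crude assumption that may need to become per-segment fits for longer trips.

```python
# Sketch only: model latitude and longitude as roughly linear in time within a
# trip, fit RANSAC to each axis, and keep points that are inliers on both.
# Field names below are assumed, not from the existing pipeline.
import numpy as np
from sklearn.linear_model import RANSACRegressor

def ransac_inlier_mask(points):
    t = np.array([[p['mTime']] for p in points], dtype=float)    # (n, 1) feature
    lat = np.array([p['mLatitude'] for p in points], dtype=float)
    lon = np.array([p['mLongitude'] for p in points], dtype=float)

    lat_ok = RANSACRegressor().fit(t, lat).inlier_mask_
    lon_ok = RANSACRegressor().fit(t, lon).inlier_mask_
    return lat_ok & lon_ok

def discard_outliers(points):
    mask = ransac_inlier_mask(points)
    return [p for p, keep in zip(points, mask) if keep]
```

If this works well on the shared data, something like discard_outliers could then be called from the receive path in CFC_WebApp/main/tripManager.py.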

Eventually, we will want to move this to the phones as well, so that the server gets only clean data.

I will share a bunch of my own raw GPS data with the team that picks this issue.

sdsingh commented 9 years ago

My team, with Gautham and Jeffrey, would like to claim this issue.

shankari commented 9 years ago

Great! I'll make my raw GPS data, with accuracy, available on bcourses tonight.

Also, in that case, I'll want to see a phone project of some sort from Shaun and Jeffrey - either code you have written before, or the sample project for the platform of your choice. I've already seen several samples of Gautham's phone code... :) https://bcourses.berkeley.edu/courses/1306800/assignments/6059151

You might also want to start thinking right now about how to figure out "how well it works" since the data is not labelled :)

jeffdh5 commented 9 years ago

The goal of discarding outliers is to improve classification accuracy, right? If so, we could build one model with outliers discarded and one with outliers included, and compare the classification results.
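
A sketch of that comparison: train the same classifier on features built from raw trips and from outlier-filtered trips, and compare cross-validated accuracy. Both extract_features and discard_outliers below are placeholders, not functions from the existing codebase.

```python
# Sketch: same classifier, raw vs. outlier-filtered inputs, 5-fold CV accuracy.
# extract_features and discard_outliers are placeholder callables.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_with_without_outliers(trips, labels, extract_features, discard_outliers):
    X_raw = [extract_features(t) for t in trips]
    X_clean = [extract_features(discard_outliers(t)) for t in trips]

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    raw_acc = cross_val_score(clf, X_raw, labels, cv=5).mean()
    clean_acc = cross_val_score(clf, X_clean, labels, cv=5).mean()
    return raw_acc, clean_acc
```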

And I will get started on the sample phone project very soon!

shankari commented 9 years ago

The goal of discarding the outliers is twofold:

Your evaluation technique (clever thought, btw) will definitely test the first goal. Can you think of a way to test the second? (It is OK if you can't.)

shankari commented 9 years ago

In case you want to visualize how the raw data looks as trips, there is a script CFC_WebApp/main/gmap_display that uses pygmaps to plot trips.

@Mogeng has some scripts that update those to show the points, shaded by their accuracies. @Mogeng, can you check in that code so that this team can get a head start on their trips...
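
For reference, a bare-bones version of that kind of plot with pygmaps might look like the sketch below; the accuracy thresholds and the point field names are made up for illustration.

```python
# Sketch: draw one trip's raw points, shading each by its reported accuracy
# (green = tight fix, yellow = moderate, red = loose / likely outlier).
import pygmaps

def draw_trip(points, out_html='trip.html'):
    first = points[0]
    gmap = pygmaps.maps(first['mLatitude'], first['mLongitude'], 14)
    for p in points:
        acc = p.get('mAccuracy', 0)
        color = '#00FF00' if acc < 25 else ('#FFFF00' if acc < 100 else '#FF0000')
        gmap.addpoint(p['mLatitude'], p['mLongitude'], color)
    gmap.draw(out_html)
```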

Mogeng commented 9 years ago

Just pushed the script and 3 visualization HTML files to deployment_data/raw_save_our_data.


shankari commented 9 years ago

These have now been uploaded to bcourses as well and a link sent to the team members.

jeffdh5 commented 9 years ago

Is there some sample code we can use to set up our own mode classifier? How do we extract the confidence for a new classification?

shankari commented 9 years ago

The existing mode classifier (based on a random forest) is available at CFC_DataCollector/modeinfer. If you have successfully imported all the trips into the database using moves.collect, then you can just run the existing pipeline using something like:

 cd ..../CFC_DataCollector && .../anaconda/bin/python modeinfer/pipeline.py >> /tmp/pipeline.stdinoutlog 2>&1

You are missing two crucial pieces of data to make this work. Can you identify what those are?

Once you have successfully run the classifier, each trip in the database will have a predictedModes array that gives you the confidences for each mode. The value displayed on the client is the mode with the highest confidence.

An explanation of the various modes might be instructive. https://github.com/e-mission/e-mission-server/blob/master/CFC_WebApp/README.modes

Your prior experience with pymongo is likely to be useful while poking around the data at this time :)
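
A small pymongo sketch of poking at that output is below; the database and collection names (Stage_database / Stage_Sections) and the exact shape of predictedModes are assumptions to check against the real schema.

```python
# Sketch: list the highest-confidence predicted mode for a few classified
# sections. predictedModes is treated here as a mode -> confidence mapping;
# if it is stored as a positional array, index into the mode list from
# CFC_WebApp/README.modes instead.
from pymongo import MongoClient

db = MongoClient('localhost').Stage_database

for section in db.Stage_Sections.find({'predictedModes': {'$exists': True}}).limit(10):
    predicted = section['predictedModes']
    best_mode = max(predicted, key=predicted.get)
    print(section['_id'], best_mode, predicted[best_mode])
```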

shankari commented 9 years ago

You are missing two crucial pieces of data to make this work. Can you identify what those are?

By "this", I mean your evaluation technique :)

jeffdh5 commented 9 years ago

Hm my intuition says I need a training set and correct labels for that set.


shankari commented 9 years ago

Correct. So the next question is: how do you get a training set and correct labels for that set?

I actually do have an answer to this, but I want your team to try to figure it out because:

  • you might come up with a better answer than me, and I don't want to bias you
  • it is valuable experience for the real world, where you might not have a clearly defined problem

jeffdh5 commented 9 years ago

We could process all the past trip data and keep only the trips that have been classified by users; we can't use unclassified data, even with high confidence scores, because we want our training data to be grounded in truth.
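
A sketch of that filter with pymongo; the collection and field names (Stage_Sections, confirmed_mode) are guesses at the schema, not confirmed.

```python
# Sketch: keep only sections whose mode the user has confirmed; everything
# else is dropped from the training set, regardless of classifier confidence.
from pymongo import MongoClient

db = MongoClient('localhost').Stage_database

labelled = list(db.Stage_Sections.find(
    {'confirmed_mode': {'$exists': True, '$ne': None}}))
labels = [s['confirmed_mode'] for s in labelled]
print('usable labelled sections:', len(labelled))
```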


shankari commented 9 years ago

Agreed that we can't use unclassified data.

Can you expand a bit more on the classified data? Do you want to use data collected from moves? From our own data collection? Which is better?

Also, note that training data is not enough...

shankari commented 9 years ago

Fixed via 30c20197e85132f896e500a699e5f871f196b4fe

shankari commented 9 years ago

It is unclear whether the code to run this in collect.py was ever tested. It broke when I deployed it, and I was so excited about the feature that I tried to patch it multiple times, but I gave up after 5 changes when it was still broken.

I have now disabled this feature on the server. Please submit a new pull request that includes a unit test (in TestMovesCollect).
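
For what it's worth, a sketch of the kind of regression test being asked for: feed the cleaning step a synthetic trip with one obvious jump and assert that the jump is removed. The import path and function name are placeholders for wherever the cleaning step actually lives; the real test would sit alongside TestMovesCollect.

```python
# Sketch of a unit test for the outlier filter; module/function names are
# placeholders, not the actual location of the cleaning code.
import unittest

class TestOutlierFilter(unittest.TestCase):
    def test_single_jump_is_removed(self):
        points = [{'mTime': i * 30,
                   'mLatitude': 37.87 + i * 1e-4,
                   'mLongitude': -122.27 + i * 1e-4} for i in range(20)]
        points[10]['mLatitude'] += 1.0   # inject one implausible jump

        from CFC_WebApp.main import tripManager   # placeholder import path
        cleaned = tripManager.discard_outliers(points)

        self.assertNotIn(points[10], cleaned)      # the jump is gone
        self.assertGreaterEqual(len(cleaned), 19)  # good points are kept

if __name__ == '__main__':
    unittest.main()
```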

shankari commented 9 years ago

@sdsingh, @jeffdh5, @gaukes, just sent you a patch of all the changes that I had to do so far.