MobilityNet / mobilitynet.github.io

BSD 3-Clause "New" or "Revised" License

Determine how to support running analyses on the dataset #24

Open singhish opened 3 years ago

singhish commented 3 years ago

Specifically, the notebooks ending in _master are failing due to how our current analysis pipeline is set up. A decision needs to be made regarding how to handle the analysis that is currently being done by using the analysis-related keys on E-Mission.

shankari commented 3 years ago

I can think of three options for how to do this:

shankari commented 3 years ago

Playing around with NEDB: an example notebook where:

Make sure to change the UUID and the timestamps to match data that you have actually loaded.

shankari commented 3 years ago

To use the timeseries, you will need to add the e-mission server directory to your PYTHONPATH.
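A minimal sketch of one way to do this from inside a notebook, without touching the shell environment. The `EMISSION_SERVER_DIR` environment variable, the `../e-mission-server` fallback path, and the `add_to_python_path` helper are all hypothetical names used for illustration; substitute wherever you actually cloned the server.

```python
import os
import sys


def add_to_python_path(directory):
    """Prepend `directory` to sys.path (once) and return its absolute path."""
    abs_dir = os.path.abspath(os.path.expanduser(directory))
    if abs_dir not in sys.path:
        # Insert at the front so this checkout wins over other installed copies.
        sys.path.insert(0, abs_dir)
    return abs_dir


# Assumption: the server checkout location is supplied via an environment
# variable, falling back to a sibling directory. Neither name comes from
# the e-mission project itself.
EMISSION_SERVER_DIR = os.environ.get("EMISSION_SERVER_DIR", "../e-mission-server")
add_to_python_path(EMISSION_SERVER_DIR)
```

Setting `PYTHONPATH` in the shell before launching Jupyter should have the same effect; the in-notebook version just keeps the setup visible alongside the analysis.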

shankari commented 3 years ago

Let's do the easy design first.

shankari commented 3 years ago

The goal of this project is to make it easier for others to come up with their own algorithms.

I think that the two options are:

We have time to implement one option, not two. Which should it be?

singhish commented 3 years ago

I'm leaning towards the second option -- having the e-mission algorithms published would make it easier for others to come up with their own algorithms based on our provided implementation. Refactoring the e-mission codebase so that core and storage are more compartmentalized would probably also make things easier for us to work with in the long run.

Am I headed in the right direction?

shankari commented 3 years ago

I'm actually leaning towards the first option. The main difference is that it is by no means clear to me that anybody wants to come up with new algorithms based on our provided implementation. I think that people want to start with the data, explore it, and try out ML libraries (keras, sklearn, etc) on it.

We have one customer request: did he ask for mongodump (which would have used the database and the existing algorithms) or files (which would be more in line with the ML library approach)?

singhish commented 3 years ago

I see, that makes sense. The customer asked for files @shankari.

shankari commented 3 years ago

@singhish ok, let's try to get a second data point with you pretending to be a customer since you are not as close to the data as I am.

Let's say you want to enter a challenge in which you need to segment a trip into multiple unimodal segments. There is an existing analysis pipeline in which that is one step; it is defined in https://github.com/e-mission/e-mission-server/blob/master/emission/pipeline/intake_stage.py, and the stage in question is at lines 129 to 135.

Would you prefer to work with notebooks that had a simpler embedded baseline, or try to work with that code to understand and improve it?
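To make the "simpler embedded baseline" side of that choice concrete, here is a rough sketch of what such a baseline might look like in a notebook. It is not the algorithm used by the e-mission pipeline: the `Point` type, the two mode labels, and the 2.5 m/s walking threshold are all assumptions picked purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Point:
    ts: float     # timestamp in seconds
    speed: float  # speed in metres per second


def label_mode(speed: float, walk_max: float = 2.5) -> str:
    """Label a single point by a speed threshold (assumed, not tuned)."""
    return "walking" if speed <= walk_max else "motorized"


def segment_trip(points: List[Point], walk_max: float = 2.5) -> List[Tuple[str, float, float]]:
    """Group consecutive points with the same label into (mode, start_ts, end_ts)."""
    segments: List[Tuple[str, float, float]] = []
    for p in points:
        mode = label_mode(p.speed, walk_max)
        if segments and segments[-1][0] == mode:
            # Same mode as the previous point: extend the current segment.
            prev_mode, start_ts, _ = segments[-1]
            segments[-1] = (prev_mode, start_ts, p.ts)
        else:
            # Mode changed (or first point): start a new segment.
            segments.append((mode, p.ts, p.ts))
    return segments


trip = [Point(0, 1.0), Point(10, 1.5), Point(20, 10.0), Point(30, 12.0), Point(40, 1.0)]
print(segment_trip(trip))
```

A baseline like this is easy to read and replace, but it ignores noise, stops at traffic lights, and everything else the full pipeline stage has to handle.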

singhish commented 3 years ago

This might be a personal thing, but as a developer, probably the latter, as working with code feels a lot nicer to me as opposed to dealing with the overhead associated with running a notebook. Data scientists might prefer the former though. @shankari

shankari commented 3 years ago

@singhish @jf87 if you are back from vacation, can you make the final call, since you have actually tried to work with MobilityNet before? If we don't hear back from @jf87 by Monday, we will go with option (1) by maintainer fiat.