singhish opened this issue 3 years ago
I can think of three options for how to do this:
first: keep the analysis in e-mission. To implement this, we would create a "development environment": a docker-compose.yml file that launches a mongodb instance and an e-mission-server instance, loads the data, and runs the pipeline (a sketch of such a compose file is below, after the pros and cons). Then we can retrieve the analysis results from the newly created e-mission-server instance and either compare directly or save to a file or ???. If people wanted to compare two algorithms, they would need to run two instances of the development environment (e.g., one on port 2323 and one on port 4545) and we could compare across them.
Pro: minimal dev work on our side
Con: (1) JSON file export is useless, (2) hard for users to do interactive analysis, (3) harder for users to work with a full system rather than just the ML component -- they can't use notebooks, for example
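For concreteness, here's a minimal sketch of what that compose file might look like. The image names, ports, and volume paths are assumptions for illustration, not the actual e-mission deployment configuration:

```yaml
# docker-compose.yml (sketch; image names, ports, and paths are hypothetical)
version: "3"
services:
  db:
    image: mongo:4.4
    ports:
      - "27017:27017"
  e-mission-server:
    # placeholder image name; in practice, build from the e-mission-server repo
    image: emission/e-mission-server:latest
    depends_on:
      - db
    environment:
      - DB_HOST=db
    ports:
      # change the host port (e.g. 2323 vs 4545) to run two instances side by side
      - "8080:8080"
    volumes:
      # mount the data to load and whatever script runs the intake pipeline
      - ./data:/data
```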
second: pull out the algorithms so they can run from files. Every algorithm reads input data from the database, processes it, and saves results back to the database. The reads/writes go through the Timeseries interface (https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/abstract_timeseries.py). If we provided an alternate implementation of the timeseries interface that worked with file data, we could just instantiate the algorithms with that data source and make it easier to run them in a notebook, for example (see the sketch after the pro/con below).
Con: more complex dev work on our side. @shankari will probably need to do at least some heavy lifting on the timeseries implementation; @singhish can change the references and fix bugs.
Pro: more modular, makes it easier for us to publish a challenge later, makes it easier for other potential collaborators to work with our data....
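To make the second option concrete, here is a minimal sketch of a file-backed timeseries, assuming the interface boils down to "find entries matching a key list and a time range". The class and method names are illustrative, not the exact contract from abstract_timeseries.py:

```python
# file_timeseries.py (sketch; names are illustrative, not the exact
# abstract_timeseries.py contract)
import json

class FileTimeSeries:
    """A timeseries-like data source backed by a JSON dump instead of mongodb."""

    def __init__(self, json_path):
        # expects a JSON array of entry objects, each with metadata.key and
        # metadata.write_ts fields, mirroring the database documents
        with open(json_path) as f:
            self.entries = json.load(f)

    def find_entries(self, key_list, start_ts, end_ts):
        """Return entries whose metadata.key is in key_list and whose
        metadata.write_ts falls in [start_ts, end_ts]."""
        return [
            e for e in self.entries
            if e["metadata"]["key"] in key_list
            and start_ts <= e["metadata"]["write_ts"] <= end_ts
        ]

# an algorithm written against this shape can then run in a notebook, e.g.:
# ts = FileTimeSeries("user_4aebf2e0.json")
# transitions = ts.find_entries(["statemachine/transition"], 1581005599, 1581048094)
```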
third?: it would be good to experiment with a potential third solution. I think that @singhish said that there was an embedded JSON database, similar to sqlite, that works with JSON files. Could using that database simplify the process of creating a new timeseries interface?
Playing around with NEDB: an example notebook. Check `/var/tmp/webserver.log` to find a query example, e.g.

```
2021-03-26 20:14:27,545:DEBUG:123145524649984:Found 14 messages in response to query {'user_id': UUID('4aebf2e0-f097-4845-8652-2ada3a76dadd'), '$or': [{'metadata.key': 'statemachine/transition'}], 'metadata.write_ts': {'$lte': 1581048094.196096, '$gte': 1581005599.791271}}
```

Make sure to change the UUID and the timestamps to data that you have actually loaded.
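If you want to replay that logged query outside the server, a pymongo sketch is below. The database and collection names (`Stage_database`/`Stage_timeseries`) are my assumption about the e-mission schema, so adjust them to whatever your instance actually uses:

```python
# replay the logged query directly against mongodb (sketch; db/collection
# names are assumptions -- check your instance)
from uuid import UUID
from pymongo import MongoClient

# newer pymongo versions may need uuidRepresentation="pythonLegacy" to
# encode UUID query values the way older servers stored them
client = MongoClient("localhost", 27017)
ts_coll = client["Stage_database"]["Stage_timeseries"]

# same filter as the webserver.log line above; substitute your own UUID
# and timestamps for data you have actually loaded
query = {
    "user_id": UUID("4aebf2e0-f097-4845-8652-2ada3a76dadd"),
    "$or": [{"metadata.key": "statemachine/transition"}],
    "metadata.write_ts": {"$lte": 1581048094.196096, "$gte": 1581005599.791271},
}
print(ts_coll.count_documents(query))  # the log above found 14 messages
```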
See `Timeseries_Sample.ipynb` for how to use the timeseries; you will need to add the e-mission server directory to your PYTHONPATH.
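For notebook use, one way to set that up is sketched below. The `esta`/`estt` module paths come from the abstract_timeseries link above; the exact `get_time_series`/`find_entries` signatures are my reading of that interface and may need adjusting:

```python
# inside the notebook: put the e-mission server checkout on the path
# (the checkout location is a placeholder)
import sys
sys.path.append("/path/to/e-mission-server")

# usage below is my reading of the abstract timeseries interface; verify
# the signatures against abstract_timeseries.py
import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.timeseries.timequery as estt
from uuid import UUID

ts = esta.TimeSeries.get_time_series(UUID("4aebf2e0-f097-4845-8652-2ada3a76dadd"))
entries = ts.find_entries(
    ["statemachine/transition"],
    estt.TimeQuery("metadata.write_ts", 1581005599.791271, 1581048094.196096),
)
```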
Let's do the easy design first.
We should make the algorithms installable via `pip install`, at least from a repo. Then we can just add them as dependencies to the emissioneval setup script, and everything will just work. This is the way to go.
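As a sketch of what "add them as dependencies" could look like, assuming a hypothetical emissioneval setup.py and a placeholder algorithm repo URL:

```python
# setup.py for emissioneval (sketch; the package name and repo URL are
# placeholders, not real published packages)
from setuptools import setup, find_packages

setup(
    name="emissioneval",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # pip can install a dependency straight from a git repo (PEP 508)
        "emission-algorithms @ git+https://github.com/e-mission/emission-algorithms.git",
    ],
)
```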
The goal of this project is to make it easier for others to come up with their own algorithms. I think that the two options are:
1. use the existing notebook-based naive implementations as the baseline
2. rework the e-mission algorithms so that they can be the baseline
We have time to implement one option, not two. Which should it be?
I'm leaning towards the second option -- having the e-mission algorithms published would make it easier for others to come up with their own algorithms based on our provided implementation. Refactoring the e-mission codebase so that core and storage are more compartmentalized would probably also make things easier for us to work with in the long run.
Am I headed in the right direction?
I'm actually leaning towards the first option. The main reason is that it is by no means clear to me that anybody wants to come up with new algorithms based on our provided implementation. I think that people want to start with the data, explore it, and try out ML libraries (`keras`, `sklearn`, etc.) on it.
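To illustrate the workflow I have in mind, here is a sketch of file-first exploration; the file name, column names, and features are entirely hypothetical, not the real export format:

```python
# explore exported trip data with pandas + sklearn (sketch; the file name
# and column names are hypothetical placeholders)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_json("trips.json")  # e.g. one row per sensed point or segment

# hypothetical features and label for a mode-inference experiment
X = df[["speed", "accuracy", "heading_change"]]
y = df["mode"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
clf = RandomForestClassifier().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```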
We have one customer request: did he ask for mongodump (which would have used the database and the existing algorithms) or files (which would be more in line with the ML library approach)?
I see, that makes sense. The customer asked for files @shankari.
@singhish ok, let's try to get a second data point with you pretending to be a customer since you are not as close to the data as I am.
Let's say you want to enter a challenge in which you need to segment a trip into multiple unimodal segments. There is an existing analysis pipeline in which that is one step: https://github.com/e-mission/e-mission-server/blob/master/emission/pipeline/intake_stage.py (that stage is at lines 129 to 135).
Would you prefer to work with notebooks that had a simpler embedded baseline, or try to work with that code to understand and improve it?
This might be a personal thing, but as a developer, probably the latter -- working with code feels a lot nicer to me than dealing with the overhead associated with running a notebook. Data scientists might prefer the former though. @shankari
@singhish @jf87 if you are back from vacation, can you make the final call, since you have actually tried to work with MobilityNet before? If we don't hear back from @jf87 by Monday, we will go with option (1) by maintainer fiat.
Specifically, the notebooks ending in `_master` are failing due to how our current analysis pipeline is set up. A decision needs to be made regarding how to handle the analysis that is currently done using the analysis-related keys on E-Mission.