singhish opened this issue 3 years ago
I can think of three options for how to do this:
first: keep the analysis in e-mission. To implement this, we would create a "development environment": a docker-compose.yml file that launches a mongodb instance and an e-mission-server instance, loads the data, and runs the pipeline (a sketch of such a compose file is below, after the pros and cons). Then we can retrieve the analysis results from the newly created e-mission-server instance and either compare directly or save to a file or ???. If people wanted to compare two algorithms, they would need to run two instances of the development environment (e.g., one on port 2323 and one on port 4545) and we could compare across them.
Pro: minimal dev work on our side
Con: (1) JSON file export is useless, (2) hard for users to do interactive analysis, (3) harder for users to work with a full system rather than just the ML component -- they can't use notebooks, for example
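For concreteness, here's a minimal sketch of what that compose file might look like. The image names, ports, and volume paths are assumptions for illustration, not the actual e-mission deployment configuration:

```yaml
# docker-compose.yml (sketch; image names, ports, and paths are hypothetical)
version: "3"
services:
  db:
    image: mongo:4.4
    ports:
      - "27017:27017"
  e-mission-server:
    # placeholder image name; in practice, build from the e-mission-server repo
    image: emission/e-mission-server:latest
    depends_on:
      - db
    environment:
      - DB_HOST=db
    ports:
      # change the host port (e.g. 2323 vs 4545) to run two instances side by side
      - "8080:8080"
    volumes:
      # mount the data to load and whatever script runs the intake pipeline
      - ./data:/data
```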
second: pull out the algorithms so they can run from files. Every algorithm reads input data from the database, processes it, and saves results back to the database. The reads/writes go through the Timeseries interface (https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/abstract_timeseries.py). If we provided an alternate implementation of the timeseries interface that worked with file data, we could just instantiate the algorithms with that data source and make it easier to run them in a notebook, for example (see the sketch after the pro/con below).
Con: more complex dev work on our side. @shankari will probably need to do at least some heavy lifting on the timeseries implementation; @singhish can change the references and fix bugs.
Pro: more modular, makes it easier for us to publish a challenge later, makes it easier for other potential collaborators to work with our data....
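To make the second option concrete, here is a minimal sketch of a file-backed timeseries, assuming the interface boils down to "find entries matching a key list and a time range". The class and method names are illustrative, not the exact contract from abstract_timeseries.py:

```python
# file_timeseries.py (sketch; names are illustrative, not the exact
# abstract_timeseries.py contract)
import json

class FileTimeSeries:
    """A timeseries-like data source backed by a JSON dump instead of mongodb."""

    def __init__(self, json_path):
        # expects a JSON array of entry objects, each with metadata.key and
        # metadata.write_ts fields, mirroring the database documents
        with open(json_path) as f:
            self.entries = json.load(f)

    def find_entries(self, key_list, start_ts, end_ts):
        """Return entries whose metadata.key is in key_list and whose
        metadata.write_ts falls in [start_ts, end_ts]."""
        return [
            e for e in self.entries
            if e["metadata"]["key"] in key_list
            and start_ts <= e["metadata"]["write_ts"] <= end_ts
        ]

# an algorithm written against this shape can then run in a notebook, e.g.:
# ts = FileTimeSeries("user_4aebf2e0.json")
# transitions = ts.find_entries(["statemachine/transition"], 1581005599, 1581048094)
```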
third?: it would be good to experiment with a potential third solution. I think that @singhish said that there was an embedded JSON database, similar to sqlite, that works with JSON files. Could using that database simplify the process of creating a new timeseries interface?
Playing around with NEDB: an example notebook. Check `/var/tmp/webserver.log` to find a query example, e.g.

```
2021-03-26 20:14:27,545:DEBUG:123145524649984:Found 14 messages in response to query {'user_id': UUID('4aebf2e0-f097-4845-8652-2ada3a76dadd'), '$or': [{'metadata.key': 'statemachine/transition'}], 'metadata.write_ts': {'$lte': 1581048094.196096, '$gte': 1581005599.791271}}
```

Make sure to change the UUID and the timestamps to data that you have actually loaded.
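If you want to replay that logged query outside the server, a pymongo sketch is below. The database and collection names (`Stage_database`/`Stage_timeseries`) are my assumption about the e-mission schema, so adjust them to whatever your instance actually uses:

```python
# replay the logged query directly against mongodb (sketch; db/collection
# names are assumptions -- check your instance)
from uuid import UUID
from pymongo import MongoClient

# newer pymongo versions may need uuidRepresentation="pythonLegacy" to
# encode UUID query values the way older servers stored them
client = MongoClient("localhost", 27017)
ts_coll = client["Stage_database"]["Stage_timeseries"]

# same filter as the webserver.log line above; substitute your own UUID
# and timestamps for data you have actually loaded
query = {
    "user_id": UUID("4aebf2e0-f097-4845-8652-2ada3a76dadd"),
    "$or": [{"metadata.key": "statemachine/transition"}],
    "metadata.write_ts": {"$lte": 1581048094.196096, "$gte": 1581005599.791271},
}
print(ts_coll.count_documents(query))  # the log above found 14 messages
```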
See `Timeseries_Sample.ipynb` for how to use the timeseries; you will need to add the e-mission server directory to your PYTHONPATH.
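For notebook use, one way to set that up is sketched below. The `esta`/`estt` module paths come from the abstract_timeseries link above; the exact `get_time_series`/`find_entries` signatures are my reading of that interface and may need adjusting:

```python
# inside the notebook: put the e-mission server checkout on the path
# (the checkout location is a placeholder)
import sys
sys.path.append("/path/to/e-mission-server")

# usage below is my reading of the abstract timeseries interface; verify
# the signatures against abstract_timeseries.py
import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.timeseries.timequery as estt
from uuid import UUID

ts = esta.TimeSeries.get_time_series(UUID("4aebf2e0-f097-4845-8652-2ada3a76dadd"))
entries = ts.find_entries(
    ["statemachine/transition"],
    estt.TimeQuery("metadata.write_ts", 1581005599.791271, 1581048094.196096),
)
```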
Let's do the easy design first.
We should make the algorithms installable via `pip install`, at least from a repo. Then we can just add them as dependencies to the emissioneval setup script, and everything will just work. This is the way to go.
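As a sketch of what "add them as dependencies" could look like, assuming a hypothetical emissioneval setup.py and a placeholder algorithm repo URL:

```python
# setup.py for emissioneval (sketch; the package name and repo URL are
# placeholders, not real published packages)
from setuptools import setup, find_packages

setup(
    name="emissioneval",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # pip can install a dependency straight from a git repo (PEP 508)
        "emission-algorithms @ git+https://github.com/e-mission/emission-algorithms.git",
    ],
)
```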
The goal of this project is to make it easier for others to come up with their own algorithms. I think that the two options are:
1. use the existing notebook-based naive implementations as the baseline
2. rework the e-mission algorithms so that they can be the baseline
We have time to implement one option, not two. Which should it be?
I'm leaning towards the second option -- having the e-mission algorithms published would make it easier for others to come up with their own algorithms based on our provided implementation. Refactoring the e-mission codebase so that core and storage are more compartmentalized would probably also make things easier for us to work with in the long run.
Am I headed in the right direction?
I'm actually leaning towards the first option. The main reason is that it is by no means clear to me that anybody wants to come up with new algorithms based on our provided implementation. I think that people want to start with the data, explore it, and try out ML libraries (`keras`, `sklearn`, etc.) on it.
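To illustrate the workflow I have in mind, here is a sketch of file-first exploration; the file name, column names, and features are entirely hypothetical, not the real export format:

```python
# explore exported trip data with pandas + sklearn (sketch; the file name
# and column names are hypothetical placeholders)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_json("trips.json")  # e.g. one row per sensed point or segment

# hypothetical features and label for a mode-inference experiment
X = df[["speed", "accuracy", "heading_change"]]
y = df["mode"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
clf = RandomForestClassifier().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```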
We have one customer request: did he ask for mongodump (which would have used the database and the existing algorithms) or files (which would be more in line with the ML library approach)?
I see, that makes sense. The customer asked for files @shankari.
@singhish ok, let's try to get a second data point with you pretending to be a customer since you are not as close to the data as I am.
Let's say you want to enter a challenge in which you need to segment a trip into multiple unimodal segments. There is an existing analysis pipeline in which that is one step: https://github.com/e-mission/e-mission-server/blob/master/emission/pipeline/intake_stage.py (that stage is at lines 129 to 135).
Would you prefer to work with notebooks that had a simpler embedded baseline, or try to work with that code to understand and improve it?
This might be a personal thing, but as a developer, probably the latter -- working with code feels a lot nicer to me than dealing with the overhead associated with running a notebook. Data scientists might prefer the former though. @shankari
@singhish @jf87 if you are back from vacation, can you make the final call, since you have actually tried to work with MobilityNet before? If we don't hear back from @jf87 by Monday, we will go with option (1) by maintainer fiat.
Specifically, the notebooks ending in `_master` are failing due to how our current analysis pipeline is set up. A decision needs to be made regarding how to handle the analysis that is currently done using the analysis-related keys on E-Mission.