shankari opened this issue 1 year ago
The current implementation of the model building part of the label assist code is at: https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/modelling/trip_model
There are previous implementations in https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/modelling/tour_model and https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/modelling/tour_model_first_only
The custom branch uses some code from tour_model or tour_model_first_only, since the TRB_label_assist work was being done in parallel with the trip_model rewrite. So we should eventually take the new and improved model(s) from TRB label assist and put them into trip_model.
For a high level summary of what the label assist model does, please see this poster: https://www.osti.gov/biblio/1922617
Two files, clustering.py and models.py, need to be updated for us to port the entire code to our main repo. I had two questions:
Q1: Among the vast number of models in the models.py file, we are currently using ClusterExtrapolationClassifier, ForestClassifier, and NaiveBinningClassifier in our performance_eval.py file. Out of these three, only NaiveBinningClassifier depends on code from tour_model. But there is another model (which we are currently not using), RefactoredNaiveCluster, which also depends on tour_model. Should we, for now, let that model stay dependent on tour_model, or port it to trip_model?
Q2: Is there any way I can check the working of trip_model by running it? Is there any module that uses it that I can/should run?
NOTE: regenerate_classification_performance_results.py is currently standalone and is not being used by any of the notebooks.
@humbleOldSage I just want to clarify that, while we will eventually port the code around the new models in the e-mission-eval-private-data repo to the e-mission-server repo, the first goal is to port the code required by the special branch of e-mission-server that this repo uses to the master branch of e-mission-server.
It works, but it requires us to use a custom fork of the e-mission-server repo.
We should see the changes between the custom fork and the current e-mission-server master/gis branch, and see if the changes are still necessary or whether there are implementations in e-mission-server that have superseded them.
clustering and models are files in e-mission-eval-private-data, not the server.
I apologize for not reading your comment more carefully. You are correct that if there is a model (e.g. RefactoredNaiveCluster) which is dependent on the tour model, we could choose to remove it instead of migrating it.
So I would suggest:
- verify that RefactoredNaiveCluster is indeed not used (put the results of grep here)
- let's not keep unused, bitrotted code around
Q2: Is there any way I can check the working of trip_model by running it? Is there any module that uses it that I can/should run?
Trip model has unit tests, and a script to launch it against a database. If you run it against the full CEO dataset, I would suggest launching over the weekend so that you can still use your computer 😄
$ grep -rl trip_model emission | grep -v __pycache__
...
emission/analysis/classification/inference/labels/inferrers.py
emission/analysis/modelling/trip_model/model_type.py
emission/analysis/modelling/trip_model/config.py
emission/analysis/modelling/trip_model/model_storage.py
emission/analysis/modelling/trip_model/run_model.py
emission/analysis/modelling/trip_model/greedy_similarity_binning.py
emission/tests/modellingTests/TestBackwardsCompat.py
emission/tests/modellingTests/TestRunGreedyIncrementalModel.py
emission/tests/modellingTests/TestRunGreedyModel.py
emission/tests/modellingTests/modellingTestAssets.py
emission/tests/modellingTests/TestGreedySimilarityBinning.py
...
NOTE: regenerate_classification_performance_results.py is currently standalone and is not being used by any of the notebooks.
That is by design - if you check the history:
- Note that these results can take very long (> 2 days) to regenerate. Running them from a notebook will either not print logs, or will print so many logs that the notebook buffers will be overwhelmed.
- Moving the computation code out to a separate script allows us to more easily redirect the output to a file and track the progress of the execution.
I missed an indirect dependence on RefactoredNaiveCluster, so we'll have to work on it as well. Apologies. The last two results of grep are calls from one of the models we are using, ClusterExtrapolationClassifier.
$ grep -r 'RefactoredNaiveCluster' TRB_label_assist/models.py
TRB_label_assist/models.py:class RefactoredNaiveCluster(Cluster):
TRB_label_assist/models.py: logging.info("PERF: Initializing RefactoredNaiveCluster")
TRB_label_assist/models.py: logging.info("PERF: Fitting RefactoredNaiveCluster with size %s" % len(train_df))
TRB_label_assist/models.py: logging.info("PERF: Predicting RefactoredNaiveCluster for %s" % len(test_df))
TRB_label_assist/models.py: logging.info("PERF: RefactoredNaiveCluster Working on trip %s/%s" % (idx, len(self.test_df)))
TRB_label_assist/models.py: self.end_cluster_model = RefactoredNaiveCluster(loc_type='end',
TRB_label_assist/models.py: self.start_cluster_model = RefactoredNaiveCluster(
I have been working with the sample data that the script generates for testing, for now. It should be enough for my purpose; if it isn't, I'll run it on the full dataset over the weekend. Also, I won't need to run all of them, just one test file, the TestGreedySimilarityBinning.py one.
It should be enough for my purpose.
Curious - what are you running the existing tests for? Is it to test the new functionality that you are moving into trip_model? If so, you should really write a new unit test (which can use similar sample data) instead of ad-hoc testing.
While I move the functionality over, I am just using them to verify my understanding of the code flow and the I/O of the trip_model modules. That's why I felt a small randomly generated sample should serve the purpose. Once done, I do plan on writing new unit tests, and may even test on the full CEO dataset.
The dependencies of the 2 files mentioned previously (clustering.py and models.py) are on the following functions in 2 modules of the custom branch:
i. the bin_helper function in the Similarity class from the similarity.py file
ii. the save_models and create_user_input_map functions from the build_save_model.py file
iii. the fit function, again from the Similarity class in the similarity.py file
For i., I probed the trip_model folder in the main branch, and we surely have this functionality already implemented there.
I can say this because, to understand the functionality and data flow on the custom branch side, I ran the dependent notebooks on the CEO dataset - i.e. clustering_examples.ipynb. Starting with this file, I noticed that it imports mapping.py, which in turn imports clustering.py, which in turn does import emission.analysis.modelling.tour_model_extended.similarity as eamts.
I then looked at the list of diffs in the extended branch https://github.com/e-mission/e-mission-server/compare/master...hlu109:e-mission-server:eval-private-data-compatibility, pulled the changes, and compared them using diff. The result is attached (as a gzip file).
This notebook only uses eamts.Similarity, which clusters based on the O-D of the trips using ecc.calDistance.
To match this understanding on the main branch side, I ran the existing unit test specifically for the modeling modules using the below command.
PYTHONPATH=. python -m unittest emission/tests/modellingTests/TestGreedySimilarityBinning.py
Both the trip_model rewrite and eamts.Similarity pass origin/destination latitudes and longitudes and rely on ecc.calDistance for the similarity computation (in trip_model, via the od_similarity.py file). This helped me reach the conclusion that the functionality and the end result of these two overlap.
However, these two work on different data forms: the custom branch uses data in data-frame format during processing, whereas the main branch uses the Entry type wrapper. My first thought was to convert the data frame to Entry type data and use it. But on further searching, I figured out that the data was already being converted from Entry type to data frame at some point before the similarity computations and visualization, so converting it back would be unnecessary processing.
So we just need to change the way data is passed to the functions, without breaking any other functionality.
We have not yet evaluated what the other changes between tour_model_extended and tour_model are, where they are used, and how they should be migrated to trip_model. After finishing this migration (ETA EOD tomorrow), I will submit a draft PR and start on the other two.
~All-in-all, we won't have to do new implementations here, just a few changes should do.~
To begin with the changes for (i) at https://github.com/e-mission/e-mission-eval-private-data/issues/35#issuecomment-1670358021, we start with the clustering.py file. I can see that the end result of the "naive" algorithm under the add_loc_clusters function is a column added to the data frame containing group labels (numeric values), at the line that reads:
loc_df.loc[:, f"{loc_type}_{alg}_clusters_{r}_m"] = label
where
- {loc_type} -> start, end, or trip
- {alg} -> DBSCAN, naive, OPTICS, fuzzy, or mean_shift
- {r} -> the grouping radius; currently takes the values 50, 100, and 150
These group labels are generated using the bin_helper function in the eamts.Similarity class. bin_helper calls match, which calls distance_helper, which calls within_radius, which calls ecc.calDistance to calculate the distance between two points.
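As a rough sketch of the end of that call chain (ecc.calDistance is the real helper from emission.core.common; this within_radius is a simplified stand-in for the custom branch's implementation, not its actual code):

```python
import emission.core.common as ecc

def within_radius(p1, p2, radius_m):
    # p1, p2 are [lon, lat] pairs in the custom branch's convention; two
    # points are considered similar if they are within radius_m of each other
    return ecc.calDistance(p1, p2) <= radius_m
```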
These generated labels are stored in two places:
1. Inside the eamts.Similarity class, they are stored in self.start_bins, self.end_bins, or self.trip_bins, depending on the value of the loc_type variable.
2. In the dataframe, the same result is stored in one of three new columns - start_bin / end_bin / trip_bin - and the other two are filled with NaNs.
The additional column (columns, if we pass multiple r values) that I referred to at the beginning of this comment copies the labels from (1) above and appends them to the data frame.
If we are able to generate these labels from the existing trip_model in the main branch, we'll be able to remove our dependence on the custom branch.
If we are able to generate these labels from the existing trip_model in the main branch, we'll be able to remove our dependence on the custom branch.
And what would it take to generate these labels from the existing trip_model in the main branch?
In the main branch, trip_model has similar functionality implemented in the _assign_bins function of the GreedySimilarityBinning class in the greedy_similarity_binning.py file.
However, rather than storing the bin label of each trip next to it, _assign_bins groups trips by their bin labels. In general, the data in the GreedySimilarityBinning class takes the form:
{
bin_id: {
"feature_rows": [
[f1, f2, .., fn],
...
],
"labels": [
{ label1: value1, ... }
],
"predictions": [
{ "labels": { label1: value1, ... }, 'p': p_val }
]
}
}
where
- bin_id: str index of a bin containing similar trips, as a string
(string type for bin_id comes from mongodb object key type requirements)
- f_x: float feature value (an ordinate such as origin.x)
- label_x: str OpenPATH user label category such as "mode_confirm"
- value_x: str user-provided label for a category
- p_val: float probability of a prediction, real number in [0, 1]
For example, below are some of the groups saved by _assign_bins on dummy data:
{'0': {'feature_rows':[[-4.958129127344314e-05, 8.768060347937052e-06, 0.999720312837768, 1.0001227632472012],
[-0.0002885431635348521, 4.2737498370605324e-05, 1.0002114424380646, 0.9998622709144951],
[0.00021187675902559317, -0.00031421641939745657, 0.9998863315113582, 1.0000619592251845],
[0.00041447629708705905, -0.0002639625392563102, 1.0001666270256024, 0.9995808207008069],
[-2.651656515482855e-05, 0.0004016395427976644, 0.9999933052671837, 0.9999978850795779]],
'labels': [{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'},
{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'work'},
{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'work'},
{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'},
{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'school'}],
'predictions': []},
'1': {'feature_rows': [[0.06332302074892233, 0.02417860787080751, 1.0195018436957517, 0.9714679398747385]],
'labels': [{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'work'}],
'predictions': []},
'2': {'feature_rows': [[0.055288906505610205, 0.016151126202667912, 1.0453033636120435, 0.9712560053027818]],
'labels': [{'mode_confirm': 'walk', 'replaced_mode': 'drive', 'purpose_confirm': 'home'}],
'predictions': []},
'3': {'feature_rows': [[0.0023381228366961128, -0.008375891067784268, 0.9628717886188353, 0.9681751183057479]],
'labels': [{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'}],
'predictions': []},
.
.
.
.
}
Here 0, 1, 2, and 3 are bin labels; the trips (actually, the features of the trips) belonging to each bin are grouped into a list.
We need the _assign_bins function to also give us the trips with their labels.
One possible way (to be included in the PR): the line that gets the matching bin for a trip is
bin_id = self._find_matching_bin_id(trip_features)
Beyond this line, the grouping of trips according to their labels takes place. I can introduce a self.binLabel class variable of type list that appends the bin id for the trip being processed, as in the sketch below. Once we have iterated over all the trips, we'll have their labels in a list, which we can return to TRB_label_assist and append to the data frame.
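A minimal sketch of that idea, assuming the structure of GreedySimilarityBinning shown earlier (extract_features, _find_matching_bin_id, and self.bins exist in greedy_similarity_binning.py, but the exact body of _assign_bins differs; self.binLabel is the proposed addition, and the label bookkeeping is simplified here):

```python
def _assign_bins(self, trips):
    self.binLabel = []  # proposed: bin id of each trip, in input order
    for trip in trips:
        trip_features = self.extract_features(trip)
        bin_id = self._find_matching_bin_id(trip_features)
        if bin_id is None:
            # no sufficiently similar bin exists yet, so start a new one
            bin_id = str(len(self.bins))
            self.bins[bin_id] = {"feature_rows": [], "labels": [], "predictions": []}
        self.bins[bin_id]["feature_rows"].append(trip_features)
        self.binLabel.append(bin_id)
```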
The changes needed in clustering.py in the TRB_label_assist module to link this with trip_model in the main branch would be:
1. Initiate a GreedySimilarityBinning type model with configs as:
model_config = {
"metric": "od_similarity",
"similarity_threshold_meters": r, # meters,
"apply_cutoff": False,
"incremental_evaluation": False
}
2. Pass the data to the fit function in the required format, i.e., List[ecwc.Confirmedtrip]. Currently, add_loc_clusters has the data in df form; we need to get the data into Confirmedtrip format so that it can be passed to the fit function.
3. Read the generated labels from self.tripLabels in the GreedySimilarityBinning class.

After discussions on this PR, particularly this suggestion, the focus is on reducing the load on the production-side e-mission-server. So we'll move the label computations to clustering.py.
More precisely, our aim now would be on the clustering.py side. We'll retrieve
trip_feature ---> binId (or label or bin no.)
style mappings from the
binId ---> [list of trip_features]
style structure which the GreedySimilarityBinning class provides us. Then, further, we map every trip_feature (along with its binId) to the correct trip in the dataframe.
For example, from grouped bins which look like this:
{'0': {'feature_rows':[[-4.958129127344314e-05, 8.768060347937052e-06, 0.999720312837768, 1.0001227632472012],
[-0.0002885431635348521, 4.2737498370605324e-05, 1.0002114424380646, 0.9998622709144951],
[0.00021187675902559317, -0.00031421641939745657, 0.9998863315113582, 1.0000619592251845],
[0.00041447629708705905, -0.0002639625392563102, 1.0001666270256024, 0.9995808207008069],
[-2.651656515482855e-05, 0.0004016395427976644, 0.9999933052671837, 0.9999978850795779]],
'labels': [{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'},
{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'work'},
{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'work'},
{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'},
{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'school'}],
'predictions': []},
'1': {'feature_rows': [[0.06332302074892233, 0.02417860787080751, 1.0195018436957517, 0.9714679398747385]],
'labels': [{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'work'}],
'predictions': []},
'2': {'feature_rows': [[0.055288906505610205, 0.016151126202667912, 1.0453033636120435, 0.9712560053027818]],
'labels': [{'mode_confirm': 'walk', 'replaced_mode': 'drive', 'purpose_confirm': 'home'}],
'predictions': []},
'3': {'feature_rows': [[0.0023381228366961128, -0.008375891067784268, 0.9628717886188353, 0.9681751183057479]],
'labels': [{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'}],
'predictions': []},
.
.
.
.
}
we retrieve the trip features with their bin ids, as below:
[-4.958129127344314e-05, 8.768060347937052e-06, 0.999720312837768, 1.0001227632472012] ---> {0}
[-0.0002885431635348521, 4.2737498370605324e-05, 1.0002114424380646, 0.9998622709144951] ---> {0}
[0.00021187675902559317, -0.00031421641939745657, 0.9998863315113582, 1.0000619592251845] ---> {0}
[0.00041447629708705905, -0.0002639625392563102, 1.0001666270256024, 0.9995808207008069] ---> {0}
[-2.651656515482855e-05, 0.0004016395427976644, 0.9999933052671837, 0.9999978850795779] ---> {0}
[0.06332302074892233, 0.02417860787080751, 1.0195018436957517, 0.9714679398747385] ---> {1}
[0.055288906505610205, 0.016151126202667912, 1.0453033636120435, 0.9712560053027818] ---> {2}
[0.0023381228366961128, -0.008375891067784268, 0.9628717886188353, 0.9681751183057479] ---> {3}
Once we have the trip_feature-to-bin mappings, we'll have to search for each trip (using the trip_feature that we have) in the dataframe for that loc_type and then add the bin label to the respective row; see the sketch below.
This would result in no changes in e-mission-server, with all the computations staying in e-mission-eval-private-data.
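A hedged sketch of that retrieval, assuming the fitted model exposes the bins structure shown above as model.bins (the attribute name is assumed), and that trip_features holds the feature rows in the same order as the rows of loc_df:

```python
# invert binId ---> [list of trip_features] into trip_feature ---> binId
feature_to_bin = {}
for bin_id, bin_record in model.bins.items():
    for features in bin_record["feature_rows"]:
        feature_to_bin[tuple(features)] = int(bin_id)

# exact float matching works here because the features come from the same rows
labels = [feature_to_bin[tuple(f)] for f in trip_features]
loc_df.loc[:, f"{loc_type}_naive_clusters_{r}_m"] = labels
```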
In clustering_examples.ipynb, the data is loaded in Entry format and then converted to a df. This df is passed around the modules for computations and visualisations. On the other hand, trip_model's fit function in greedy_similarity_binning.py accepts a list of ecwc.Confirmedtrip type data.
Looks like we can connect these two by additionally passing around the Confirmedtrip type data (along with the df) inside the notebook, and then passing only the Confirmedtrip data to the fit function (for generating the bin labels). Once we get the bin labels, we follow the steps in the previous comment to add them to the respective trips in the data frame.
In the process of linking them, I found that in the load data cell, esta.TimeSeries.get_time_series is called, which creates a BuiltinTimeSeries object. Then get_data_df from BuiltinTimeSeries is called, which calls find_entries and to_data_df. to_data_df is responsible for converting Entry to df by calling _to_df_entry.
Once back from these calls in the load_data cell, we have the dataframe. We could additionally keep the Entry type data around as well, but filter_labeled_trips and expand_userinputs work specifically on the df in the notebook before passing it to tour_model_extended's similarity. Another way would be to convert the df to Confirmedtrip data right before the tour_model_extended similarity call (which will be replaced by trip_model's fit).
After searching for a while, I can conclude that there is no functionality implemented to convert a df to Confirmedtrip type. However, one way I can think of to fulfill our goal using already-implemented functionality would be to generate dummy entries using the create_fake_entry function in /emission/core/wrapper/entry.py and then replace the dummy entries' contents with our data.
This is still a hypothesis and needs to be tested.
To test this hypothesis, I can check the column names in the trips generated by generate_mock_trips in TestGreedySimilarityBinning.py (which is basically a unit test) and check whether I have the same columns in the dataframe in clustering.py. If they are not the same, I can also reference the columns during the Entry-to-df conversion and see if certain columns were dropped for some reason there.
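A quick sketch of that check (generate_mock_trips lives in emission/tests/modellingTests/modellingTestAssets.py; its arguments are elided here and should be copied from the unit test, and expanded_trip_df stands in for the dataframe used in clustering.py):

```python
import emission.tests.modellingTests.modellingTestAssets as etmm

mock_trips = etmm.generate_mock_trips(...)     # arguments as in the unit test
mock_cols = set(mock_trips[0]["data"].keys())  # columns the model expects
df_cols = set(expanded_trip_df.columns)        # columns the notebook df has
print("in mock trips but not in the df:", mock_cols - df_cols)
```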
@humbleOldSage one of the entries in get_data_df is the object id (stored as _id). So you should be able to, in the analysis code, keep the list of trip entries around and, once the filtering is done, filter the list of entries based on the _ids retained in the filtered df. That way, you can pass in real confirmed trip objects instead of fiddling around with conversions: confirmed_trip -> data_df -> fake confirmed trip.
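A minimal sketch of that suggestion, assuming entries is the list of confirmed trip Entry objects kept from loading, filtered_df is the dataframe after filter_labeled_trips / expand_userinputs, and Entry objects support dict-style access to _id:

```python
kept_ids = set(filtered_df["_id"])
filtered_entries = [e for e in entries if e["_id"] in kept_ids]

# real confirmed trip objects can now go straight to trip_model's fit
model.fit(filtered_entries)
```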
If we are going to use dataframes extensively, we can also consider changing trip_model to take in a dataframe instead of a list of confirmed trips. I think that Rob structured the module this way so that the feature extraction could be modular. But if we pass in a dataframe, then the feature extraction would be column selection and/or new column creation from existing columns, possibly by using apply.
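For example, the O-D feature extraction would reduce to a column selection, with derived features added via apply (a sketch; the column names are assumptions):

```python
import emission.core.common as ecc

features = trips_df[["start_lat", "start_lon", "end_lat", "end_lon"]].to_numpy()

# a derived feature via apply, e.g. straight-line O-D distance
trips_df["od_distance"] = trips_df.apply(
    lambda row: ecc.calDistance([row["start_lon"], row["start_lat"]],
                                [row["end_lon"], row["end_lat"]]),
    axis=1)
```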
If we do decide to move to df in trip_model, it would just be column extraction (the columns with start_lat, start_lon, end_lat, and end_lon). However, we'll have to rewrite fit, which includes calls to _apply_cutoff() (the current configs pass apply_cutoff = False, so it is not being used, but it is part of the fit function) and _assign_bins (which includes the feature extraction as well). We can keep this option open in case nothing else works out.
Using the object id post filtering seems like a good alternative to conversions. I'll pick this route for now and move forward.
Just want to clarify that:
- if we do move to df-based processing, how _assign_bins takes in trips becomes less important
- again, taking the object id post filtering is fine for now; just realize that you might need to polish it again downstream
Then I think it makes more sense to switch to df-based processing from now itself. I'll roll back the Trip changes I did on my end and take the data-frame way.
@shankari I might be wrong here, but does the following make sense?
From my understanding, if we take the df way, then from TRB_label_assist's POV this essentially means we can replace trip_model (of the main branch) with tour_model_extended (from hlu109's custom branch). This is because all the operations in the custom branch (even the data structures declared) already use data-frames. If data-frame friendly operations in trip_model are what we aim to achieve, then currently they live in tour_model_extended. There is no need to reimplement all of trip_model's functions in df-friendly form to make TRB_label_assist work; simply use tour_model_extended in place of trip_model.
However, it is ESSENTIAL to point out that my understanding above is from TRB_label_assist's POV. Meaning, the implementations in tour_model_extended will make TRB_label_assist work. But it might be possible that there are dependencies on trip_model from modules other than TRB_label_assist (which I am not aware of currently) that are not implemented in the custom branch's tour_model_extended. So, if we decide to replace trip_model with tour_model_extended, we'll have to port that remaining functionality from trip_model over to tour_model_extended - call this upgraded_tour_model_extended - so that anything that currently uses trip_model can use upgraded_tour_model_extended instead.
The task, I believe, then MIGHT translate to upgrading tour_model_extended to be an exact df version of trip_model: every function of trip_model implemented in df format.
A visual explanation (figures attached):
Thank you for the visual representation. I agree with the first two images, but I don't understand/don't agree with the third. What do you mean by an upgraded tour_model replacing the trip_model?
The interface of trip_model will take a dataframe, so I guess it is upgraded in that sense. But I don't think that it is a good idea to remove all the software engineering around trip_model and just replace it with the expanded tour model.
Again, having the label assist use trip_model instead of tour model was intended as an easy-to-implement first step to get you familiar with the codebase.
From the original project definition: https://github.com/e-mission/e-mission-eval-private-data/issues/35#issue-1819040402
After we are done with this, we should integrate the random forest model + prediction into the e-mission server code for label assist.
Trip model is structured so that it can support multiple algorithms in parallel in a modular fashion. The bin-based algorithm that is currently implemented in expanded tour model is just one of them. The goal is to just change that implementation to the expanded version, not replace trip_model entirely.
The figures generated after 97406c437f5fedb37abf96f65c04480c6c6f78b7 are different from those in the paper. Below are the screenshots: to the left are the ones from the commit I mentioned; to the right are the ones from the paper.
[Screenshot comparisons: suburban area at 50 m, 100 m, and 150 m; college campus]
In the case of the college campus, I am unable to determine which area of the map the screenshot (in the paper) is from. However, even if I just compare the old custom branch and the recent commit, they differ.
It is from near LA - between LA and Irvine.
I think one of the reasons why the labels are different is the format of the input to the ecc.calDistance function. The custom branch passes the points in [lon, lat] format, while the main branch uses [lat, lon] format for this. However, by itself, this doesn't seem to have solved the issue. I am still looking for what else could be causing it.
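To illustrate the mismatch (the values are made up):

```python
point = {"lat": 39.75, "lon": -105.22}
custom_branch_point = [point["lon"], point["lat"]]  # [lon, lat]
main_branch_point = [point["lat"], point["lon"]]    # [lat, lon]
# mixing the two orderings in a single ecc.calDistance call
# silently yields the wrong distance
```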
The other thing that was causing the issues is the similar function in the eamms.SimilarityMetric module, particularly line 40. It calculates similarity based on both origin and destination, when we might require origin-only, destination-only, or origin-and-destination points.
Looking at these two things together, it is just the second one that was causing the problem; the first one is correct in itself.
Once I make just the second change, the suburban results for r = 50, 100, and 150 match the paper. Let me check the college ones as well.
It works for the college campus as well.
In general, for this, it is fine to make the change in the master branch, because we don't have to use it in production, but we can if we want to. We should just make sure that it is configurable, so we can continue using O-D on production but use the other options for evaluation only for now.
In the timeseries, we did not want to make the change in master because there isn't only one set of entries we want to read, so the change was not modular enough. In this case, trip_model is set up to be modular, so we can just use that by adding new modules.
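Concretely, the new origin-only / destination-only metrics could then be selected through the existing model config; a sketch, where every metric name other than "od_similarity" is a hypothetical addition:

```python
model_config = {
    "metric": "origin_similarity",  # hypothetical new module; production keeps
                                    # "od_similarity" as its default
    "similarity_threshold_meters": r,
    "apply_cutoff": False,
    "incremental_evaluation": False
}
```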
With the changes to the fit function so far, there are dependencies in almost all the notebooks and files that need to be changed.
classification_performance.ipynb: depends on cv_for_all_algs from performance_eval, which calls the fit functions of ClusterExtrapolationClassifier, which calls RefactoredNaiveCluster, which depends on eamts.Similarity from the custom branch.
I've shifted this to eamtg.GreedySimilarityBinning (as done in clustering.py). In doing so, one way is to change the parameters of the fit function of all three - ClusterExtrapolationClassifier, ForestClassifier, and NaiveBinningClassifier - to receive the Entry type data corresponding to the dataframe from the notebook. But not all three need the Entry type data.
The other, and better, way is to check if model_ is an instance of ClusterExtrapolationClassifier and then call a different fit function for it, as sketched below.
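A sketch of that dispatch (the two-argument fit variant is the hypothetical part; the follow-up comment below pushes back on this pattern in favor of consistent interfaces):

```python
if isinstance(model_, ClusterExtrapolationClassifier):
    # this model additionally needs the Entry-type trips
    model_.fit(train_df, train_entries)
else:
    model_.fit(train_df)
```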
While running regenerate_classification_performance_results.py, the following messages occur while running the NaiveBinningClassifier model (named fixed-width (O-D)):
running cross validation for model: fixed-width (O-D)
2023-08-28 03:45:35,441 INFO ------ START: predictions for user 00db212b-c8d0-44cd-8392-41ab4065e603 and model <class 'models.NaiveBinningClassifier'>
2023-08-28 03:45:35,454 DEBUG num trips 724
2023-08-28 03:45:35,471 INFO ----- Building model <class 'models.NaiveBinningClassifier'> for fold 0
2023-08-28 03:45:35,472 INFO PERF: Initializing NaiveBinningClassifier
2023-08-28 03:45:35,512 INFO About to fit the model <class 'models.NaiveBinningClassifier'>
2023-08-28 03:45:35,512 INFO PERF: Fitting NaiveBinningClassifier
2023-08-28 03:45:43,605 INFO About to generate predictions for the model <class 'models.NaiveBinningClassifier'>
2023-08-28 03:45:43,605 INFO PERF: Predicting NaiveBinningClassifier
2023-08-28 03:45:43,669 DEBUG getting key model_type in config
2023-08-28 03:45:43,669 DEBUG getting key model_storage in config
2023-08-28 03:45:43,691 DEBUG no GREEDY_SIMILARITY_BINNING model found for user 00db212b-c8d0-44cd-8392-41ab4065e603
2023-08-28 03:45:43,691 DEBUG In predict_cluster_confidence_discounting: n=-1; returning as-is
.
.
.
.
.
2023-08-28 03:45:44,545 DEBUG getting key model_type in config
2023-08-28 03:45:44,545 DEBUG getting key model_storage in config
2023-08-28 03:45:44,549 DEBUG no GREEDY_SIMILARITY_BINNING model found for user 00db212b-c8d0-44cd-8392-41ab4065e603
2023-08-28 03:45:44,549 DEBUG In predict_cluster_confidence_discounting: n=-1; returning as-is
2023-08-28 03:45:44,549 DEBUG getting key model_type in config
2023-08-28 03:45:44,550 DEBUG getting key model_storage in config
2023-08-28 03:45:44,554 DEBUG no GREEDY_SIMILARITY_BINNING model found for user 00db212b-c8d0-44cd-8392-41ab4065e603
2023-08-28 03:45:44,554 DEBUG In predict_cluster_confidence_discounting: n=-1; returning as-is
2023-08-28 03:45:44,556 ERROR skipping user 00db212b-c8d0-44cd-8392-41ab4065e603 due to error: ValueError('attempt to get argmax of an empty sequence')
Traceback (most recent call last):
File "/Users/ssaini/Documents/GitHub/e-mission-eval-private-data/TRB_label_assist/performance_eval.py", line 243, in cv_for_all_users
min_samples=min_samples)
File "/Users/ssaini/Documents/GitHub/e-mission-eval-private-data/TRB_label_assist/performance_eval.py", line 179, in cross_val_predict
pred_df = model_.predict(test_trips)
File "/Users/ssaini/Documents/GitHub/e-mission-eval-private-data/TRB_label_assist/models.py", line 192, in predict
proba_df = self.predict_proba(test_df)
File "/Users/ssaini/Documents/GitHub/e-mission-eval-private-data/TRB_label_assist/models.py", line 742, in predict_proba
axis=1)
File "/Users/ssaini/miniconda-4.8.3/envs/emission-private-eval/lib/python3.7/site-packages/pandas/core/frame.py", line 8861, in idxmax
indices = nanops.nanargmax(self.values, axis=axis, skipna=skipna)
File "/Users/ssaini/miniconda-4.8.3/envs/emission-private-eval/lib/python3.7/site-packages/pandas/core/nanops.py", line 71, in _f
return f(*args, **kwargs)
File "/Users/ssaini/miniconda-4.8.3/envs/emission-private-eval/lib/python3.7/site-packages/pandas/core/nanops.py", line 924, in nanargmax
result = values.argmax(axis)
ValueError: attempt to get argmax of an empty sequence
On investigating, it seems the model was fit and saved correctly. predict_proba is called in NaiveBinningClassifier, which has a call to predict_cluster_confidence_discounting in emission.analysis.classification.inference.labels.inferrers. Inside this is a call to predict_labels_with_n, which initiates the GreedySimilarityBinning call. However, a different model, similarity.similarity from tour_model, was used for fitting the data and was saved. This seems to be causing the issue.
To tackle this, we can use GreedySimilarityBinning from trip_model (with a fixed radius) for fitting, and save that model. Then the load would work similarly.
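A hedged sketch of doing that through the trip_model entry points (the module aliases follow e-mission-server conventions; the exact update_trip_model signature may differ):

```python
import emission.analysis.modelling.trip_model.model_type as eamumt
import emission.analysis.modelling.trip_model.model_storage as eamums
import emission.analysis.modelling.trip_model.run_model as eamur

# fit and save a GREEDY_SIMILARITY_BINNING model for this user, so that
# predict_cluster_confidence_discounting can later find and load it
eamur.update_trip_model(user_id,
                        model_type=eamumt.ModelType.GREEDY_SIMILARITY_BINNING,
                        model_storage=eamums.ModelStorage.DOCUMENT_DATABASE)
```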
The other, and better, way is to check if model_ is an instance of ClusterExtrapolationClassifier and then call a different fit function for it.
I am not sure that this is better. Your instinct is to have if/else everywhere; that is not necessarily always correct.
Consistent interfaces are sometimes better than avoiding new code (but not always).
To tackle this, we can use GreedySimilarityBinning from trip_model (with a fixed radius) for fitting, and save that model. Then the load would work similarly.
So have you actually done this yet?
Yeah, this is already done in commit a34836faf364f9548d3ab5373c8efd0d1e729a53.
Before the commit, the regenerate_classification_performance_results.py file wouldn't run and threw an error for the reason mentioned above. It is now working and has generated the cross-validation CSVs.
Let me run classification_performance.ipynb again and post the performance graphs here.
The plots below are from the paper and from the latest run of the code. They match.
Accuracy and weighted F-score for each model:
[Plot from the paper]
[Plot from classification_performance.ipynb after the changes]
Weighted F-score for purpose prediction:
[Plot from the paper]
[Plot from classification_performance.ipynb after the changes]
The code is in: https://github.com/e-mission/e-mission-eval-private-data/tree/master/TRB_label_assist. It works, but it requires us to use a custom fork of the e-mission-server repo. We should see the changes between the custom fork and the current e-mission-server master/gis branch, see if the changes are still necessary or whether there are implementations in e-mission-server that have superseded them, and, if they have not been superseded, incorporate them in e-mission-server.
So at the end, this should be reproducible against a core e-mission-server repo.
Stretch goal: change this repo to work with docker containers that are built with the base e-mission-server container, so that people don't have to install e-mission-server, set the EMISSION_SERVER_PATH, and do all the funky setup steps.
After we are done with this, we should integrate the random forest model + prediction into the e-mission-server code for label assist.