shankari opened this issue 1 year ago
The current implementation of the model building part of the label assist code is at: https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/modelling/trip_model
There are previous implementations in https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/modelling/tour_model and https://github.com/e-mission/e-mission-server/tree/master/emission/analysis/modelling/tour_model_first_only
The custom branch uses some code from tour_model or tour_model_first_only, since the TRB_label_assist work was being done in parallel with the trip_model rewrite. So we should eventually take the new and improved model(s) from TRB label assist and put them into trip_model.
For a high level summary of what the label assist model does, please see this poster: https://www.osti.gov/biblio/1922617
Two files, clustering.py and models.py, need to be updated for us to port the entire code to our main repo. I had two questions:
Q1: Among the vast number of models in the models.py file, we are currently using ClusterExtrapolationClassifier, ForestClassifier, and NaiveBinningClassifier in our performance_eval.py file. Out of these three, only NaiveBinningClassifier depends on code from tour_model. But there is another model (which we are currently not using), RefactoredNaiveCluster, which also depends on tour_model. Should we, for now, let that model stay dependent on tour_model, or port it to trip_model?
Q2: Is there any way I can check the working of trip_model by running it? Is there any module that uses it that I can/should run?
NOTE: regenerate_classification_performance_results.py is currently standalone and is not being used by any of the notebooks.
@humbleOldSage I just want to clarify that, while we will eventually port the code around the new models in the e-mission-eval-private-data repo to the e-mission-server repo, the first goal is to port the code required by the special branch of e-mission-server that this repo uses to the master branch of e-mission-server.
It works, but it requires us to use a custom fork of the e-mission-server repo.
We should see the changes between the custom fork and the current e-mission-server master/gis branch, and see if the changes are still necessary or whether there are implementations in e-mission-server that have superseded them.
clustering and models are files in e-mission-eval-private-data, not the server.
I apologize for not reading your comment more carefully. You are correct that if there is a model (e.g. RefactoredNaiveCluster) which is dependent on the tour model, we could choose to remove it instead of migrating it.
So I would suggest:
- verify that RefactoredNaiveCluster is indeed not used (put the results of grep here)
- let's not keep unused, bitrotted code around
Q2: Is there any way I can check the working of trip_model by running it? Is there any module that uses it that I can/should run?
Trip model has unit tests, and a script to launch it against a database. If you run it against the full CEO dataset, I would suggest launching over the weekend so that you can still use your computer 😄
$ grep -rl trip_model emission | grep -v __pycache__
...
emission/analysis/classification/inference/labels/inferrers.py
emission/analysis/modelling/trip_model/model_type.py
emission/analysis/modelling/trip_model/config.py
emission/analysis/modelling/trip_model/model_storage.py
emission/analysis/modelling/trip_model/run_model.py
emission/analysis/modelling/trip_model/greedy_similarity_binning.py
emission/tests/modellingTests/TestBackwardsCompat.py
emission/tests/modellingTests/TestRunGreedyIncrementalModel.py
emission/tests/modellingTests/TestRunGreedyModel.py
emission/tests/modellingTests/modellingTestAssets.py
emission/tests/modellingTests/TestGreedySimilarityBinning.py
...
NOTE: regenerate_classification_performance_results.py is currently standalone and is not being used by any of the notebooks.
That is by design - if you check the history:
- Note that these results can take very long (> 2 days) to regenerate. Running them from a notebook will either not print logs, or will print so many logs that the notebook buffers will be overwhelmed.
- Moving the computation code out to a separate script allows us to more easily redirect the output to a file and track the progress of the execution.
I missed an indirect dependence on RefactoredNaiveCluster, so we'll have to work on it as well. Apologies. The last two results of grep are calls from one of the models we are using, ClusterExtrapolationClassifier.
$ grep -r 'RefactoredNaiveCluster' TRB_label_assist/models.py
TRB_label_assist/models.py:class RefactoredNaiveCluster(Cluster):
TRB_label_assist/models.py: logging.info("PERF: Initializing RefactoredNaiveCluster")
TRB_label_assist/models.py: logging.info("PERF: Fitting RefactoredNaiveCluster with size %s" % len(train_df))
TRB_label_assist/models.py: logging.info("PERF: Predicting RefactoredNaiveCluster for %s" % len(test_df))
TRB_label_assist/models.py: logging.info("PERF: RefactoredNaiveCluster Working on trip %s/%s" % (idx, len(self.test_df)))
TRB_label_assist/models.py: self.end_cluster_model = RefactoredNaiveCluster(loc_type='end',
TRB_label_assist/models.py: self.start_cluster_model = RefactoredNaiveCluster(
I have been working with the sample data that the script generates for testing, for now. It should be enough for my purpose; if it isn't, I'll run it on the full dataset over the weekend. Also, I won't need to run all of them, just one test file, the TestGreedySimilarityBinning.py one.
It should be enough for my purpose.
Curious - what are you running the existing tests for? Is it to test the new functionality that you are moving into trip_model? If so, you should really write a new unit test (which can use similar sample data) instead of ad-hoc testing.
While I move the functionality over, I am just using them to verify my understanding of the code flow and the I/O of the trip_model modules. That's why I felt a small randomly generated sample should serve the purpose. Once done, I do plan on writing new unit tests, and may even test on the full CEO dataset.
The dependencies of the 2 files mentioned previously (clustering.py and models.py) are on the following functions in 2 modules of the custom branch:
i. the bin_helper function in the Similarity class from the similarity.py file
ii. the save_models and create_user_input_map functions from the build_save_model.py file
iii. the fit function, again from the Similarity class in the similarity.py file
For i., I probed the trip_model folder in the main branch, and we surely have this functionality already implemented there.
I can say this because, to understand the functionality and data flow on the custom branch side, I ran the dependent notebooks on the CEO dataset - i.e. clustering_examples.ipynb. Starting with this file, I noticed that it imports mapping.py, which in turn imports clustering.py, which in turn does import emission.analysis.modelling.tour_model_extended.similarity as eamts.
I then looked at the list of diffs in the extended branch https://github.com/e-mission/e-mission-server/compare/master...hlu109:e-mission-server:eval-private-data-compatibility, pulled the changes, and compared them using diff. The result is attached (as a gzip file).
This notebook only uses eamts.Similarity, which clusters based on the O-D of the trips using ecc.calDistance.
To match this understanding on the main branch side, I ran the existing unit test specifically for the modeling modules using the below command.
PYTHONPATH=. python -m unittest emission/tests/modellingTests/TestGreedySimilarityBinning.py
Both the trip_model rewrite and eamts.Similarity pass origin/destination latitudes and longitudes and rely on ecc.calDistance for the similarity computation (in trip_model, via the od_similarity.py file). This helped me reach the conclusion that the functionality and the end result of these two overlap.
However, these two work on different data forms: the custom branch uses data in data-frame format during processing, whereas the main branch uses the Entry type wrapper. My first thought was to convert the data frame to Entry type data and use it. But on further searching, I figured out that the data was already being converted from Entry type to data frame at some point before the similarity computations and visualization, so converting it back would be unnecessary processing.
So we just need to change the way data is passed to the functions, without breaking any other functionality.
We have not yet evaluated what the other changes between tour_model_extended and tour_model are, where they are used, and how they should be migrated to trip_model. After finishing this migration (ETA EOD tomorrow), I will submit a draft PR and start on the other two.
~All-in-all, we won't have to do new implementations here, just a few changes should do.~
To begin with the changes for (i) at https://github.com/e-mission/e-mission-eval-private-data/issues/35#issuecomment-1670358021, we start with the clustering.py file. I can see that the end result of the "naive" algorithm under the add_loc_clusters function is a column added to the data frame containing group labels (numeric values), at the line that reads:
loc_df.loc[:, f"{loc_type}_{alg}_clusters_{r}_m"] = label
where
- {loc_type} -> start, end, or trip
- {alg} -> DBSCAN, naive, OPTICS, fuzzy, or mean_shift
- {r} -> the grouping radius; currently takes the values 50, 100, and 150
These group labels are generated using the bin_helper function in the eamts.Similarity class. bin_helper calls match, which calls distance_helper, which calls within_radius, which calls ecc.calDistance to calculate the distance between two points.
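As a rough sketch of the end of that call chain (ecc.calDistance is the real helper from emission.core.common; this within_radius is a simplified stand-in for the custom branch's implementation, not its actual code):

```python
import emission.core.common as ecc

def within_radius(p1, p2, radius_m):
    # p1, p2 are [lon, lat] pairs in the custom branch's convention; two
    # points are considered similar if they are within radius_m of each other
    return ecc.calDistance(p1, p2) <= radius_m
```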
These generated labels are stored in two places:
1. Inside the eamts.Similarity class, they are stored in self.start_bins, self.end_bins, or self.trip_bins, depending on the value of the loc_type variable.
2. In the dataframe, the same result is stored in one of three new columns - start_bin / end_bin / trip_bin - and the other two are filled with NaNs.
The additional column (columns, if we pass multiple r values) that I referred to at the beginning of this comment copies the labels from (1) above and appends them to the data frame.
If we are able to generate these labels from the existing trip_model in the main branch, we'll be able to remove our dependence on the custom branch.
If we are able to generate these labels from the existing trip_model in the main branch, we'll be able to remove our dependence on the custom branch.
And what would it take to generate these labels from the existing trip_model in the main branch?
In the main branch, trip_model has similar functionality implemented in the _assign_bins function of the GreedySimilarityBinning class in the greedy_similarity_binning.py file.
However, rather than storing the bin label of each trip next to it, _assign_bins groups trips by their bin labels. In general, the data in the GreedySimilarityBinning class takes the form:
{
bin_id: {
"feature_rows": [
[f1, f2, .., fn],
...
],
"labels": [
{ label1: value1, ... }
],
"predictions": [
{ "labels": { label1: value1, ... }, 'p': p_val }
]
}
}
where
- bin_id: str index of a bin containing similar trips, as a string
(string type for bin_id comes from mongodb object key type requirements)
- f_x: float feature value (an ordinate such as origin.x)
- label_x: str OpenPATH user label category such as "mode_confirm"
- value_x: str user-provided label for a category
- p_val: float probability of a prediction, real number in [0, 1]
For example, below are some of the groups saved by _assign_bins on dummy data:
{'0': {'feature_rows':[[-4.958129127344314e-05, 8.768060347937052e-06, 0.999720312837768, 1.0001227632472012],
[-0.0002885431635348521, 4.2737498370605324e-05, 1.0002114424380646, 0.9998622709144951],
[0.00021187675902559317, -0.00031421641939745657, 0.9998863315113582, 1.0000619592251845],
[0.00041447629708705905, -0.0002639625392563102, 1.0001666270256024, 0.9995808207008069],
[-2.651656515482855e-05, 0.0004016395427976644, 0.9999933052671837, 0.9999978850795779]],
'labels': [{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'},
{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'work'},
{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'work'},
{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'},
{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'school'}],
'predictions': []},
'1': {'feature_rows': [[0.06332302074892233, 0.02417860787080751, 1.0195018436957517, 0.9714679398747385]],
'labels': [{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'work'}],
'predictions': []},
'2': {'feature_rows': [[0.055288906505610205, 0.016151126202667912, 1.0453033636120435, 0.9712560053027818]],
'labels': [{'mode_confirm': 'walk', 'replaced_mode': 'drive', 'purpose_confirm': 'home'}],
'predictions': []},
'3': {'feature_rows': [[0.0023381228366961128, -0.008375891067784268, 0.9628717886188353, 0.9681751183057479]],
'labels': [{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'}],
'predictions': []},
.
.
.
.
}
Here 0, 1, 2, and 3 are bin labels; the trips (actually, the features of the trips) belonging to each bin are grouped into a list.
We need the _assign_bins function to also give us the trips with their labels.
One possible way (to be included in the PR): the line that gets the matching bin for a trip is
bin_id = self._find_matching_bin_id(trip_features)
Beyond this line, the grouping of trips according to their labels takes place. I can introduce a self.binLabel class variable of type list that appends the bin id for the trip being processed, as in the sketch below. Once we have iterated over all the trips, we'll have their labels in a list, which we can return to TRB_label_assist and append to the data frame.
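A minimal sketch of that idea, assuming the structure of GreedySimilarityBinning shown earlier (extract_features, _find_matching_bin_id, and self.bins exist in greedy_similarity_binning.py, but the exact body of _assign_bins differs; self.binLabel is the proposed addition, and the label bookkeeping is simplified here):

```python
def _assign_bins(self, trips):
    self.binLabel = []  # proposed: bin id of each trip, in input order
    for trip in trips:
        trip_features = self.extract_features(trip)
        bin_id = self._find_matching_bin_id(trip_features)
        if bin_id is None:
            # no sufficiently similar bin exists yet, so start a new one
            bin_id = str(len(self.bins))
            self.bins[bin_id] = {"feature_rows": [], "labels": [], "predictions": []}
        self.bins[bin_id]["feature_rows"].append(trip_features)
        self.binLabel.append(bin_id)
```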
The changes needed in clustering.py in the TRB_label_assist module to link this with trip_model in the main branch would be:
1. Initiate a GreedySimilarityBinning type model with configs as:
model_config = {
"metric": "od_similarity",
"similarity_threshold_meters": r, # meters,
"apply_cutoff": False,
"incremental_evaluation": False
}
2. Pass the data to the fit function in the required format, i.e., List[ecwc.Confirmedtrip]. Currently, add_loc_clusters has the data in df form; we need to get the data into Confirmedtrip format so that it can be passed to the fit function.
3. Read the generated labels from self.tripLabels in the GreedySimilarityBinning class.

After discussions on this PR, particularly this suggestion, the focus is on reducing the load on the production-side e-mission-server. So we'll move the label computations to clustering.py.
More precisely, our aim now would be on the clustering.py side. We'll retrieve
trip_feature ---> binId (or label or bin no.)
style mappings from the
binId ---> [list of trip_features]
style structure which the GreedySimilarityBinning class provides us. Then, further, we map every trip_feature (along with its binId) to the correct trip in the dataframe.
For example, from grouped bins which look like this:
{'0': {'feature_rows':[[-4.958129127344314e-05, 8.768060347937052e-06, 0.999720312837768, 1.0001227632472012],
[-0.0002885431635348521, 4.2737498370605324e-05, 1.0002114424380646, 0.9998622709144951],
[0.00021187675902559317, -0.00031421641939745657, 0.9998863315113582, 1.0000619592251845],
[0.00041447629708705905, -0.0002639625392563102, 1.0001666270256024, 0.9995808207008069],
[-2.651656515482855e-05, 0.0004016395427976644, 0.9999933052671837, 0.9999978850795779]],
'labels': [{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'},
{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'work'},
{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'work'},
{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'},
{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'school'}],
'predictions': []},
'1': {'feature_rows': [[0.06332302074892233, 0.02417860787080751, 1.0195018436957517, 0.9714679398747385]],
'labels': [{'mode_confirm': 'bike', 'replaced_mode': 'drive', 'purpose_confirm': 'work'}],
'predictions': []},
'2': {'feature_rows': [[0.055288906505610205, 0.016151126202667912, 1.0453033636120435, 0.9712560053027818]],
'labels': [{'mode_confirm': 'walk', 'replaced_mode': 'drive', 'purpose_confirm': 'home'}],
'predictions': []},
'3': {'feature_rows': [[0.0023381228366961128, -0.008375891067784268, 0.9628717886188353, 0.9681751183057479]],
'labels': [{'mode_confirm': 'transit', 'replaced_mode': 'drive', 'purpose_confirm': 'school'}],
'predictions': []},
.
.
.
.
}
we retrieve the trip features with their bin ids, as below:
[-4.958129127344314e-05, 8.768060347937052e-06, 0.999720312837768, 1.0001227632472012] ---> {0}
[-0.0002885431635348521, 4.2737498370605324e-05, 1.0002114424380646, 0.9998622709144951] ---> {0}
[0.00021187675902559317, -0.00031421641939745657, 0.9998863315113582, 1.0000619592251845] ---> {0}
[0.00041447629708705905, -0.0002639625392563102, 1.0001666270256024, 0.9995808207008069] ---> {0}
[-2.651656515482855e-05, 0.0004016395427976644, 0.9999933052671837, 0.9999978850795779] ---> {0}
[0.06332302074892233, 0.02417860787080751, 1.0195018436957517, 0.9714679398747385] ---> {1}
[0.055288906505610205, 0.016151126202667912, 1.0453033636120435, 0.9712560053027818] ---> {2}
[0.0023381228366961128, -0.008375891067784268, 0.9628717886188353, 0.9681751183057479] ---> {3}
Once we have the trip_feature-to-bin mappings, we'll have to search for each trip (using the trip_feature that we have) in the dataframe for that loc_type and then add the bin label to the respective row; see the sketch below.
This would result in no changes in e-mission-server, with all the computations staying in e-mission-eval-private-data.
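A hedged sketch of that retrieval, assuming the fitted model exposes the bins structure shown above as model.bins (the attribute name is assumed), and that trip_features holds the feature rows in the same order as the rows of loc_df:

```python
# invert binId ---> [list of trip_features] into trip_feature ---> binId
feature_to_bin = {}
for bin_id, bin_record in model.bins.items():
    for features in bin_record["feature_rows"]:
        feature_to_bin[tuple(features)] = int(bin_id)

# exact float matching works here because the features come from the same rows
labels = [feature_to_bin[tuple(f)] for f in trip_features]
loc_df.loc[:, f"{loc_type}_naive_clusters_{r}_m"] = labels
```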
In clustering_examples.ipynb, the data is loaded in Entry format and then converted to a df. This df is passed around the modules for computations and visualisations. On the other hand, trip_model's fit function in greedy_similarity_binning.py accepts a list of ecwc.Confirmedtrip type data.
Looks like we can connect these two by additionally passing around the Confirmedtrip type data (along with the df) inside the notebook, and then passing only the Confirmedtrip data to the fit function (for generating the bin labels). Once we get the bin labels, we follow the steps in the previous comment to add them to the respective trips in the data frame.
In the process of linking them, I found that in the load data cell, esta.TimeSeries.get_time_series is called, which creates a BuiltinTimeSeries object. Then get_data_df from BuiltinTimeSeries is called, which calls find_entries and to_data_df. to_data_df is responsible for converting Entry to df by calling _to_df_entry.
Once back from these calls in the load_data cell, we have the dataframe. We could additionally keep the Entry type data around as well, but filter_labeled_trips and expand_userinputs work specifically on the df in the notebook before passing it to tour_model_extended's similarity. Another way would be to convert the df to Confirmedtrip data right before the tour_model_extended similarity call (which will be replaced by trip_model's fit).
After searching for a while, I can conclude that there is no functionality implemented to convert a df to Confirmedtrip type. However, one way I can think of to fulfill our goal using already-implemented functionality would be to generate dummy entries using the create_fake_entry function in /emission/core/wrapper/entry.py and then replace the dummy entries' contents with our data.
This is still a hypothesis and needs to be tested.
To test this hypothesis, I can check the column names in the trips generated by generate_mock_trips in TestGreedySimilarityBinning.py (which is basically a unit test) and check whether I have the same columns in the dataframe in clustering.py. If they are not the same, I can also reference the columns during the Entry-to-df conversion and see if certain columns were dropped for some reason there.
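A quick sketch of that check (generate_mock_trips lives in emission/tests/modellingTests/modellingTestAssets.py; its arguments are elided here and should be copied from the unit test, and expanded_trip_df stands in for the dataframe used in clustering.py):

```python
import emission.tests.modellingTests.modellingTestAssets as etmm

mock_trips = etmm.generate_mock_trips(...)     # arguments as in the unit test
mock_cols = set(mock_trips[0]["data"].keys())  # columns the model expects
df_cols = set(expanded_trip_df.columns)        # columns the notebook df has
print("in mock trips but not in the df:", mock_cols - df_cols)
```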
@humbleOldSage one of the entries in get_data_df is the object id (stored as _id). So you should be able to, in the analysis code, keep the list of trip entries around and, once the filtering is done, filter the list of entries based on the _ids retained in the filtered df. That way, you can pass in real confirmed trip objects instead of fiddling around with conversions: confirmed_trip -> data_df -> fake confirmed trip.
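A minimal sketch of that suggestion, assuming entries is the list of confirmed trip Entry objects kept from loading, filtered_df is the dataframe after filter_labeled_trips / expand_userinputs, and Entry objects support dict-style access to _id:

```python
kept_ids = set(filtered_df["_id"])
filtered_entries = [e for e in entries if e["_id"] in kept_ids]

# real confirmed trip objects can now go straight to trip_model's fit
model.fit(filtered_entries)
```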
If we are going to use dataframes extensively, we can also consider changing trip_model to take in a dataframe instead of a list of confirmed trips. I think that Rob structured the module this way so that the feature extraction could be modular. But if we pass in a dataframe, then the feature extraction would be column selection and/or new column creation from existing columns, possibly by using apply.
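For example, the O-D feature extraction would reduce to a column selection, with derived features added via apply (a sketch; the column names are assumptions):

```python
import emission.core.common as ecc

features = trips_df[["start_lat", "start_lon", "end_lat", "end_lon"]].to_numpy()

# a derived feature via apply, e.g. straight-line O-D distance
trips_df["od_distance"] = trips_df.apply(
    lambda row: ecc.calDistance([row["start_lon"], row["start_lat"]],
                                [row["end_lon"], row["end_lat"]]),
    axis=1)
```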
If we do decide to move to df in trip_model, it would just be column extraction (the columns with start_lat, start_lon, end_lat, and end_lon). However, we'll have to rewrite fit, which includes calls to _apply_cutoff() (the current configs pass apply_cutoff = False, so it is not being used, but it is part of the fit function) and _assign_bins (which includes the feature extraction as well). We can keep this option open in case nothing else works out.
Using the object id post filtering seems like a good alternative to conversions. I'll pick this route for now and move forward.
Just want to clarify that:
- if we do move to df-based processing, how _assign_bins takes in trips becomes less important
- again, taking the object id post filtering is fine for now; just realize that you might need to polish it again downstream
Then I think it makes more sense to switch to df-based processing from now itself. I'll roll back the Trip changes I did on my end and take the data-frame way.
@shankari I might be wrong here, but does the following make sense?
From my understanding, if we take the df way, then from TRB_label_assist's POV this essentially means we can replace trip_model (of the main branch) with tour_model_extended (from hlu109's custom branch). This is because all the operations in the custom branch (even the data structures declared) already use data-frames. If data-frame friendly operations in trip_model are what we aim to achieve, then currently they live in tour_model_extended. There is no need to reimplement all of trip_model's functions in df-friendly form to make TRB_label_assist work; simply use tour_model_extended in place of trip_model.
However, it is ESSENTIAL to point out that my understanding above is from TRB_label_assist's POV. Meaning, the implementations in tour_model_extended will make TRB_label_assist work. But it might be possible that there are dependencies on trip_model from modules other than TRB_label_assist (which I am not aware of currently) that are not implemented in the custom branch's tour_model_extended. So, if we decide to replace trip_model with tour_model_extended, we'll have to port that remaining functionality from trip_model over to tour_model_extended - call this upgraded_tour_model_extended - so that anything that currently uses trip_model can use upgraded_tour_model_extended instead.
The task, I believe, then MIGHT translate to upgrading tour_model_extended to be an exact df version of trip_model: every function of trip_model implemented in df format.
A visual explanation (figures attached):
Thank you for the visual representation. I agree with the first two images, but I don't understand/don't agree with the third. What do you mean by an upgraded tour_model replacing the trip_model?
The interface of trip_model will take a dataframe, so I guess it is upgraded in that sense. But I don't think that it is a good idea to remove all the software engineering around trip_model and just replace it with the expanded tour model.
Again, having the label assist use trip_model instead of tour model was intended as an easy-to-implement first step to get you familiar with the codebase.
From the original project definition: https://github.com/e-mission/e-mission-eval-private-data/issues/35#issue-1819040402
After we are done with this, we should integrate the random forest model + prediction into the e-mission server code for label assist.
Trip model is structured so that it can support multiple algorithms in parallel in a modular fashion. The bin-based algorithm that is currently implemented in expanded tour model is just one of them. The goal is to just change that implementation to the expanded version, not replace trip_model entirely.
The figures generated after 97406c437f5fedb37abf96f65c04480c6c6f78b7 are different from those in the paper. Below are the screenshots: to the left are the ones from the commit I mentioned; to the right are the ones from the paper.
[Screenshot comparisons: suburban area at 50 m, 100 m, and 150 m; college campus]
In the case of the college campus, I am unable to determine which area of the map the screenshot (in the paper) is from. However, even if I just compare the old custom branch and the recent commit, they differ.
It is from near LA - between LA and Irvine.
I think one of the reasons why the labels are different is the format of the input to the ecc.calDistance function. The custom branch passes the points in [lon, lat] format, while the main branch uses [lat, lon] format for this. However, by itself, this doesn't seem to have solved the issue. I am still looking for what else could be causing it.
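To illustrate the mismatch (the values are made up):

```python
point = {"lat": 39.75, "lon": -105.22}
custom_branch_point = [point["lon"], point["lat"]]  # [lon, lat]
main_branch_point = [point["lat"], point["lon"]]    # [lat, lon]
# mixing the two orderings in a single ecc.calDistance call
# silently yields the wrong distance
```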
The other thing that was causing the issues is the similar function in the eamms.SimilarityMetric module, particularly line 40. It calculates similarity based on both origin and destination, when we might require origin-only, destination-only, or origin-and-destination points.
Looking at these two things together, it is just the second one that was causing the problem; the first one is correct in itself.
Once I make just the second change, the suburban results for r = 50, 100, and 150 match the paper. Let me check the college ones as well.
It works for the college campus as well.
In general, for this, it is fine to make the change in the master branch, because we don't have to use it in production, but we can if we want to. We should just make sure that it is configurable, so we can continue using O-D on production but use the other options for evaluation only for now.
In the timeseries, we did not want to make the change in master because there isn't only one set of entries we want to read, so the change was not modular enough. In this case, trip_model is set up to be modular, so we can just use that by adding new modules.
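Concretely, the new origin-only / destination-only metrics could then be selected through the existing model config; a sketch, where every metric name other than "od_similarity" is a hypothetical addition:

```python
model_config = {
    "metric": "origin_similarity",  # hypothetical new module; production keeps
                                    # "od_similarity" as its default
    "similarity_threshold_meters": r,
    "apply_cutoff": False,
    "incremental_evaluation": False
}
```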
With the changes to the fit function so far, there are dependencies in almost all the notebooks and files that need to be changed.
classification_performance.ipynb: depends on cv_for_all_algs from performance_eval, which calls the fit functions of ClusterExtrapolationClassifier, which calls RefactoredNaiveCluster, which depends on eamts.Similarity from the custom branch.
I've shifted this to eamtg.GreedySimilarityBinning (as done in clustering.py). In doing so, one way is to change the parameters of the fit function of all three - ClusterExtrapolationClassifier, ForestClassifier, and NaiveBinningClassifier - to receive the Entry type data corresponding to the dataframe from the notebook. But not all three need the Entry type data.
The other, and better, way is to check if model_ is an instance of ClusterExtrapolationClassifier and then call a different fit function for it, as sketched below.
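A sketch of that dispatch (the two-argument fit variant is the hypothetical part; the follow-up comment below pushes back on this pattern in favor of consistent interfaces):

```python
if isinstance(model_, ClusterExtrapolationClassifier):
    # this model additionally needs the Entry-type trips
    model_.fit(train_df, train_entries)
else:
    model_.fit(train_df)
```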
While running regenerate_classification_performance_results.py, the following messages occur while running the NaiveBinningClassifier model (named fixed-width (O-D)):
running cross validation for model: fixed-width (O-D)
2023-08-28 03:45:35,441 INFO ------ START: predictions for user 00db212b-c8d0-44cd-8392-41ab4065e603 and model <class 'models.NaiveBinningClassifier'>
2023-08-28 03:45:35,454 DEBUG num trips 724
2023-08-28 03:45:35,471 INFO ----- Building model <class 'models.NaiveBinningClassifier'> for fold 0
2023-08-28 03:45:35,472 INFO PERF: Initializing NaiveBinningClassifier
2023-08-28 03:45:35,512 INFO About to fit the model <class 'models.NaiveBinningClassifier'>
2023-08-28 03:45:35,512 INFO PERF: Fitting NaiveBinningClassifier
2023-08-28 03:45:43,605 INFO About to generate predictions for the model <class 'models.NaiveBinningClassifier'>
2023-08-28 03:45:43,605 INFO PERF: Predicting NaiveBinningClassifier
2023-08-28 03:45:43,669 DEBUG getting key model_type in config
2023-08-28 03:45:43,669 DEBUG getting key model_storage in config
2023-08-28 03:45:43,691 DEBUG no GREEDY_SIMILARITY_BINNING model found for user 00db212b-c8d0-44cd-8392-41ab4065e603
2023-08-28 03:45:43,691 DEBUG In predict_cluster_confidence_discounting: n=-1; returning as-is
.
.
.
.
.
2023-08-28 03:45:44,545 DEBUG getting key model_type in config
2023-08-28 03:45:44,545 DEBUG getting key model_storage in config
2023-08-28 03:45:44,549 DEBUG no GREEDY_SIMILARITY_BINNING model found for user 00db212b-c8d0-44cd-8392-41ab4065e603
2023-08-28 03:45:44,549 DEBUG In predict_cluster_confidence_discounting: n=-1; returning as-is
2023-08-28 03:45:44,549 DEBUG getting key model_type in config
2023-08-28 03:45:44,550 DEBUG getting key model_storage in config
2023-08-28 03:45:44,554 DEBUG no GREEDY_SIMILARITY_BINNING model found for user 00db212b-c8d0-44cd-8392-41ab4065e603
2023-08-28 03:45:44,554 DEBUG In predict_cluster_confidence_discounting: n=-1; returning as-is
2023-08-28 03:45:44,556 ERROR skipping user 00db212b-c8d0-44cd-8392-41ab4065e603 due to error: ValueError('attempt to get argmax of an empty sequence')
Traceback (most recent call last):
File "/Users/ssaini/Documents/GitHub/e-mission-eval-private-data/TRB_label_assist/performance_eval.py", line 243, in cv_for_all_users
min_samples=min_samples)
File "/Users/ssaini/Documents/GitHub/e-mission-eval-private-data/TRB_label_assist/performance_eval.py", line 179, in cross_val_predict
pred_df = model_.predict(test_trips)
File "/Users/ssaini/Documents/GitHub/e-mission-eval-private-data/TRB_label_assist/models.py", line 192, in predict
proba_df = self.predict_proba(test_df)
File "/Users/ssaini/Documents/GitHub/e-mission-eval-private-data/TRB_label_assist/models.py", line 742, in predict_proba
axis=1)
File "/Users/ssaini/miniconda-4.8.3/envs/emission-private-eval/lib/python3.7/site-packages/pandas/core/frame.py", line 8861, in idxmax
indices = nanops.nanargmax(self.values, axis=axis, skipna=skipna)
File "/Users/ssaini/miniconda-4.8.3/envs/emission-private-eval/lib/python3.7/site-packages/pandas/core/nanops.py", line 71, in _f
return f(*args, **kwargs)
File "/Users/ssaini/miniconda-4.8.3/envs/emission-private-eval/lib/python3.7/site-packages/pandas/core/nanops.py", line 924, in nanargmax
result = values.argmax(axis)
ValueError: attempt to get argmax of an empty sequence
On investigating, it seems the model was fit and saved correctly. predict_proba is called in NaiveBinningClassifier, which has a call to predict_cluster_confidence_discounting in emission.analysis.classification.inference.labels.inferrers. Inside this is a call to predict_labels_with_n, which initiates the GreedySimilarityBinning call. However, a different model, similarity.similarity from tour_model, was used for fitting the data and was saved. This seems to be causing the issue.
To tackle this, we can use GreedySimilarityBinning from trip_model (with a fixed radius) for fitting, and save that model. Then the load would work similarly.
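A hedged sketch of doing that through the trip_model entry points (the module aliases follow e-mission-server conventions; the exact update_trip_model signature may differ):

```python
import emission.analysis.modelling.trip_model.model_type as eamumt
import emission.analysis.modelling.trip_model.model_storage as eamums
import emission.analysis.modelling.trip_model.run_model as eamur

# fit and save a GREEDY_SIMILARITY_BINNING model for this user, so that
# predict_cluster_confidence_discounting can later find and load it
eamur.update_trip_model(user_id,
                        model_type=eamumt.ModelType.GREEDY_SIMILARITY_BINNING,
                        model_storage=eamums.ModelStorage.DOCUMENT_DATABASE)
```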
The other, and better, way is to check if model_ is an instance of ClusterExtrapolationClassifier and then call a different fit function for it.
I am not sure that this is better. Your instinct is to have if/else everywhere; that is not necessarily always correct.
Consistent interfaces are sometimes better than avoiding new code (but not always).
To tackle this, we can use GreedySimilarityBinning from trip_model (with a fixed radius) for fitting, and save that model. Then the load would work similarly.
So have you actually done this yet?
Yeah, this is already done in commit a34836faf364f9548d3ab5373c8efd0d1e729a53.
Before the commit, the regenerate_classification_performance_results.py file wouldn't run and threw an error for the reason mentioned above. It is now working and has generated the cross-validation CSVs.
Let me run classification_performance.ipynb again and post the performance graphs here.
The plots below are from the paper and from the latest run of the code. They match.
Accuracy and weighted F-score for each model:
[Plot from the paper]
[Plot from classification_performance.ipynb after the changes]
Weighted F-score for purpose prediction:
[Plot from the paper]
[Plot from classification_performance.ipynb after the changes]
The code is in: https://github.com/e-mission/e-mission-eval-private-data/tree/master/TRB_label_assist. It works, but it requires us to use a custom fork of the e-mission-server repo. We should see the changes between the custom fork and the current e-mission-server master/gis branch, see if the changes are still necessary or whether there are implementations in e-mission-server that have superseded them, and, if they have not been superseded, incorporate them in e-mission-server.
So at the end, this should be reproducible against a core e-mission-server repo.
Stretch goal: change this repo to work with docker containers that are built with the base e-mission-server container, so that people don't have to install e-mission-server, set the EMISSION_SERVER_PATH, and do all the funky setup steps.
After we are done with this, we should integrate the random forest model + prediction into the e-mission-server code for label assist.