Incremental model building for label inference models should be supported

e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.

https://e-mission.readthedocs.io/en/latest

BSD 3-Clause "New" or "Revised" License

15 stars 32 forks source link

Incremental model building for label inference models should be supported #709

Open robfitzgerald opened 2 years ago

robfitzgerald commented 2 years ago

In OpenPATH, a daily background analysis task fires off which re-runs a clustering model. For each user, the entire history of recorded labeled/unlabeled data is collected. A clustering model is trained on the entire labeled dataset. The new trained model replaces any existing model and is used for label inference by e-mission-server.

This works for e-mission-server instances with few user participants or short-term data collection scenarios, but does not scale well for scenarios where the opposite is true. to address this, label inference could be refactored to support incremental clustering models.

The model building pipeline for e-mission-server should be modified to maintain state. When activated, this pipeline will pull all data that has arrived since the last update (recorded via a PipelineState). It will load a stored incremental model for this user, and then it will perform model training with only the new data. The PipelineState will be updated to the latest-observed time and the revised incremental model will be stored.

This issue describes building the mechanism for incremental clustering. Implementing actual incremental clustering models will be left for future work.

robfitzgerald commented 2 years ago

see build_label_model.py to see the main method that is executed to start the inference task.
see this commit that is an example of adding a new step in the analysis pipeline
the PipelineState class

shankari commented 2 years ago

@robfitzgerald sounds good overall.

Couple of caveats:

this is the model building pipeline: There is a separate analysis pipeline which reads the model, performs the inferences and associates the inference results with the trips. The analysis pipeline runs every hour. The model building pipeline runs every day. This work is not anticipated to significantly change the analysis pipeline.
- The only potential change might be to the method for reading the model, if we choose to change the model data structure for better performance.
- Concretely "To support this, a new analysis pipeline should be developed for e-mission-server. " -> "The model building pipeline for e-mission-server should be modified to maintain state"
the sample commit adds a new step to the existing analysis pipeline: you cannot directly copy and paste it, but you can use it as a example of how to modify the model building pipeline to maintain state. As we have discussed, the model building pipeline currently does not maintain any state at all.
- Concretely "see this commit that is an example of adding a new analysis pipeline" -> "see this commit that is an example of adding a new step in the analysis pipeline"

Also, not sure why you think we should create a new model building pipeline instead of modifying the existing one to maintain state...

robfitzgerald commented 2 years ago

This work is not anticipated to significantly change the analysis pipeline

my assumption as well.

not sure why you think we should create a new model building pipeline instead of modifying the existing one to maintain state

you know, that really comes down to what you want. i came in with the assumption that we were creating a new pipeline, so that backwards compatibility could be maintained, and the code could be accepted without requiring that the incremental inference models are also completed at the same time, since those are being developed elsewhere on a separate schedule. but i trust your guidance on what is the right scope for this task.

shankari commented 2 years ago

i came in with the assumption that we were creating a new pipeline, so that backwards compatibility could be maintained, and the code could be accepted without requiring that the incremental inference models are also completed at the same time

@robfitzgerald Let me clarify further. There are two pipelines - the model building pipeline (which runs once a day) and the analysis pipeline (which runs every hour).

The existing model building pipeline is single-stage and is launched from build_model.py. It builds individual cluster-based models and saves them to disk to files named user_labels_first_round_<user_id>. It is not incremental, it reads all the trips for a particular user on every run.
The existing analysis pipeline is multi-stage and is launched from bin/intake_multiprocess.py. As one of the stages, it reads the saved model and infers the user labels for newly arrived trips.

The goal here is to change the existing model building pipeline to be incremental. If the modified model building pipeline writes the same data structure to the same file, then backwards compatibility is maintained.

If the data structure is changed for better performance, then only the step in the analysis pipeline which reads the stored model needs to be changed to match.

Whether the modified model building pipeline can retain the same data structure depends on the new incremental clustering algorithm. If the new clustering algorithm needs a new data structure, I expect that you will change it, and change the load step of the analysis pipeline to match.

I am not sure what you mean by "incremental inference models are also completed at the same time, since those are being developed elsewhere on a separate schedule". The incremental inference models are complete and are in production and have been since Summer 2021.

robfitzgerald commented 2 years ago

I am not sure what you mean by "incremental inference models are also completed at the same time..."

thanks for taking the time to disambiguate. i was incorrectly using the term "inference model" to refer to both phases.

correcting myself, i thought you planned to have someone this summer implement clustering algorithms which could take advantage of this incremental model building pipeline. taking both separately:

clustering algorithms

my understanding was that this ticket was only to write the model building process; that the integration/implementation of incremental clustering algorithms themselves was left to future work from someone else this summer.

model building

my question around creating a new pipeline comes from the structure of the problem. whereas model building for a traditional clustering algorithm can be thought of as a function

def build_model(data: Iterable[Observation]) -> ClusteringModel

the incremental model building phase has this shape (and execution has the pseudocode):

def build_model(data: Iterable[Observation], previous_model: ClusteringModel) -> Tuple[ClusteringModel, DateTime]

# ...

for agent in agents:
  data = read_data(agent)
  model = read_model(agent)
  updated_model, time_state = build_model(data, model)
  write_model(updated_model)
  update_pipeline_state(agent, time_state)

the conclusion that we could want a new pipeline was based on these changes being a breaking API change to the model building pipeline.

but, maybe they're not. maybe we could retrofit the old clustering algorithms to this signature by simply ignore the previous_model and then return something like datetime.fromtimestamp(0) for a "time_state" value.

shankari commented 2 years ago

@robfitzgerald Let me start with the API question since it is easier. There is no external API that involves the model building code. It is invoked only from build_model.py, which is run every hour using a cronjob. So there are no external dependencies and no need to maintain backward compatibility. As you change the interface to build_model, you can just make changes to the calling interface in build_model as well.

I think that the remaining confusion is around the various stages of the modeling process and their nomenclature, primarily because the model building process uses clustering. I'd like to propose the following terminology to reduce confusion:

start_end clustering algorithm: the existing clustering algorithm which clusters based on spatial distance of the start and end points only. This is the one that you will be converting to incremental.
enhanced labeling algorithm: the work that Hannah will do over the summer to incorporate additional features for the model. I think I was confused because this may not necessarily involve clustering - an option is to use a two-level model which clusters based on start and end points and then uses a classification algorithm to split further. We could even use multiple classification algorithms, one for each label.
model application: the part of the analysis pipeline that matches incoming trips against the model

I think we are in agreement here, but just for the record:

start_end clustering algorithm scalability: Rob
model enhancements: Hannah
model application: untouched

Makes sense?

robfitzgerald commented 2 years ago

As you change the interface to build_model, you can just make changes to the calling interface in build_model as well

ok! great, i'll bring my gardening shears ✂️

robfitzgerald commented 2 years ago

@shankari i did a more extensive code review to understand all of the modules that relate to the model building aspect. here's my updated take on understanding this task:

refactor the clustering model code to use an abstraction
- the abstraction should give us methods for load, save, fit, and get_label
  - load/save: should follow the same file writing logic as the current model, but should not impose a strict notion of the file type/structure in order to support arbitrary model types
  - get_label: the "bins" of the similarity class leak out into the surrounding code, but, a greedy similarity search through bins is not likely to work with a true incremental clustering approach, so the get_label method(s) should replace any code that directly interacts with bins outside of the similarity class
refactor similarity.py and tour_model_first_only to use the abstraction
- rename things that are confusing and remove deprecated code like the second_round stuff (or label it as deprecated)
separately write the user_labels and the model_state to file
- for the similarity.py model, these will be the same, but, for future model development, they can (should) differ
  - this is the essence of the incremental approach: it should not store and retrieve the entire history, but, only a latent representation of that data
  - leave the development of a "real" incremental model to future work
only request the latest data when building the model
- pipeline state update tracking latest trips used in model building

robfitzgerald commented 2 years ago

also, sorry it's taken me so long to get started on this. i have only logged 2 task hours on this work, so i'm not poaching i swear 😼 . i mis-estimated my time requirements for another project i started a month ago, which i expected would take a week. as of today, i'm finally delivering those results and should be moving on shortly. that should leave just MEP and OpenPATH on my plate for the next month or two, for the most part.

robfitzgerald commented 2 years ago

@shankari

question about a generic algorithm signature/type. i'm reviewing the spot where prediction occurs. in pipeline.py we have the dict of algorithms that's effectively the API/export module for OpenPATH to invoke label classification algorithms. in inferrers.py we have our only implementation, with the signature:

def predict_cluster_confidence_discounting(trip, max_confidence=None, first_confidence=None, confidence_multiplier=None)

while considering the abstraction for future algorithms, i noticed that the confidence stuff is basically unused, but, i'm guessing there's an implication that it could be used. so, wondering which signature you prefer:

# keep the coefficient parameters
PredictionFunction1 = Callable[[ConfirmedTrip, Optional[float], Optional[float], Optional[float]], List[Prediction]]

# assume the parameters are set globally, not at the call site
# (as the coefficients are effectively static in the current codebase)
PredictionFunction2 = Callable[[ConfirmedTrip], List[Prediction]]

where a prediction is one of these dudes:

class Prediction(TypedDict):
  labels: Dict[str, str]
  p: float

i'm thinking #2.

shankari commented 2 years ago

@robfitzgerald I am not sure why you would need to change the inferrers (that apply the clustering model) at this point. As I indicated earlier, we do not really need to maintain backwards compatibility. I'd assumed that you will need to make all changes only in emission/analysis/modelling/tour_model_first_only

The only line outside of emission/analysis/modelling/tour_model_first_only that you might have to change is in emission/analysis/classification//inference/labels/inferrers.py

143     labels, n = lp.predict_labels_with_n(trip)

In other words, I don't think we should refactor the inferrer at all at this point. We should focus only on the interface to the current modeling code.

We overengineered the label classification algorithm creation a bit - since we don't currently have other algorithms to experiment with and our current algorithm needs significant improvement. Don't want to spend more time and energy polishing the interface instead of fixing the one implementation that we do have.

robfitzgerald commented 2 years ago

I mentioned it earlier, there's a lot of implementation details that spill out from the clustering algorithm that is implemented. In particular, "getting" a label assumes all points exist in a model and that we can find a best fitting label by applying a greedy nearest neighbor heuristic. Those assumptions will not hold for any non-trivial online clustering algorithm where the model is some approximation of all of that information.

So, in order to support both the current model and future models, I was suggesting that we could build a suitable abstraction over the current solution which could be implemented by future models. That impacts both building and inference phases.

On Wed, Apr 6, 2022, 6:49 PM shankari @.***> wrote:

@robfitzgerald https://github.com/robfitzgerald I am not sure why you would need to change the inferrers (that apply the clustering model) at this point. As I indicated earlier, we do not really need to maintain backwards compatibility. I'd assumed that you will need to make all changes only in emission/analysis/modelling/tour_model_first_only

The only line outside of emission/analysis/modelling/tour_model_first_only that you might have to change is in emission/analysis/classification//inference/labels/inferrers.py

125 # Non-placeholder implementation. First bins the trips, and then clusters every bin 126 # See emission.analysis.modelling.tour_model for more details 127 # Assumes that pre-built models are stored in working directory 128 # Models are built using evaluation_pipeline.py and build_save_model.py 129 # This algorithm is now DEPRECATED in favor of predict_cluster_confidence_discounting (see https://github. com/e-mission/e-mission-docs/issues/663) 130 def predict_two_stage_bin_cluster(trip): 131 return lp.predict_labels(trip) 132

— Reply to this email directly, view it on GitHub https://github.com/e-mission/e-mission-docs/issues/709#issuecomment-1090967223, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVNXDVRYHYJC4EYOJIUFN3VDYWKPANCNFSM5QZJZGGA . You are receiving this because you were mentioned.Message ID: @.***>

shankari commented 2 years ago

I'm confused, the greedy nearest neighbour matching is at emission/analysis/modelling/tour_model_first_only/load_predict.py

 97     logging.debug(f"At stage: first round prediction")
 98     pred_bin = find_bin(trip, bin_locations, RADIUS)
 99     logging.debug(f"At stage: matched with bin {pred_bin}")

Essentially the lp.predict_labels_with_n(trip) method just returns a list of (labels, trip_count_with_label) tuples.

So for home -> work with a 50:50 mode split between bus and driving alone, this might be:

[{labels: {"mode": bus", "purpose": work}, count: 10}, {labels: {"mode": drove_alone", "purpose": work}, count: 10}]

The implementation details of NN and bins are hidden from the inferrer.

It would be great to refactor the code within emission/analysis/modelling/tour_model_first_only so it is not so bin-centric, but I still don't see why the interfaces in the inferrer need to be changed.

Are you saying that the online clustering algorithms cannot generate the same list of (labels, trip_count_with_label) tuples? Can you clarify a bit further?

robfitzgerald commented 2 years ago

I'll have to revisit tomorrow (on my phone right now) but, from what I remember, things like how "bin locations" is an argument to "find bins", these are all implementation details for a specific clustering algorithm that performs a greedy heuristic search in the bins for a single match. For an online clustering algorithm that doesn't "repeat itself" (doesn't need to re-read the entire agent history every time), it wouldn't have the same operation on the same data structure. So, if we leave it like it is, and someone wants to implement an alternative that is a peer to the current technique, they will first need to do what I'm proposing to do now. It seems like the correct level of abstraction to bundle them, since certain prediction algorithms will only work with certain model implementations. Like, instead of a 2d array of Confirmed trips, another might use a 2d array of floats for the feature space and a vector of integer labels. Another more likely case for this situation is 2d array of floats and a projection mapping from degrees lat lon into that latent space. Another may be a decision tree.

As for the prediction output, I do think a list of predictions is generic enough. I propose keeping those just the way they are in my comment above.

On Wed, Apr 6, 2022, 7:53 PM shankari @.***> wrote:

I'm confused, the greedy nearest neighbour matching is at emission/analysis/modelling/tour_model_first_only/load_predict.py

97 logging.debug(f"At stage: first round prediction") 98 pred_bin = find_bin(trip, bin_locations, RADIUS) 99 logging.debug(f"At stage: matched with bin {pred_bin}")

Essentially the lp.predict_labels_with_n(trip) method just returns a list of (labels, trip_count_with_label) tuples.

So for home -> work, this might be:

[{labels: {"mode": bus", "purpose": work}, count: 10}, {labels: {"mode": drove_alone", "purpose": work}, count: 10}]

The implementation details of NN and bins are hidden from the inferrer.

It would be great to refactor the code within emission/analysis/modelling/tour_model_first_only so it is not so bin-centric, but I still don't see why the interfaces in the inferrer need to be changed.

Are you saying that the online clustering algorithms cannot generate the same list of (labels, trip_count_with_label) tuples?

— Reply to this email directly, view it on GitHub https://github.com/e-mission/e-mission-docs/issues/709#issuecomment-1090995327, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVNXDQ4W3SP2BWD7BRDQY3VDY5YFANCNFSM5QZJZGGA . You are receiving this because you were mentioned.Message ID: @.***>

shankari commented 2 years ago

things like how "bin locations" is an argument to "find bins", these are all implementation details for a specific clustering algorithm that performs a greedy heuristic search in the bins for a single match.

All of those are within emission/analysis/modelling/tour_model_first_only

As for the prediction output, I do think a list of predictions is generic enough. I propose keeping those just the way they are in my comment above.

the predictor (load_predict.py still within emission/analysis/modelling/tour_model_first_only) returns exactly this data format. The inferrer just inserts this (with some additional sophistication) into the trip.

shankari commented 2 years ago

I thought about this some more last night, and if you want to use the existing mechanism to support both the full algorithm and the incremental algorithm in parallel, I would not be opposed. Might make the evaluation easier as well.

The point that I am trying to make is that the interface that you need to clean up is the one in emission/analysis/modelling/tour_model_first_only.

The labeling algorithm alternatives in emission/analysis/classification/inference/labels/pipeline.py

primary_algorithms = {
    ecwl.AlgorithmTypes.CONFIDENCE_DISCOUNTED_CLUSTER: eacili.predict_cluster_confidence_discounting
}

are designed for future support for ensemble methods and are actually pretty well-thought-out.

The code that is a mismatch of hacked together data structures is in emission/analysis/modelling/tour_model_first_only but when I integrated it, I isolated the hackiness and ensured that it didn't spill over outside.

Let me know if you find anything that did spill over and I will fix it.

I would not waste time optimizing anything outside emission/analysis/modelling/tour_model_first_only at this point.

shankari commented 2 years ago

In https://github.com/e-mission/e-mission-server/pull/852, @robfitzgerald refactored the current code into a separate trip_model that is more configurable and extensible and includes unit tests.

Pending changes are:

[x] store the resulting documents in a separate database instead of the analysis database. The analysis database is not really designed for storing models. Unlike the user inputs, which are read-only from an analysis perspective, or analysis results, which are write once (except for confirmed trips), the models are updated frequently, every time they are re-calibrated.
[ ] potentially move the model state from the pipeline since it is not part of the intake pipeline; it has its own pipeline
[ ] purge obsolete entries from the model DB so we don't grow unboundedly
[ ] write unit tests for the new model DB
[ ] move to new label assist model
[ ] build label assist models incrementally

The related comments from https://github.com/e-mission/e-mission-server/pull/852 are:

shankari commented 2 years ago

Let's get the first part implemented so that we can deploy it and meet our commitments. https://github.com/e-mission/e-mission-server/pull/874

shankari commented 2 years ago

Testing on staging:

Before run

>>> edb.get_model_db().count_documents({})
0

Run logs

2022-08-12 07:18:50,847:DEBUG:no GREEDY_SIMILARITY_BINNING model found for user 9c084ef4-2f97-4196-bd37-950c17938ec6
2022-08-12 07:18:50,848:DEBUG:building first GREEDY_SIMILARITY_BINNING user label model for user 9c084ef4-2f97-4196-bd37-950c17938ec6
2022-08-12 07:18:50,932:INFO:For stage PipelineStages.TRIP_MODEL, start_ts is None
...
2022-08-12 07:18:51,098:DEBUG:model type GREEDY_SIMILARITY_BINNING is incremental? False
2022-08-12 07:18:51,098:DEBUG:time query for training data collection: None
2022-08-12 07:18:56,742:DEBUG:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), '$or': [{'metadata.key': 'analysis/confirmed_trip'}]}, sort_key = metadata.write_ts
2022-08-12 07:18:56,743:DEBUG:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/confirmed_trip']
2022-08-12 07:18:56,743:DEBUG:finished querying values for [], count = 0
2022-08-12 07:18:56,832:DEBUG:finished querying values for ['analysis/confirmed_trip'], count = 23
2022-08-12 07:18:56,832:DEBUG:orig_ts_db_matches = 0, analysis_ts_db_matches = 23
found 23 training rows
2022-08-12 07:18:57,157:DEBUG:found 8 labeled trips for user 9c084ef4-2f97-4196-bd37-950c17938ec6
2022-08-12 07:18:57,157:DEBUG:Total: 8, labeled: 8, user 9c084ef4-2f97-4196-bd37-950c17938ec6 doesn't have enough valid trips for further analysis.

shankari commented 2 years ago

After labeling more trips:

Run logs

2022-08-12 07:39:49,430:DEBUG:no GREEDY_SIMILARITY_BINNING model found for user 9c084ef4-2f97-4196-bd37-950c17938ec6
2022-08-12 07:39:49,430:DEBUG:building first GREEDY_SIMILARITY_BINNING user label model for user 9c084ef4-2f97-4196-bd37-950c17938ec6
...
2022-08-12 07:39:49,527:INFO:For stage PipelineStages.TRIP_MODEL, start_ts is None
2022-08-12 07:39:49,744:DEBUG:model type GREEDY_SIMILARITY_BINNING is incremental? False
2022-08-12 07:39:49,744:DEBUG:time query for training data collection: None
2022-08-12 07:39:55,274:DEBUG:curr_query = {'invalid': {'$exists': False}, 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), '$or': [{'metadata.key': 'analysis/confirmed_trip'}]}, sort_key = metadata.write_ts
2022-08-12 07:39:55,274:DEBUG:orig_ts_db_keys = [], analysis_ts_db_keys = ['analysis/confirmed_trip']
2022-08-12 07:39:55,274:DEBUG:finished querying values for [], count = 0
2022-08-12 07:39:55,353:DEBUG:finished querying values for ['analysis/confirmed_trip'], count = 23
2022-08-12 07:39:55,353:DEBUG:orig_ts_db_matches = 0, analysis_ts_db_matches = 23
found 23 training rows
2022-08-12 07:39:55,584:DEBUG:found 23 labeled trips for user 9c084ef4-2f97-4196-bd37-950c17938ec6
2022-08-12 07:39:55,584:DEBUG:fit called with 23 trips
2022-08-12 07:39:55,584:DEBUG:_assign_bins called with 23 trips
2022-08-12 07:39:55,687:INFO:greedy binning model fit to 23 rows of trip data
2022-08-12 07:39:55,688:DEBUG:upsert_doc called with key inference/trip_model
2022-08-12 07:39:55,700:DEBUG:Inserting entry Entry({'_id': ObjectId('62f6663b22f44988e208f2dc'), 'user_id': UUID('9c084ef4-2f97-4196-bd37-950c17938ec6'), 'metadata': Metadata({'key': 'inference/trip_model', 'platform': 'server', 'write_ts': 1660315195.699139, 'time_zone': 'America/Los_Angeles', 'write_local_dt': LocalDate({'year': 2022, 'month': 8, 'day': 12, 'hour': 7, 'minute': 39, 'second': 55, 'weekday': 4, 'timezone': 'America/Los_Angeles'}), 'write_fmt_time': '2022-08-12T07:39:55.699139-07:00'}), 'data': Tripmodel({'model_ts': 1658361831.0, 'model_type': 'GREEDY', 'model': {'0': {'feature_rows': [...], '1': {'feature_rows': [...], '2': {'feature_rows': [...], '3': {'feature_rows': [...]}...}) into model DB

After run

>>> edb.get_model_db().count_documents({})
1

shankari commented 2 years ago

Took a trip after this, and got a predicted label on the NREL staging environment.