awslabs / mlmax

Example templates for the delivery of custom ML solutions to production so you can get started quickly without having to make too many design choices.
https://mlmax.readthedocs.io/en/latest/
Apache License 2.0

Added data monitor feature #86

Closed yapweiyih closed 2 years ago

yapweiyih commented 3 years ago

Issue #, if available:

Description of changes: This enhancement adds data monitoring support to compute baseline training statistics and to calculate the Population Stability Index (PSI) for new inference data. The job runs in a custom container, which can be customised as needed.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

josiahdavis commented 3 years ago

Thanks for the PR @yapweiyih! This fills a needed gap, and I totally agree with having it as a stand-alone module. Looks like a great start too. A couple of high-level questions:

  1. Can you clarify how this would be used in a Training/Inference pipeline conceptually? e.g., would the monitoring be a step in the Step Function that comes before the preprocessing step in the Training or Inference Pipeline?

  2. How would this code be used from a practical perspective? e.g., would it require manual copy/paste into `{training,inference}_pipeline_create.py`?

  3. I just noticed in monitoring.py https://github.com/awslabs/mlmax/blob/7ba4a036a611657e30c60fe09fa7efabe40a3aeb/src/mlmax/monitoring.py#L271 that there is an if statement for args.mode == train but not for args.mode == infer. Are you planning to add one in for infer as well?

  4. Can you explain the core logic of how you are using the PSI calculation in this example? In the README you mention that PSI is for inference data; however, I don't understand the relevance of calculating the PSI on a random train/test split at inference time, as you seem to be doing here. Sorry, I'm probably missing something obvious.
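For context on question 3, a symmetric dispatch on `args.mode` might look like the sketch below. This is a minimal illustration, not the actual `monitoring.py` API: the `run_*` functions and their return values are hypothetical stand-ins for the real baseline/PSI logic.

```python
import argparse


def run_train_monitoring():
    # Hypothetical: fit baseline statistics and compute PSI on the train/test split.
    return "train-baseline"


def run_infer_monitoring():
    # Hypothetical: compare new inference data against the stored train baseline.
    return "infer-psi"


def dispatch(mode: str) -> str:
    """Route to the monitoring routine for the requested mode."""
    if mode == "train":
        return run_train_monitoring()
    elif mode == "infer":
        return run_infer_monitoring()
    raise ValueError(f"unknown mode: {mode}")


def main(argv=None) -> str:
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["train", "infer"], required=True)
    args = parser.parse_args(argv)
    return dispatch(args.mode)
```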

yapweiyih commented 3 years ago
  1. It will come before both training and inference. Training calculates the baseline statistics and the PSI between the train/test data (which should give a low PSI score). Inference can then calculate the PSI between the new inference data and the train data (which should also be low; otherwise the data drift should be investigated further).

  2. Yes, some light copy and paste into the train/inference pipelines would be needed, if required.

  3. Currently only train mode is implemented, which calculates the baseline statistics and the PSI between train/test. An infer mode can be added to calculate the PSI between the new inference data (once it is clear where the new data is located) and the train data.

  4. See points 1 and 3.
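To make the train/infer logic in points 1 and 3 concrete, here is a hedged sketch of the standard PSI computation, PSI = Σ (pᵢ − qᵢ) · ln(pᵢ / qᵢ) over shared histogram bins. The function name and binning choices are illustrative, not the implementation in this PR:

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample (e.g. train) and a new sample (e.g. infer).

    Bin edges come from the reference distribution; values of `actual`
    outside those edges are dropped, which is acceptable for a sketch.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    p, _ = np.histogram(expected, bins=edges)
    q, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions; add a small epsilon to avoid log(0).
    eps = 1e-6
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum((p - q) * np.log(p / q)))


rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
test = rng.normal(0.0, 1.0, 10_000)      # same distribution -> low PSI
drifted = rng.normal(0.5, 1.0, 10_000)   # shifted mean -> higher PSI

print(population_stability_index(train, test))
print(population_stability_index(train, drifted))
```

A common rule of thumb is that PSI below ~0.1 indicates little drift, while larger values warrant investigation, which matches the "should be low, else look further into data drift" reasoning above.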

josiahdavis commented 3 years ago

Thanks, helpful explanations. One follow-up:

> Currently only train is done to calculate baseline statistic, and PSI between train/test. infer mode can be added to calculate between new infer data (once it is clear where new data is located) and train data.

Can you explain where/how users specify the location of the reference data for the PSI calculation, so that it will work in inference mode? I think this is a vital part of your solution. Thanks!

yapweiyih commented 3 years ago

@josiahdavis Added a new parameter `InferSource` in the config to indicate the new inference data location. Also added the parameters `MonitorS3Bucket` and `MonitorS3Prefix` to store the baseline and inference scores for easy tracking.
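To illustrate how those three parameters might fit together, here is a rough sketch. The key names come from the comment above, but the config structure, example values, and the helper function are assumptions, not the actual mlmax config layout:

```python
# Hypothetical config values using the parameter names from this PR;
# the real mlmax config file layout may differ.
config = {
    "InferSource": "s3://my-bucket/inference/new-data.csv",  # new inference data
    "MonitorS3Bucket": "my-monitor-bucket",                  # bucket for monitor outputs
    "MonitorS3Prefix": "monitoring/psi",                     # key prefix for easy tracking
}


def monitor_output_uri(config: dict, run_id: str) -> str:
    """Build the S3 URI where a run's baseline/PSI scores would be written."""
    return f"s3://{config['MonitorS3Bucket']}/{config['MonitorS3Prefix']}/{run_id}.json"


print(monitor_output_uri(config, "2021-06-01"))
# -> s3://my-monitor-bucket/monitoring/psi/2021-06-01.json
```

Keeping the bucket and prefix separate from `InferSource` lets the monitoring step write its scores to a stable location regardless of where each batch of inference data lands.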