yapweiyih closed this 2 years ago
Thanks for the PR @yapweiyih! This fills a real gap, and I totally agree with having it as a stand-alone module. It looks like a great start too. A couple of high-level questions:
1. Can you clarify how this would be used in a Training/Inference pipeline conceptually? E.g., would the monitoring be a step in the StepFunction that comes before the preprocessing step in the Training or Inference Pipeline?
2. How would this code be used from a practical perspective? E.g., would it require manual copy/paste into {training,inference}_pipeline_create.py?
3. I just noticed in monitoring.py (https://github.com/awslabs/mlmax/blob/7ba4a036a611657e30c60fe09fa7efabe40a3aeb/src/mlmax/monitoring.py#L271) that there is an if statement for args.mode == train but not for args.mode == infer. Are you planning to add one for infer as well?
4. Can you explain the core logic of how you are using the PSI calculation in this example? In the README you mention that PSI is for inference data; however, I don't understand the relevance of calculating the PSI on a random train/test split at inference time, as it seems you are doing here. Sorry, I'm probably missing something obvious.
1. It will come before both training and inference. Training calculates the baseline statistics and the PSI between train and test data (which should be low). Inference can then calculate the PSI between new inference data and the train data (which should also be low; if it is not, look further into data drift).
2. Yes, it would need some slight copy and paste into the train/inference pipeline if required.
3. Currently only train mode is implemented, to calculate the baseline statistics and the PSI between train and test. An infer mode can be added to calculate the PSI between new inference data (once it is clear where the new data is located) and the train data.
4. As per points 1 and 3.
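For readers unfamiliar with PSI, the comparison described in point 1 can be sketched as follows. This is a minimal, standard-library illustration and not the code in monitoring.py: the function names (`quantile_edges`, `bin_fractions`, `psi`), the quantile binning, and the `eps` floor for empty bins are all assumptions made for this example.

```python
import math
from bisect import bisect_right


def quantile_edges(reference, bins=10):
    """Interior bin edges at equally spaced quantiles of the reference sample."""
    ordered = sorted(reference)
    n = len(ordered)
    edges = {ordered[min(n - 1, (i * n) // bins)] for i in range(1, bins)}
    return sorted(edges)


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values in each bin, floored at eps so the log stays finite."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]


def psi(reference, new, bins=10):
    """Population Stability Index of `new` relative to `reference`.

    Both samples are binned with edges derived from the reference only,
    then PSI = sum((new_i - ref_i) * ln(new_i / ref_i)) over the bins.
    """
    edges = quantile_edges(reference, bins)
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(new, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Train mode (hypothetical data): PSI between train and test should be low.
train = list(range(1000))
test = list(range(0, 1000, 2))
print(psi(train, test))

# Infer mode (hypothetical data): a shifted sample yields a much larger PSI.
shifted = [v + 500 for v in train]
print(psi(train, shifted))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but suitable thresholds depend on the feature and the use case.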
Thanks, helpful explanations. One follow-up:
Currently only train is done to calculate baseline statistic, and PSI between train/test. infer mode can be added to calculate between new infer data (once it is clear where new data is located) and train data.
Can you explain to users where/how you specify the location of the reference data for the PSI calculation so that it will work for the inference mode? I think this is a vital part of your solution. Thanks!
@josiahdavis Added a new parameter InferSource in the config to indicate the location of the new inference data. Also added parameters MonitorS3Bucket and MonitorS3Prefix to store the baseline and inference scores for easy tracking.
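As a rough sketch of how these three parameters could drive the two modes: the helper names (`psi_inputs`, `monitor_output_uri`), the dictionary-style config, the example values, and the `baseline`/`inference` key layout below are hypothetical and are not taken from the PR.

```python
def psi_inputs(config, mode, train_uri, test_uri):
    """Return the (reference, comparison) data locations for a monitoring run.

    train mode compares train against test; infer mode compares train
    against the new data found at InferSource.
    """
    if mode == "train":
        return train_uri, test_uri
    if mode == "infer":
        return train_uri, config["InferSource"]
    raise ValueError(f"unknown mode: {mode!r}")


def monitor_output_uri(config, mode):
    """Build an S3 URI for a run's output (hypothetical key layout)."""
    if mode not in ("train", "infer"):
        raise ValueError(f"unknown mode: {mode!r}")
    prefix = config["MonitorS3Prefix"].strip("/")
    name = "baseline" if mode == "train" else "inference"
    return f"s3://{config['MonitorS3Bucket']}/{prefix}/{name}"


# Example values are placeholders, not real buckets or paths.
config = {
    "InferSource": "s3://example-bucket/new-data/",
    "MonitorS3Bucket": "example-monitor-bucket",
    "MonitorS3Prefix": "monitoring/",
}
print(psi_inputs(config, "infer", "s3://example-bucket/train/", "s3://example-bucket/test/"))
print(monitor_output_uri(config, "train"))
```

Keeping the baseline and inference outputs under one bucket and prefix is what makes the scores easy to track side by side.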
Issue #, if available:
Description of changes: This enhancement adds data monitoring support: it computes baseline statistics on the training data and calculates the Population Stability Index (PSI) for new inference data. The job runs on a custom container, which can be customised as needed.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.