
Expose Model Predictions from `diskprediction_local` module #13

Open chauhankaranraj opened 4 years ago

chauhankaranraj commented 4 years ago

Hi team,

I'm a data scientist on the AICoE team. Some time back, we added failure prediction models to the diskprediction_local module in Ceph upstream (PR). Would it be possible to expose predictions from these models on the "devices" Grafana dashboard in some way?

Basically, we want to ensure these predictions make their way to users / SMEs / the open source community, so that they can give us feedback. This feedback would be incredibly valuable for improving the existing models, and also for better understanding whether the kind of output we're providing is useful or whether there is something better we could provide. Let's have a discussion here :)

cc @yaarith @MichaelClifford @durandom

durandom commented 4 years ago

@yaarith @yehudasa any suggestions on how to move this forward?

yaarith commented 4 years ago

Thanks for bringing this up; indeed, we plan to add a disk failure prediction to the devices dashboard. We are currently considering whether to implement it on the server side, rather than sending the data from the client side.

durandom commented 4 years ago

@yaarith I think it's already implemented on the client side, i.e. the code is built into ceph - see the PR that @chauhankaranraj linked above. IIUC all the information (smart stats) is already being sent via telemetry, it's just a matter of applying the model and exposing the result somewhere. Or am I missing something?

yaarith commented 4 years ago

Of course, the disk failure prediction is already implemented on the client side and built into Ceph. Indeed, the SMART data is collected on the client side, then sent via telemetry.

What I meant was to generate the prediction on the server side, instead of sending the prediction result along with the device's telemetry.

Thinking long term - this way we can:

Hope it makes more sense now :-)

durandom commented 4 years ago

@yaarith ok. then we're thinking along the same lines. Can you point us in the right direction where we would plug in the code for running the prediction server-side? I guess it's all in this repository here?

yaarith commented 4 years ago

@durandom Yes, it's going to be in this repo. The device's PR will soon be available here. We are still considering how to trigger the prediction on the server side; we probably want to do it on demand when the device's x-ray page is loaded, but that's a bit of a challenge in Grafana. Let me update you on the design once we've settled on it. For now we can start with a command line tool that, given a disk id and its history in the database, returns its prediction.

Regarding the prediction itself, we basically wish to have:

  1. A model that predicts the read error rates (or any error / attributes set that can serve as a good indication of the disk's health). The model would look at time series and predict the future trajectory of the read error rate for the next x days.

  2. A model that predicts the probability of a catastrophic failure over time.

A possible scenario is that a user is willing to tolerate a disk with up to 5 read errors: Model 1 lets them know that the error rate will probably reach 5 in 9 days; Model 2 lets them know that the chance the disk is about to meet its makers (vendors...?) is 90% in 11 days.

We don't have enough data yet from our devices telemetry for this scale of model training, so Backblaze's data can help with training (at least for hard disks).

chauhankaranraj commented 4 years ago

Thanks for the feedback @yaarith, I like these ideas.

For now we can start with a command line tool that, given a disk id and its history in the database, returns its prediction.

Could this just be a python script? Also just to clarify, is this something that the ceph team is planning on working on, or should the AIOps team work on this?

1. A model that predicts the read error rates (or any error / attributes set that can serve as a good indication of the disk's health). The model would look at time series and predict the future trajectory of the read error rate for the next x days.

Currently, we don't have a forecasting model like you've described. For the next steps, I could start looking into building one using backblaze data.

2. A model that predicts the probability of a catastrophic failure over time.

For this, I think we have two options.

  1. Use the model trained on backblaze (the one on upstream rn)
  2. Use the model trained on internal data. Although, since it hasn't been trained on a lot of data and the "labels" are possibly inaccurate, I think this would be less accurate than the backblaze one. Nonetheless, if we use this, then the SME feedback on it could be used to improve the labeling in our internal dataset, which will be beneficial long term.

What do you think?

yaarith commented 4 years ago

Sounds good, @chauhankaranraj!

Could this just be a python script? Also just to clarify, is this something that the ceph team is planning on working on, or should the AIOps team work on this?

Sure, it could be a python script. I think it would be a joint effort. Obviously, you can do the AI magic, and we can do the integration.

A couple of questions please:

For example: We want to predict the health of a random hard disk in the database. The disk sent a daily report with its health metrics every day for the past 100 days. How many reports should we use in model 1 (which predicts the error rate)? How many in model 2 (which predicts the chance of a catastrophic failure)? Is there a minimum but no maximum (e.g. we need at least 7 reports, but it would be better to use as many as possible)? Is there any state between runs that should be stored in either model? Is it optional?

Currently, we don't have a forecasting model like you've described. For the next steps, I could start looking into building one using backblaze data.

Sounds good! Please let me know if you have any questions.

  2. Use the model trained on internal data. Although, since it hasn't been trained on a lot of data and the "labels" are possibly inaccurate, I think this would be less accurate than the backblaze one. Nonetheless, if we use this, then the SME feedback on it could be used to improve the labeling in our internal dataset, which will be beneficial long term.

There are several issues with the current device telemetry data, for instance:

I don't expect the telemetry data to be too useful for this purpose at this point, so Backblaze's data comes to the rescue in this case too :-) That said, we might be able to use telemetry for SSD and NVMe training sometime in the future.

chauhankaranraj commented 4 years ago

I think it would be a joint effort. Obviously, you can do the AI magic, and we can do the integration.

For the "AI Magic" part, IIUC the existing upstream model already does what we want for "Model 2" - i.e. given 6-12 days of SMART data from a device, predict device health as good (likely to live >6 weeks), medium (likely to fail in 2-6 weeks), or bad (likely to fail in <2 weeks). So it seems like it's a just matter of running it and showing the results on grafana? Did I understand this correctly?

For the integration bit (wrapping a cmd line tool around the model), if it's unclear what the model takes as input or what it produces as output, I'm happy to go through that :)

  • How many data points are needed in each model in order to generate the outcome?

The upstream model requires at least 6 days of data to generate a prediction. If there's >6 days of data, then it won't throw an error, but won't really make any good use of that extra data either.

  • Do the models consume the entire device's history on each run, or do they keep a state?

It doesn't save state across runs, and doesn't consume the entire history either. Just the most recent 6 days.

This design choice was made because module.py, which is the script that initializes and calls the models, sends 6-12 days of SMART data to the models. It throws an error if there's <6 days (this line), and ignores data that is >12 days old (this line). This is something that already existed upstream; I'm not sure if it was contributed by ProphetStor or someone else in the Ceph community. So I didn't modify this file, but instead adapted the models to take the kind of input (6-12 days of SMART data) that it already provides.
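
For illustration, here is a minimal sketch of that windowing constraint; the function name and structure are hypothetical, not the actual module.py code:

def select_prediction_window(daily_smart_records):
    # daily_smart_records: list of per-day SMART dicts, oldest first.
    # The health model needs at least 6 and uses at most the 12 most recent days.
    if len(daily_smart_records) < 6:
        raise ValueError("need at least 6 days of SMART data to generate a prediction")
    return daily_smart_records[-12:]  # anything older than the last 12 days is ignored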

Let me know if something doesn't make sense or if this information failed to answer any of your questions :)

For the next steps, does the following sound reasonable:

  1. ceph team could work on integrating model outputs to grafana, with aiops providing support where needed
  2. aiops team could start looking into training a SMART metric forecasting model ("Model 1") using backblaze data. For this, we'd need SME input on what metric should we forecast (read errors? write errors?), but for now we can choose an arbitrary metric as a placeholder, and then later on change it according to feedback.

chauhankaranraj commented 4 years ago

For the next steps, does the following sound reasonable:

  1. ceph team could work on integrating model outputs to grafana, with aiops providing support where needed
  2. aiops team could start looking into training a SMART metric forecasting model ("Model 1") using backblaze data. For this, we'd need SME input on what metric should we forecast (read errors? write errors?), but for now we can choose an arbitrary metric as a placeholder, and then later on change it according to feedback.

Hey @yaarith, what do you think of these suggestions? Could I get an ack or nack please? :smiley: cc @MichaelClifford @durandom

yaarith commented 4 years ago

Hi @chauhankaranraj,

I have a simplified version of the integration piece ready: https://github.com/yaarith/ceph-telemetry/commit/8e935c724b2ba95e68a6ad9d8c1976f833ce9066#diff-eae904122631a212414a644c37a483b8

predict_device.py is a command line tool which receives a device_id (note it’s not the uuid, but an internal database serial id), and currently returns its (very simple) failure prediction:

$ ./predict_device.py 1669
Prediction result for 1669 is: PreditionResult.FAIL

In model.py I added a placeholder for the actual model you wrote. Can you please inject its code there? Input samples of SMART data to your current model are found in /input_samples.
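
To give a rough idea of the kind of entry point meant here (this is only an illustrative sketch, not the actual contents of the model.py placeholder; all names are hypothetical):

# model.py -- hypothetical shape of the injection point
def predict_failure(device_history):
    # device_history: dict keyed by scrape timestamp, each entry holding
    # an 'attr' dict of SMART attributes (as in the files under /input_samples).
    # The trained diskprediction model would be loaded and applied here;
    # until then, raise to make the missing piece obvious.
    raise NotImplementedError("inject the trained model here")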

There’s a sample Grafana (currently private) dashboard to present these prediction results:

[screenshot: Grafana disk prediction dashboard]

After you integrate your code we can display the results of that model.

So it seems like it's a just matter of running it and showing the results on grafana? Did I understand this correctly?

In this model (the existing one, or "Model 2") we wish to get finer granularity (instead of “good”, “medium”, “bad”) or a more fine-grained time prediction, if possible.

Regarding Model 1, is it possible to tell:

  1. ceph team could work on integrating model outputs to grafana, with aiops providing support where needed

The infrastructure is ready on our side; we'll be happy to see your code injected :-)

  2. aiops team could start looking into training a SMART metric forecasting model ("Model 1") using backblaze data. For this, we'd need SME input on what metric should we forecast (read errors? write errors?), but for now we can choose an arbitrary metric as a placeholder, and then later on change it according to feedback.

Great, definitely Backblaze's data. Who were the SMEs who helped you with the existing model? Can we consult with them regarding the new model?

chauhankaranraj commented 4 years ago

I have a simplified version of the integration piece ready: yaarith@8e935c7#diff-eae904122631a212414a644c37a483b8

predict_device.py is a command line tool which receives a device_id (note it’s not the uuid, but an internal database serial id), and currently returns its (very simple) failure prediction:

$ ./predict_device.py 1669
Prediction result for 1669 is: PreditionResult.FAIL

This is great! Thanks a lot Yaarit, much appreciated! :pray:

In model.py I added a placeholder for the actual model you wrote. Can you please inject its code there?

Will do :)

Input samples of SMART data to your current model are found in /input_samples.

I'm not up to date with developments in smartmontools so you'd likely know better. But it seems to me that the JSONs in ./input_samples are a bit different from smartctl JSONs (different key-value structure, normalized values not present). Is this intentional, or were they supposed to look exactly like smartctl JSONs?

In this model (the existing one, or "Model 2") we wish to get finer granularity (instead of “good”, “medium”, “bad”) or more fine grained time prediction, if possible.

Getting a finer time-to-failure granularity is a bit tricky for the existing model, since it's a classification model (vs a regression model). If we want that, then we'd have to modify the training setup and train an entirely new model instead. That said, it is possible to get finer insight into failure using the existing model. In addition to the prediction, we could also show the confidence in it. E.g. "the model is 80% confident that the device is warning (4-6 weeks till failure), 15% confident that the device is bad, and 5% confident it is good". Would that be a reasonable middle ground?
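
As a rough sketch of what surfacing that confidence could look like, assuming a scikit-learn-style classifier (the estimator, labels, and random features below are purely illustrative, not the upstream model):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in: random features in place of flattened 6-day SMART windows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 30))
y_train = rng.choice(["good", "warning", "bad"], size=200)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def predict_with_confidence(model, features):
    # Return per-class confidence instead of only the hard label.
    proba = model.predict_proba(features.reshape(1, -1))[0]
    return dict(zip(model.classes_, proba))

print(predict_with_confidence(clf, rng.normal(size=30)))
# e.g. {'bad': 0.15, 'good': 0.05, 'warning': 0.80}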

  • How many data points will be needed for it to generate the outcome?

I think 6 data points, one from each of the past 6 days. Having more data could be helpful, but if we want to leave the door open for integrating "Model 1" upstream, then we need to ensure compatibility with the existing codebase. That is, make sure it works when there's only 6 days of data. So at the moment, generating the outcome using 6 days of data sounds more appealing to me.

  • Will it need to keep a state?

AFAICT it won't need to save state :)

yaarith commented 4 years ago

The JSON structure of the SMART attributes in the sample files is different since it’s generated from a database table which holds the attributes. This allows for higher efficiency and performance in fetching and processing.

The input format is a dictionary, where the keys are the timestamps of the SMART metrics scraping (e.g. “2020-07-20 00:07:47”). These timestamp keys are sorted (ascending). Each timestamp holds an “attr” key, which is a dictionary. The (also sorted, ascending) keys in this “attr” dictionary are the SMART attribute ids ("1", "3", etc.). The value is another dictionary which holds 2 keys: the attribute's name and its raw value.

For example:

{
    "2020-07-29 00:07:47": {
        "attr": {
            "1": {
                "name": "Raw_Read_Error_Rate",
                "val_raw": 6336408
            },
            "3": {
                "name": "Spin_Up_Time",
                "val_raw": 0
            },
            "4": {
                "name": "Start_Stop_Count",
                "val_raw": 16
            },
            ...
        }
    },
    "2020-07-30 00:12:09": {
        "attr": {
            "1": {
                "name": "Raw_Read_Error_Rate",
                "val_raw": 13545260
            },
            "3": {
                "name": "Spin_Up_Time",
                "val_raw": 0
            },
            "4": {
                "name": "Start_Stop_Count",
                "val_raw": 16
            },
            ...
        }
    },
    ...
}
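
For illustration, a sample in this format could be loaded and flattened roughly like this (the helper below is only a sketch, not code from this repo):

import json

def load_device_history(path):
    # Returns (sorted scrape timestamps, {timestamp: {attr_id: raw_value}}).
    with open(path) as f:
        history = json.load(f)
    timestamps = sorted(history)  # keys are "YYYY-MM-DD HH:MM:SS" strings, so lexical sort is chronological
    per_day = {
        ts: {attr_id: attr["val_raw"] for attr_id, attr in history[ts]["attr"].items()}
        for ts in timestamps
    }
    return timestamps, per_day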

Regarding the normalized values: Does your model utilize them regardless, or only in the case where the raw data is not present?

Getting a finer time-to-failure granularity is a bit tricky for the existing model, since it's a classification model (vs a regression model). If we want that, then we'd have to modify the training setup and train an entirely new model instead. That said, it is possible to get finer insight into failure using the existing model. In addition to the prediction, we could also show the confidence in it. E.g. "the model is 80% confident that the device is warning (4-6 weeks till failure), 15% confident that the device is bad, and 5% confident it is good". Would that be a reasonable middle ground?

It would be great if it's possible to reach a finer granularity on both the time and confidence axes. I have the user in mind, and we wish to deliver a better assessment of the disk's health state.

Having more data could be helpful, but if we want to leave the door open for integrating "Model 1" upstream, then we need to ensure compatibility with the existing codebase.

I wouldn’t worry about the compatibility adjustments, since we care more about having a more accurate prediction :-)

AFAICT it won't need to save state :)

In case we choose to use significantly more data points (more than 6-12), is it possible to keep state between runs?

chauhankaranraj commented 4 years ago

The JSON structure of the SMART attributes in the sample files is different since it’s generated from a database table which holds the attributes. This allows for higher efficiency and performance in fetching and processing.

Gotcha, thanks for the clarification! :)

Regarding the normalized values: Does your model utilize them regardless, or only in the case where the raw data is not present?

Yes, the model uses both raw and normalized SMART values

It would be great if it's possible to reach a finer granularity on both the time and confidence axes. I have the user in mind, and we wish to deliver a better assessment of the disk's health state.

Ah okay, understood. Would it make sense to add these optimizations and finer predictions incrementally (do I hear a "release early, release often")? That is, start by surfacing the existing coarse-grained model predictions ("Model 2"), then build "Model 1" and integrate that, and then work on making it finer grained and more accurate. My rationale is that we don't know for sure yet which type of model (Model 1 or Model 2) users would find more helpful and which one they would want to see improved. So it might be a good idea to focus our optimization efforts where they're needed the most. Does that sound reasonable?

In case we choose to use significantly more data points (more than 6-12), is it possible to keep state between runs?

Hmm, I think saving state within the models might be difficult. Is there a specific reason we'd want to do this, instead of just saving state to the database?

yaarith commented 4 years ago

Yes, the model uses both raw and normalized SMART values

I'll add them to the sample files. How does the model make use of both raw and normalized values?

Would it make sense to add these optimizations and finer predictions incrementally (do I hear a "release early, release often")? That is, start by surfacing the existing coarse-grained model predictions ("Model 2"), then build "Model 1" and integrate that, and then work on making it finer grained and more accurate. My rationale is that we don't know for sure yet which type of model (Model 1 or Model 2) users would find more helpful and which one they would want to see improved. So it might be a good idea to focus our optimization efforts where they're needed the most. Does that sound reasonable?

Sounds good!

Hmm, I think saving state within the models might be difficult. Is there a specific reason we'd want to do this, instead of just saving state to the database?

I meant keeping state between runs (when applying a model) in the database :-)

chauhankaranraj commented 4 years ago

How does the model make use of both raw and normalized values?

So the Backblaze dataset contains both raw and normalized values for each SMART metric. That is, it has two columns per SMART metric, one containing the raw value and the other containing the normalized value. So the model takes as input both the raw and normalized values for a given set of SMART metrics; that is, it treats them as separate features. Does that answer your question?

p.s. for a glimpse of what that dataframe looks like, you could check out the last cell of this notebook.
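
To make the "separate features" point concrete, here is a tiny pandas sketch over Backblaze-style column names (the values are made up for illustration):

import pandas as pd

# Backblaze daily drive stats carry paired columns per SMART attribute,
# e.g. smart_5_raw / smart_5_normalized; both are kept as model features.
df = pd.DataFrame({
    "smart_5_raw":          [0, 16, 16],
    "smart_5_normalized":   [100, 100, 98],
    "smart_187_raw":        [0, 0, 3],
    "smart_187_normalized": [100, 100, 97],
    "failure":              [0, 0, 1],
})

feature_cols = [c for c in df.columns if c.startswith("smart_")]
X, y = df[feature_cols], df["failure"]  # raw and normalized columns enter the model as separate features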

yaarith commented 4 years ago

I wonder how the model makes use of the normalized values, in the sense of which insights are derived from them compared with the raw ones. Do they have weights? Which value is considered more “reliable” in training and applying the model, for instance: smart_1_raw which is 148579464, vs. smart_1_normalized which is 117?

tchaikov commented 4 years ago

@yaarith and @chauhankaranraj sorry to jump into the middle of your discussion, i just learned from Rick that they stopped offering the cloud-based disk health prediction service. see https://github.com/ceph/ceph/pull/36557#issuecomment-673297190 .

so based on your discussion, it seems we are interested in making the disk prediction on the server side of ceph-telemetry. to me, i see it as our own cloud-based disk health prediction service. am i right? if that's the case, does this imply that instead of removing ceph-mgr-diskprediction-cloud from ceph, we can reimplement it with the service offered by, for instance, https://telemetry.ceph.com/device ?

yaarith commented 4 years ago

@tchaikov thanks for the update!

Yes, sounds good :-)

  1. We need to discuss data sharing licensing. Currently telemetry data is shared under https://cdla.io/sharing-1-0/; we might need to add this when opting in to the service, depending on how it operates. We might need to have two tiers of licenses - one with sharing, and one without.

  2. Currently the telemetry module sends metrics to the telemetry server, but does not receive a reply. The idea was to display statistics and predictions via a Grafana dashboard on the server side. So for displaying the prediction on the client side we'll need to implement the "replying" part.

chauhankaranraj commented 4 years ago

I wonder how the model makes use of the normalized values, in the sense of which insights are derived from them compared with the raw ones. Do they have weights?

Yep you're right, they do have weights. Please see the attached screenshot showing feature importances (weights) for the top 15 features.
[screenshot: feature importances (weights) for the top 15 features]

Which value is considered more “reliable” in training and applying the model, for instance: smart_1_raw which is 148579464, vs. smart_1_normalized which is 117?

I think it can't be guaranteed that features derived from raw values are always more reliable than those derived from normalized values, or vice versa. For some SMART metrics (e.g. 5, 7) raw values are more important than normalized values, but for some others (e.g. 187, 190) raw values are less important than normalized values.
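
For reference, a self-contained sketch of how such per-feature weights can be read off a tree-based classifier (scikit-learn assumed; the data below is synthetic, not the actual training set):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_cols = ["smart_5_raw", "smart_5_normalized", "smart_187_raw", "smart_187_normalized"]
X = pd.DataFrame(rng.normal(size=(100, len(feature_cols))), columns=feature_cols)
y = rng.integers(0, 2, size=100)  # 0 = healthy, 1 = failed

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))  # weights comparable to the screenshot above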

yaarith commented 4 years ago

@chauhankaranraj thanks!

Here are the updated sample files with the normalized SMART values: https://github.com/ceph/ceph-telemetry/commit/067ff2cadbde046e2653cb051f27e42cc3409c88

chauhankaranraj commented 4 years ago

Here are the updated sample files with the normalized SMART values:

tysm @yaarith, much appreciated :smile:

Is there a way we can get the values stored in the user_capacity key of the smartctl JSONs as well? That's the only other thing the model requires that's not currently in the sample data.

yaarith commented 4 years ago

Hi @chauhankaranraj,

I pushed the changes here: https://github.com/ceph/ceph-telemetry/commit/6fe47766e3f4f9a71f59673be153494a1c9492c2

Since capacity is the same per device, I put it in a single key for the entire input sample ('capacity_bytes'), rather than per scraping date. Let me know if you have any questions.
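
For illustration, the updated layout could be read roughly like this (a sketch assuming 'capacity_bytes' sits at the top level alongside the per-timestamp entries):

import json

def load_sample(path):
    with open(path) as f:
        sample = json.load(f)
    capacity_bytes = sample.pop("capacity_bytes")  # one value per device
    history = sample                               # remaining keys are scrape timestamps
    return capacity_bytes, history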