equinor / gordo

An API-first distributed deployment system of deep learning models using timeseries data to predict the behaviour of systems
GNU Affero General Public License v3.0

RFC: Reflect build-errors back out again #777

Open epa095 opened 4 years ago

epa095 commented 4 years ago

**Problem:** We build some models and fail others, but to figure out why a model failed you must find the Argo workflow containing that model and inspect it. Often the exit code is enough; other times you must look at the log of the pod. How can we expose this information out of the cluster? Thoughts?

Thoughts

  1. I think this information should be gathered and exposed in some reasonable way inside the k8s cluster first, and then we can simply expose it over HTTP using a dumb server, instead of building a smart HTTP server that knows a lot about the internal k8s setup (a sketch of such a server follows at the end of this comment). Agree?

If we agree on the above point, where in k8s should this information be?

  1. We can create a model object for failed models as well, containing the status (Failed) and the exit code of the container. This means that `kubectl get models` doesn't give working models, but rather desired models, and it can be filtered on the status (see the sketch after this list). gordo-controller can still write some summary statistics into the Gordo (e.g. number of failed models per exit code), but "the truth" is in the models.
  2. We can create a failed-model object. But this seems quite weird compared to how other k8s objects are handled.
  3. We can store the information about failed models back into (and only in) the Gordo. So either we write the status/exit code directly back into the config dictionary, or, maybe better, add another map (for example in the status field) from model name to exit code/status. Then the Gordo functions as a kind of log. Problem with this: the Gordo is already pressed for size, and this will increase it a bit.
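To make option 1 concrete, here is a minimal sketch of how a consumer inside the cluster could pick out failed models if they are stored as Model objects with their outcome in the status field. The CRD group/version/plural and the status/exit-code field names below are assumptions for illustration, not the actual gordo-controller definitions.

```python
# Sketch of option 1: failed models also exist as Model custom resources,
# with the outcome and container exit code written into .status.
# Group/version/plural and status fields are assumed, not the real CRD.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

models = api.list_namespaced_custom_object(
    group="equinor.com",   # assumed CRD group
    version="v1",          # assumed CRD version
    namespace="gordo",     # wherever the gordo lives
    plural="models",
)

for m in models["items"]:
    status = m.get("status", {})
    if status.get("phase") == "Failed":  # assumed status field
        print(m["metadata"]["name"], status.get("exitCode"))  # assumed field
```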

I guess a core question is: does `kubectl get models` give desired models or successful models?
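And to illustrate point 1 under Thoughts: the dumb HTTP server would then only relay whatever is already stored in the cluster, knowing nothing about the Argo workflows themselves. A rough sketch, again with assumed resource names and status fields:

```python
# Minimal sketch of the "dumb" HTTP server: it just reads Model custom
# resources and returns their status/exit code as JSON. The CRD
# coordinates and status fields are assumptions for illustration.
from flask import Flask, jsonify
from kubernetes import client, config

app = Flask(__name__)
config.load_incluster_config()
api = client.CustomObjectsApi()

@app.route("/models")
def list_models():
    objs = api.list_namespaced_custom_object(
        group="equinor.com", version="v1",  # assumed CRD coordinates
        namespace="gordo", plural="models",
    )
    return jsonify([
        {
            "name": o["metadata"]["name"],
            "status": o.get("status", {}).get("phase"),        # assumed field
            "exit_code": o.get("status", {}).get("exitCode"),  # assumed field
        }
        for o in objs["items"]
    ])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```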

ryanjdillon commented 4 years ago

If we were to use another service for aggregating logs, I found these while poking around: fluentd and ELK on Kubernetes. I like the idea of having a Kibana dashboard with all the essential deets on broadcast.

As for the core question, I like the idea of getting desired models and then grepping on status, etc.

flikka commented 4 years ago

Maybe @milesgranger has thoughts on this.