OasisLMF / OasisPlatform

Loss modelling platform.
BSD 3-Clause "New" or "Revised" License

Investigate - large file downloads failing from the Azure-hosted platform #677

Closed by sambles 2 years ago

sambles commented 2 years ago

Issue Description

We have found an issue with the scalable setup when an analysis generates a large result file. For instance, we have a portfolio that generates a ~180 MB result. This causes the oasis server to crash when calling the analyses/{id}/output_file/ API endpoint.

This might be connected to https://github.com/OasisLMF/OasisPlatform/issues/652. If that's the case, downloads should fail for any large file that takes more than 1 min to fetch.
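
To check the "more than 1 min" theory, it may help to time a chunked download from the client side. The following is only a rough sketch using the Python standard library: `copy_stream` and `timed_download` are hypothetical helper names, and the URL and any auth headers are placeholders that would need to match the actual deployment.

```python
import time
import urllib.request
from typing import BinaryIO, Dict, Optional, Tuple


def copy_stream(src: BinaryIO, dst: BinaryIO,
                chunk_size: int = 1024 * 1024) -> Tuple[int, float]:
    """Copy src to dst in fixed-size chunks.

    Returns (bytes copied, seconds elapsed) so a run can be compared
    against the suspected ~60 s cut-off.
    """
    start = time.perf_counter()
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total, time.perf_counter() - start


def timed_download(url: str, dest_path: str,
                   headers: Optional[Dict[str, str]] = None,
                   timeout: float = 300.0) -> Tuple[int, float]:
    """Download url to dest_path and report size and duration.

    Assumption: url points at the output_file endpoint of a test
    deployment and headers carries whatever auth that deployment needs.
    Note that timeout applies to each blocking socket operation,
    not to the download as a whole.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(dest_path, "wb") as out:
        return copy_stream(response, out)
```

If runs of `timed_download` fail consistently once the elapsed time passes roughly one minute, regardless of file size, that would point towards the timeout in the linked issue rather than the file size itself.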

sambles commented 2 years ago

Note: long-lived downloads appear to block Daphne (ASGI), causing the server's liveness probe to fail. This may be unrelated to the original report, but it still needs fixing.

  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Normal   Pulled     72m                   kubelet  Successfully pulled image "acroasisplatformbenchmark.azurecr.io/coreoasis/api_server:dev" in 580.368232ms
  Normal   Pulled     63m                   kubelet  Successfully pulled image "acroasisplatformbenchmark.azurecr.io/coreoasis/api_server:dev" in 245.076249ms
  Normal   Pulled     18m                   kubelet  Successfully pulled image "acroasisplatformbenchmark.azurecr.io/coreoasis/api_server:dev" in 1.530477015s
  Normal   Killing    14m (x10 over 3h31m)  kubelet  Container oasis-server failed liveness probe, will be restarted
  Normal   Started    13m (x11 over 3h41m)  kubelet  Started container oasis-server
  Normal   Pulling    13m (x11 over 3h41m)  kubelet  Pulling image "acroasisplatformbenchmark.azurecr.io/coreoasis/api_server:dev"
  Normal   Created    13m (x11 over 3h41m)  kubelet  Created container oasis-server
  Normal   Pulled     13m                   kubelet  Successfully pulled image "acroasisplatformbenchmark.azurecr.io/coreoasis/api_server:dev" in 221.088388ms
  Warning  Unhealthy  10m (x23 over 3h32m)  kubelet  Liveness probe failed: Get "http://xx.xx.xx.xx:8000/api/healthcheck/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
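
One way to picture the suspected failure mode in the note and events above: if a long download runs synchronously on the ASGI event loop, nothing else on that loop gets served until it finishes, including a healthcheck. The sketch below is plain `asyncio`, not Daphne or Django code. `slow_read` stands in for a long file fetch and `probe` for the healthcheck; both names are hypothetical, and whether this matches what actually happens inside the server is an assumption.

```python
import asyncio
import time


def slow_read(seconds: float) -> bytes:
    """Stand-in for a long, synchronous file fetch."""
    time.sleep(seconds)
    return b"data"


async def probe() -> float:
    """Stand-in for the healthcheck: time how long a tiny await takes."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)
    return time.perf_counter() - start


async def blocking_download(seconds: float) -> None:
    # Runs directly on the event loop, so every other task is stalled.
    slow_read(seconds)


async def offloaded_download(seconds: float) -> None:
    # Runs in the default thread pool, so the loop stays responsive.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, slow_read, seconds)


async def measure(download, seconds: float = 0.3) -> float:
    """Return the probe latency observed while a download is in flight."""
    probe_task = asyncio.create_task(probe())
    download_task = asyncio.create_task(download(seconds))
    latency = await probe_task
    await download_task
    return latency


if __name__ == "__main__":
    print("probe latency, blocking: ", asyncio.run(measure(blocking_download)))
    print("probe latency, offloaded:", asyncio.run(measure(offloaded_download)))
```

In the blocking case the probe cannot complete until the simulated download has finished, which would be consistent with the "context deadline exceeded" liveness failures if a real download runs longer than the probe timeout.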