determined-ai / works-with-determined

This repository contains example integrations between Determined and other ML products
Apache License 2.0

Update hpe-mlds-build-hpcm-os-image.md #12

Closed vishnu2kmohan closed 1 year ago

vishnu2kmohan commented 2 years ago

Ensure that Docker containers relating to the HPE Machine Learning Development System log to journald so that their logs are automatically aggregated to the HPCM Admin node under /var/log/HOSTS/<HPCM_node_name>.
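As a rough illustration of the per-container approach described above, the sketch below builds the journald logging flags for a single `docker run` invocation. The container name and image are hypothetical placeholders, not values from this PR, and the command is only printed rather than executed.

```shell
# Hypothetical placeholders: neither name comes from the PR itself.
CONTAINER_NAME="mlds-example"
IMAGE="example/image:tag"

# Emit the per-container logging flags so each service can reuse the same set.
# --log-driver and --log-opt tag= are standard Docker CLI options.
journald_log_flags() {
  printf '%s\n' "--log-driver=journald" "--log-opt" "tag=$1"
}

# Print the command instead of running it, so the sketch is safe to try anywhere.
echo docker run -d --name "$CONTAINER_NAME" $(journald_log_flags "$CONTAINER_NAME") "$IMAGE"
```

With a container started this way, its entries should be filterable on the node with `journalctl CONTAINER_NAME=mlds-example`, since the journald driver attaches the container name as a journal field.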

dannysauer commented 2 years ago

Can we not modify the docker config to log all containers to journald? It seems like configuring every individual container leaves it open to potentially missing one somewhere.

vishnu2kmohan commented 2 years ago

> Can we not modify the docker config to log all containers to journald? It seems like configuring every individual container leaves it open to potentially missing one somewhere.

I'm reluctant to set the log-driver globally at the Docker Engine level because all container logs would then be shipped to journald, which has historically had trouble keeping up with massive log volumes. Very large multi-GPU and/or multi-node distributed training trials can easily trigger rate limits and/or drop logs, especially if debug mode is enabled.
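For context, the global alternative being discussed would look roughly like the following Docker Engine configuration. This is a sketch only: it writes to a scratch file rather than the usual `/etc/docker/daemon.json`, and it is shown to make the trade-off concrete, not as a recommendation.

```shell
# Write to a scratch file; the real file is typically /etc/docker/daemon.json
# and the Docker daemon must be restarted for a change there to take effect.
DAEMON_JSON="$(mktemp)"

cat > "$DAEMON_JSON" <<'EOF'
{
  "log-driver": "journald"
}
EOF

# Show the resulting engine-wide setting.
cat "$DAEMON_JSON"
```

With this setting every container on the host, including high-volume training trial containers, would log to journald unless it overrides the driver itself, which is the behaviour the comment above is wary of.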

dannysauer commented 2 years ago

I did a little digging on performance tuning systemd-journald, and apparently Poettering just doesn't care about logs. So yeah, it's not a great place for high volume, and can't easily be made to be. It might realistically not be the right place for anything for which we actually want some near-guarantee of delivery. :)
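For reference, the rate limiting mentioned in this thread is controlled by the `RateLimitIntervalSec=` and `RateLimitBurst=` settings in `journald.conf`. The sketch below writes an example drop-in to a scratch file; the values are illustrative assumptions, not tested recommendations, and relaxing them does not address the underlying throughput concern.

```shell
# Scratch file standing in for a drop-in such as
# /etc/systemd/journald.conf.d/<name>.conf (the file name is up to the admin).
JOURNALD_DROPIN="$(mktemp)"

cat > "$JOURNALD_DROPIN" <<'EOF'
[Journal]
# Illustrative values only: allow up to 20000 messages per service per 30s.
RateLimitIntervalSec=30s
RateLimitBurst=20000
EOF

cat "$JOURNALD_DROPIN"
```

On a real node, systemd-journald would need to be restarted after installing such a drop-in for the new limits to apply.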

dannysauer commented 1 year ago

This is fine.gif