Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Progress bar for non interactive environments #12185

Open austinmw opened 2 years ago

austinmw commented 2 years ago

🚀 Feature

I'd like to request a progress bar for non-terminal, non-interactive environments, such as Amazon CloudWatch. The TQDM and Rich progress bars are not ideal for tracking progress in these simpler logging environments.

Additionally, I'd like to request the ability to automatically adapt to iterable datasets of known or unknown length, displaying percentage-based progress in the former case and a simple iteration count in the latter. (The default TQDM progress bar produced errors for me with an unknown-length iterable webdataset.)

Motivation

I'm training on both EC2 and Amazon SageMaker, and neither the TQDM nor the Rich progress bars seem appropriate for monitoring progress in CloudWatch. I asked on Slack whether a more appropriate alternative or configuration setting was available, and was advised that nothing currently exists, but that I should create a GitHub issue for one.

Pitch

At the most basic level, the progress indicator would print something like this on rank 0: [Epoch 0 | Iteration 100] train_loss: 0.01, val_acc: 0.74

Another useful feature would be optionally passing a logger to use instead of print, so for example if you pass a logger that writes to stdout: LoggerName - INFO - [Epoch 0 | Iteration 100] train_loss: 0.01, val_acc: 0.74

If the epoch has a known total number of batches, additional information on the completion percentage would be nice. For example: [Epoch 0 | Iteration 100/450 (22.22% complete)] train_loss: 0.01, val_acc: 0.74
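The proposed output could be sketched as a small formatting helper (the function name and signature are hypothetical, purely to illustrate the lines above):

```python
def format_progress(epoch, iteration, metrics, total=None):
    """Build a plain-text progress line like the examples above.

    `total` is the number of batches per epoch if known; when None,
    only the iteration count is shown (useful for iterable datasets
    of unknown length).
    """
    if total is not None:
        pct = 100.0 * iteration / total
        position = f"{iteration}/{total} ({pct:.2f}% complete)"
    else:
        position = str(iteration)
    stats = ", ".join(f"{k}: {v}" for k, v in metrics.items())
    return f"[Epoch {epoch} | Iteration {position}] {stats}"


# Known epoch length:
#   format_progress(0, 100, {"train_loss": 0.01, "val_acc": 0.74}, total=450)
# Unknown epoch length:
#   format_progress(0, 100, {"train_loss": 0.01, "val_acc": 0.74})
```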

Alternatives

I guess whenever anyone needs simpler progress tracking, they could write their own custom progress meter based on ProgressBarBase.
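For reference, a rough framework-agnostic sketch of what such a meter might look like (the class name, method, and `refresh_rate`/`logger` parameters are all hypothetical; a real Lightning version would subclass ProgressBarBase and emit these lines from the batch-end hooks instead):

```python
class SimplePrintProgress:
    """Hypothetical line-based progress meter for non-interactive logs.

    Emits one plain-text line every `refresh_rate` batches, via an
    optional logger (e.g. writing to stdout) or plain print().
    """

    def __init__(self, refresh_rate=50, logger=None):
        self.refresh_rate = refresh_rate
        self.logger = logger  # if None, fall back to print

    def _emit(self, msg):
        if self.logger is not None:
            self.logger.info(msg)
        else:
            print(msg)

    def on_batch_end(self, epoch, batch_idx, total_batches, metrics):
        iteration = batch_idx + 1
        if iteration % self.refresh_rate:
            return  # only log every `refresh_rate` batches
        if total_batches is not None:
            pct = 100.0 * iteration / total_batches
            position = f"{iteration}/{total_batches} ({pct:.2f}% complete)"
        else:
            position = str(iteration)
        stats = ", ".join(f"{k}: {v}" for k, v in metrics.items())
        self._emit(f"[Epoch {epoch} | Iteration {position}] {stats}")
```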

cc @borda @carmocca @awaelchli

rohitgr7 commented 2 years ago

From my experience with CloudWatch, one needs to push a print statement/log to track something. I used to do a simple print, but my use case wasn't ML-related.

We should investigate whether any open-source project already does this and consider integrating it within Lightning, or explore possible ways to log to CloudWatch or some other monitoring service and build our own.

austinmw commented 2 years ago

Yeah, I set the stream to stdout so it functions like print. I do this before importing PyTorch Lightning (it would be nice if the order didn't matter, too):

```python
import logging
import sys  # needed for sys.stdout below

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(message)s',
    level=logging.INFO,
    datefmt='%H:%M:%S',
    stream=sys.stdout,
)
log = logging.getLogger('SageMaker')
log.setLevel(logging.DEBUG)
```

I would say that using a CloudWatch-specific logging utility might be good for EC2, but probably not for SageMaker. With SageMaker, anything printed to stdout is automatically sent to CloudWatch Logs under the dynamic training job name, so a separate CloudWatch client could create weird duplicate logging.