apoorvkh / torchrunx

Automatically initialize distributed PyTorch environments
https://torchrunx.readthedocs.io
MIT License
1 stars 0 forks source link

Switch from "printing" to "logging" #30

Closed apoorvkh closed 2 days ago

pmcurtin commented 1 month ago

Should we use Python's built-in logging? we could also use some third-party logging library - not sure if you have any favorites

apoorvkh commented 1 month ago

Yes, we should use Python's built-in logging.

I saw this video a while ago, but I am no expert and there may be a better resource. https://www.youtube.com/watch?v=9L77QExPmI0

We should certainly keep it as simple as possible.

apoorvkh commented 1 month ago

Perhaps we should stream all logs up from the workers -> agents -> launcher and then we can choose how to filter and save logs at the launcher-level, such that the user can customize this. We can include some default behaviors and utilities.

pmcurtin commented 1 month ago

I think I'll adapt this solution: https://docs.python.org/2.7/howto/logging-cookbook.html#sending-and-receiving-logging-events-across-a-network

apoorvkh commented 1 month ago

Sounds good -- just be aware those are Python 2 docs :)

apoorvkh commented 1 month ago

also fyi, I keep seeing that print statements appear at the end of the agent log / launcher terminal, whereas logger.info statements appear in the "realtime" order

we can see about that after the current logging adjustments

pmcurtin commented 1 month ago

That's pretty strange. Well, yeah we'll see what it's like once we get logging working.

pmcurtin commented 1 month ago

@apoorvkh Should we still provide a way for worker output to be captured? Else I can just remove all the existing log stuff and we can completely depend on the logging library.

apoorvkh commented 1 month ago

I was thinking that all the workers (e.g. in entrypoint) and agents (e.g. in main) would stream their logs directly to the launcher. Was there something else you had in mind?

pmcurtin commented 1 month ago

That's what I'm doing. I was just curious whether you thought stdout/stderr should also be captured. idk, I guess code in libraries that people might use could output to those streams rather than log? for now I'll ignore it.

apoorvkh commented 1 month ago

stdout/stderr should also be captured

Oh is there a difference between: logging output and std output?

Yeah, I think we should also be streaming the stderr/stdout.

pmcurtin commented 1 month ago

Oh is there a difference between: logging output and std output?

Well, the way we're streaming the logs is a little more advanced than just sending strings from the workers/agents to the launcher. We're sending LogRecord objects which get processed by the logging library. These objects are only created when people log using logger.debug(msg)...

Yeah, I think we should also be streaming the stderr/stdout.

Ok! Just for single worker, perhaps? The launcher process should print this?

apoorvkh commented 1 month ago

Hmm, I guess I don't understand enough about how you're implementing this. I think we need to talk about it in more detail tomorrow.

logger.debug(msg)

I think we should be streaming all logging events, not just debug. Alternatively, maybe we don't need to stream "logging events", but just stderr / stdout, which should include logging statements (at whichever granularity is specified).

Ok! Just for single worker, perhaps? The launcher process should print this?

No, I think we should stream everything. Later, we can include a filtering mechanism to choose what the launcher process will print.

apoorvkh commented 1 month ago

I guess it's good to stream "logging events", rather than exclusively just an stdout/err byte stream. Because that will offer deeply granular control for filtering outputs from the launcher.

apoorvkh commented 1 month ago

Also, maybe we can just redirect stdout to logging

E.g. https://stackoverflow.com/questions/11124093/redirect-python-print-output-to-logger