Unify usage of multiple callbacks

Borda commented 4 years ago

🚀 Feature

Simplified API, with callbacks... as e.g. Keras did, pass just list of callbacks to be executed and Trainer will call then when needed instead of having them specified https://github.com/PyTorchLightning/pytorch-lightning/blob/b1040523b2180300574d961444b00abfa3c84195/pytorch_lightning/trainer/trainer.py#L65-L66

mentioned also in https://github.com/PyTorchLightning/pytorch-lightning/issues/825#issuecomment-588226411

Borda commented 4 years ago

@ethanwharris lest split the discussion and talk about callbacks here...

hadim commented 4 years ago

I don't plan to incorporate checkpoint_callback and early_stop_callback callbacks in the new system to avoid making this #889 too big but that should be done in a future PR.

Same for the loggers logic, they should use the callback system IMO (all the custom callbacks I use are loggers...).

Borda commented 4 years ago

Same for the loggers logic, they should use the callback system IMO (all the custom callbacks I use are loggers...).

interesting idea, and true that logger is a special case of a callback, just not sure if I won't be too confusing for a user...

hadim commented 4 years ago

For the user, it's easy to make this change invisible. Just make the Logger class inherit from Callback and modify the appropriate loggers to use the new mechanism instead of the custom logger's functions. Everything should be backward compatible.

Borda commented 4 years ago

sure making the parents of loggers does not disturb any user, I was more-less thinking about passing callbacks and lobber in the same argument (list) instead of having an argument for callbacks and loggers separately

hadim commented 4 years ago

But loggers and callbacks are the same things so I don't see any issue and new users should no be disturbed IMO.

About existing users and backward compatibilities, if the documentation is clear about it, everything should be fine.

ethanwharris commented 4 years ago

I'm not convinced that there is an advantage to making Loggers into Callbacks. Currently there is a lot of inter-dependency between checkpointing and logging (the checkpointer directory depends on logger name / version for example), I can see that getting quite messy if everything is of the same class and invoked with the same command. There's also the problem of callback hell, where everything can become a callback if the API is flexible enough, which we should probably avoid. That said, I'm not against the idea if it doesn't disturb the user, just not sure what value it would bring in that case.

Borda commented 4 years ago

@PyTorchLightning/core-contributors what do you think? is it reasonable/beneficial? :robot:

jeremyjordan commented 4 years ago

my two cents:

sure making the parents of loggers does not disturb any user, I was more-less thinking about passing callbacks and lobber in the same argument (list) instead of having an argument for callbacks and loggers separately

I think it will be more user friendly to have the trainer keep these separate (eg. Trainer(..., logger=True, callbacks=[]))

There's also the problem of callback hell, where everything can become a callback if the API is flexible enough, which we should probably avoid.

+1

as I mentioned in #595, I think we should allow user defined callbacks passed in as a list and execute the callbacks in the order which they're provided in the list.

Trainer(..., callbacks=[custom_callback_1, custom_callback_2, model_checkpoint_callback])

one thing that we'll want to consider is what information we'll expose to the callbacks. do we want to expose all model attributes?

williamFalcon commented 4 years ago

@Borda i'm not sure i love this idea. this means you wouldn't get early stopping for free or checkpointing for free. most of the time users don't care about the details of those and a simple True is enough. This means now you'd have to specify all the details.

i think this clutters the API and makes it unnecessarily complex. We moved the callbacks to default arguments after a ton of feedback on people being annoyed they had to init them EVERY time.

and frankly, the keras callbacks are super unintuitive and confusing.

So, unless there's a clear API win, i'm going to push back on this.

hadim commented 4 years ago

I think keeping a True argument for things like a logger, early stopping, checkpointing, etc is important as it allows a to have a good default behavior without having to define manually everything. Adding a list of callbacks is just an extra argument that gives more control and flexibility.

Both things are not exclusive.

williamFalcon commented 4 years ago

but then we have multiple ways of doing something which is confusing. ie: do i pass in the callback to early stopping or the callbacks list?

having many choices for something is bad and against our principles. We want to force user to make a single decision rooted in best practices and sensible defaults.

Borda commented 4 years ago

I was thinking not breaking the APPI as it is not so there will be still an option to set a checkpoint true, but inside the trainer, it would create a callback instance which will be added to list with others passed via callback argument...

jeremyjordan commented 4 years ago

@Borda I was thinking something similar, either appending the early stopping and model check pointing callbacks at the beginning or end of the list.

callbacks = [*user_provided, model_checkpoint, early_stopping]

I agree that losing early stopping and model check pointing “for free” would be an undesirable regression.

Moreover, as @williamFalcon pointed out there are still some confusing cases that need to be ironed out before considering moving forward with this.

As an example, right now you can construct your own ModelCheckpoint object and pass it in to the checkpoint_callback arg. If we also have a callbacks arg in the Trainer, how will the user know where to provide their ModelCheckpoint object? Should the checkpoint_callback arg only accept True/False for setting default modes but user provided objects go into the callbacks arg?

To quote the Zen of Python: “there should be one — and preferably only one — obvious way to do it.”

jeremyjordan commented 4 years ago

What if we started by simplifying how the internal callbacks are executed, make it easier for contributors to add callbacks to the official library, and revisit user-define callbacks at a later date?

Right now, the callbacks must be manually invoked across the Trainer code. Here's an proposal:

Keep the trainer args as is (eg. checkpoint_callback, early_stop_callback, accumulate_grad_batches)
Internally, construct a callbacks attribute based on the above argument values (eg. callbacks = [early_stop_callback, checkpoint_callback, ...]
Build a CallbackHandler object which invokes these callbacks, in the order provided, at each model hook.

This way, if a new user wants to contribute a Callback to the library, they simply have to write the Callback class and determine where in the list of callbacks it should be placed (order of execution).

jeffling commented 4 years ago

What if we started by simplifying how the internal callbacks are executed, make it easier for contributors to add callbacks to the official library, and revisit user-defined callbacks at a later date?

I like that, although it would be incredibly useful to go all the way and expose the default callback stack in a way that could be overridden (ie defined in LightningModule).

Anecdotally, I noticed some users are already hacking their own callbacks in by editing Trainer code. Ideally this solution could get us to a point where the number of users overriding Trainer is at a minimum.

williamFalcon commented 4 years ago

@jeffling could u give an example of such a hack?

ethanwharris commented 4 years ago

I think there are some issues with the idea of a CallbackHandler that we should try to avoid.

If all of the callbacks can be handled in the same object - they must all use the same method names. This leads to things like on_forward, on_backward etc. It's surely easier for the user to extend LoggerBase, or CheckpointBase... and just implementlog_metricsorsave_checkpoint` and so on
If this is only under the hood then why do it? To understand the order callbacks are called in you now have to go back to where the handler is initiated rather than just looking at the code - so we lose some readability and gain nothing in terms of usability

I would suggest that there should be well defined and easy to implement abstractions for standard operations (logging, checkpointing, ...), which are called separately in the codebase via their own Collection classes. If we need more flexibility than that, then we can have callbacks that are more general, although I'm not sure what the use case is.

My concern is that trying to be too flexible is what led to an unusable callback or event handler system in: torchbearer, ignite, catalyst, keras or a host of other libraries. We should be aware that all of those libraries got it wrong, and try to avoid making the same mistakes.

festeh commented 4 years ago

Such use cases definitely exist, mine for example is when one wants to add a test, which checks something at runtime. Here I had to hack a module instead of passing a callback. Also, if you check into this libraries you'll find some examples, that you cannot use in lightning now as a callback. I believe many of users have their own cases.

My proposal is to add a new class, named for example CallbackTrainer, which does not have logger, early_stop_callback etc. arguments, but instead has a callbacks argument. It should give user the desired flexibility in exchange for free features.

ethanwharris commented 4 years ago

Yeah, I can see that there are good use cases for custom callbacks, I just don't think we should bundle them all into one class along with loggers and checkpoints and the rest. My proposal would be that we have a LoggerCollection type thing (see #903), an equivalent for checkpoints etc. and then an additional object for custom callbacks that don't fit one of the other archetypes. Loggers, Checkpointers, Callbacks etc. can then be given to the Trainer as lists or on their own or as booleans. Internally, we should either have a CallbackHandler as suggested above (but with clear seperate arguments for loggers, checkpointers, callbacks etc. that are then called seperately in the code), or just invoke each of the collections when needed in the Trainer. The idea of a CallbackTrainer might be a nice thing to do, but shouldn't be strictly needed to get the right outcome.

festeh commented 4 years ago

Personally, I like your idea, it also should be easier to implement and maintain than new Trainer class. My proposal was in response to

but then we have multiple ways of doing something which is confusing. ie: do i pass in the callback to early stopping or the callbacks list?

which does no posess such ambiguity.

hadim commented 4 years ago

The limitation of the current system is that it does not give control on what should be logged and when as in a callback system where a hook is called during the different steps of the loop such as on_train_begin, on_test_end, etc.

Here the only method is log_metrics and we do not have control over when this is called.

For example, I have currently a callback system embedded in my module (which is bad but hopefully temporary...) with the following loggers:

a python file logger that writes some information on_fit_begin and then at each on_epoch_end
a matplotlib logger than write a plot at each x epoch interval on_epoch_end
a CSV logger that writes on disk at each on_epoch_end and on_test_end but in different files (because test metrics only have one epoch).
a custom nested progress bar that keeps a global track of the progress (across all epochs), so it needs to log at on_fit_begin, on_fit_end and on_epoch_end.
a custom tensorboard logger (because I need to log training and validation metrics in different TB logging file).

I insist on the fact that defaults should be kept as it is now because they are great and make the training loop setup easy and quick but a certain level of flexibility should be given in order to:

Give the user control over when something should be logged or done (if we generalize the logic): on_fit_begin, on_epoch_end, etc.
Allow having multiple loggers/callbacks for the same training loop.

Here is a humble proposal https://github.com/PyTorchLightning/pytorch-lightning/pull/889

As about the callback hell, I don't really understand what it means if you could give a specific example? IMO, it's just a matter of designing a good API, then you obviously don't have control over how the user uses it...

hadim commented 4 years ago

About the confusion between checkpoint_callback and a checkpoint_callback in a callbacks list, we could or raises an error or print a warning if such cases arise.

jeremyjordan commented 4 years ago

If all of the callbacks can be handled in the same object - they must all use the same method names. This leads to things like on_forward, on_backward etc. It's surely easier for the user to extend LoggerBase, or CheckpointBase... and just implement log_metrics or save_checkpoint and so on

I think we may be talking about different things here. If we want to expose customization for logging and checkpointing, then I agree it's absolutely simpler to have users just implement log_metrics or save_checkpoint. However, I don't anticipate customization for logging and checkpointing to be a common use case for custom callbacks.

@hadim does provide a good example, though, with the desire to have control over multiple loggers and their respective behavior. It's a good example in the sense that he's not calling each logger uniformly (ie. do the same action at the same time across all the loggers) which doesn't look like it'd be supported by a LoggerCollection. It seems like you could achieve this by logging through the model hooks (correct me if I'm wrong, please!) provided that we had a way to access multiple loggers.

class ExampleModel(pl.LightningModule):
    def __init__(self, ...):
        pass
    def on_train_start(self):
        self.logger.log_hyperparams(self.hparams)    # by default, use root logger
        self.logger['FileLogger'].save(x)            # get logger by name
    def on_epoch_end(self):
        self.logger['MatplotlibLogger'].save(x)
        self.logger['CSVLogger'].log_metrics(metrics)

I would imagine that the desire for a Callback would be a situation where we want to have more distinct separation of code logic. Is this what other people would expect as well?

My concern is that trying to be too flexible is what led to an unusable callback or event handler system in: torchbearer, ignite, catalyst, keras or a host of other libraries. We should be aware that all of those libraries got it wrong, and try to avoid making the same mistakes.

Could you elaborate on what those libraries got wrong? Where were the pain points? I've used callbacks in Keras (sparingly) and found it to be useful. I think the most common way to enter callback hell is to have a rigid framework which leads to callbacks being the only place for flexibility. Fortunately, PyTorch Lightning is quite flexible so I don't worry about this case so much.

In my opinion, we should collect more information on the following two points:

What are the weaknesses of callbacks and eventhandler systems in torchbearer, ignite, catalyst, keras or a host of other libraries?
What are concrete use cases where users would like callback functionality?

ethanwharris commented 4 years ago

Haha, yeah I could have been more clear there. Ok, so I'll try to summarise my thoughts on this and then bow out. At least for now, the callback system in #889 can be merged as it's definitely an improvment on what is there currently. I suppose my point is that there are several design patterns we could opt for here that aren't the same as Keras or #889, so some discussion on which is best might benefit us in the long run.

Patern 1 - Keras

From what I rememer of Keras, you have callbacks with defined hooks (e.g. on_start etc.) which also take arguments (e.g. batch, metrics etc.). So you might have the following:

class MyCallback(Callback):
    def on_end(self, model, metrics):
        model.do_thing(metrics)

The first issue with the above is that it confuses what should happen with when it should happen. In a better API those would ideally be seperate.
The second issue is flexibility - this pattern takes only certain arguments - what if you want to access something else?

Pattern 2 - Torchbearer, FastAI-V2, #889

The pattern used in #889 and FastAI-V2 and that we adopted in torchbearer is based on the above but with improved flexibility. To acheive this you swap arguments for access to some global object which has everything. It looks something like this:

class MyCallback(Callback):
    def on_end(self):
        self.mega_object.model.do_thing(self.mega_object.metrics)

The advantage of this is massive flexibility. If you are also allowed to set properties on the mega_object then there is really nothing you can't achieve.
The first disadvantage is the same as in Keras, there is no distinction between what happens and when it happens. This makes some use cases a bit harder to work with, particularly if you want one thing to run at several different times, you can end up with a lot of code repetition.
The second disadvantage, which I think is much more important, is the massive learning curve. To be able to use this kind of system properly you have to know what is available, when it's available and how to use it - for every single property of the mega_object . This was a problem in torchbearer - I can acheive anything I want to - but no one else can without massive investment in learning the library. I can see this being a problem with lightning, every added Mixin brings a number of new properties, how many before these become impossible to keep track of? This wouldn't be such a problem if only advanced users were using custom callbacks, but in my experience people usually opt for the most flexible option (e.g. implementing a callback), so I expect it won't be long until every lightning project has one or two custom callbacks.

Pattern 3 - Ignite

Ignite uses event handlers. I don't remember all of the details but you could do something like this:

class MyCallback(Callback):
    def __call__(self):
        self.mega_object.model.do_thing(self.mega_object.metrics)

Trainer.add_callback(MyCallback(), Events.END)

This has the advantage that we have seperated the concerns. In the logging at different times example, you write one logging function and then bind it to different points.
The potential downside is that it's less flexible, there might be use cases where we need to modulate what is done at what time, were the above becomes a little painful.
We also still have the learning curve problem

Pattern 4 - Potential option - Fluent Interface

It's also possible to suggest some things that haven't been done before. Personally I like the results you get from employing a fluent interface. It's a bit like a builder pattern without the .build() and with an emphasis on human readable code. It would look something like this:

class MyCallback(Callback):
    def __call__(self):
        self.mega_object.model.do_thing(self.mega_object.metrics)

Trainer(callbacks=[MyCallback().call_on_train_end()]

This again seperates the concerns, but in a perhaps more readable way than ignite.
It still has the learning curve problem, although that could be alleviated if we have callback archetypes (e.g. Logger) which can unpack the right stuff from mega_object and pass it on.

Apologies for the long post and apologies for being difficult. My concern is just that we should be very deliberate in the API decisions that we make. There are downsides to every option, but we should try to avoid being saddled with them by accident. There's also potential to do something new here, that breaks the norm established by keras. Ignite has definite issues, but it's a refreshing take on the problem, which in itself has merit. I guess that we continue down the current path (i.e. merging #889), at least until we have callabcks in lightning, and then think about ways that they can be improved long term.

williamFalcon commented 4 years ago

i like this discussion! i think we’re on to something here. I’d rather debate this some more until we get to a new system that doesn’t have those limitations and strictly improves flexibility without sacrificing readability (in fact we should strive for improving both).

let’s re-focus

I suggest we rephrase this issue from “let’s add callbacks” to “it’s hard to do xyz in lightning. How can we solve for that?” The answer may not be callbacks :). This way we focus on solving the problem instead of finding a use case for an old solution.

These kind of major decisions take time to think about (API design is not easy lol), and we shouldn’t rush this decision.

I think the way to move forward is

describe 5 use cases that are complex or impossible in lightning.
agree on those.
brainstorm solutions not limited by what others have done. ios, web and android paradigms can give good inspiration.

past solutions

I think callbacks in these libraries are bugs and not features. Added to make up for poorly thought out APIs. It’s the catch all to not wanting to think hard about this (like the utils folder in most projects).

I really like @ethanwharris point that just because those libraries did it doesn’t make it correct. I also think they got it wrong. Going back to a callback system defeats the point of lightning and we regress into having a bunch of unstructured files that break readability and reproducibility.

I don’t want to get in the habit of copying other libraries because we assume they did it right... I posit that libraries fall out of flavor because they do the same thing as other libraries. we need to think differently and make our own independent decisions based on what we think the future of frameworks looks like.

Callbacks may not be the answer

I’m still not convinced about callbacks because i don’t see a huge pain point that they solve. please illustrate a few so I can change my mind :)

Why wouldn’t i define some random function i want to execute in the lightning module and call it during one of the hooks?

Isn’t this equivalent?

LightningModule:
    def my_function_to_do_something_crazy():
        do crazy stuff

     def on_epoch_end:
         self. my_function_to_do_something_crazy()

vs

Callback:

    on_epoch_end():
        do crazy stuff

readability is paramount. Monotonic improvements.

However we solve this, keep in mind readability for reproducibility purposes! We’ve made huge headway in making DL code easier to understand... a callback system as proposed in the other libraries now means i have to dig througj files and unrelated functions to figure out what’s going on. You can imagine a gan defined over a sequence of callbacks which would make it extremely difficult to see what’s happening.

Remember that part of our role is to force a style guide for non engineers as well that is as flexible as possible for professional researchers.

williamFalcon commented 4 years ago

@tullie @srush @thomwolf @luiscape any thoughts?

hadim commented 4 years ago

I don't want to have a callback system to do the same way as Keras or another ML library did. I think it's needed because its flexibility-by-design is what makes it able to solve potentially all or at least almost-all use-cases.

For example in #889, as @ethanwharris already said I implement a callback system where no arguments are passed to the callback functions. It's much less confusing to have a global trainer object than a list of obscure's arguments passed to the callback's methods.

I think it's important to describe use-cases (such as what I did already) but what if we miss an important one? or what if a use-case that looks improbable today become standard tomorrow?

PL (Pytorch-Lightning) is already a callback system by default and I think that is why it's so great. Instead of having a single fit() function on which you can only pass a list of callbacks to modify its behavior here you can have two very well separated callback systems:

the current LightningModule where you define your training logic, the science that should be reproducible and serializable.
a general-purposes callback system that IMO should not modify the behavior of the training loop but only do boring things and easily transferable in-between LightningModule.

You are correct all of this can be done in the current LightningModule. This is what I am already doing and it works but it makes the code less readable and doesn't really separate the science from the boring things. Moreover, I would like to easily reuse callbacks in-between different LightningModule.

About the comparison with other libraries, see Keras for example. The main reason why people where using Keras's callbacks to modify the training loop and do crazy stuff was because of the fit() function. This can't happen with PL and the flexible LightningModule training loop anymore.

Or actually it can, people can use LightningModule in a way that is not intended for, they could also fork and hack the Trainer class or they could also use a callback system to modify the training loop itself. But this is not something you can really prevent. What you can do and I think what you are trying to do right now is to prevent such behaviors by default. But it's already done IMO, because of the flexibility of the LightningModule training loop. People won't have to create crazy callbacks anymore.

Please don't refuse a callback system because people haven't used it correctly in the past. Or to be more specific because the library they were using were kind-of forcing them to hack the training loop using the callback system.

You won't prevent people using PL the bad way, the best you can do is to encourage best-practices and with PL you're already doing it by providing a callback-style training loop and have good default options.

API design is thought indeed. But I think it's good to have an in-depth discussion about it as we do here.

justusschock commented 4 years ago

Hey guys, I'm not (yet) a member of the lightning team, but I am a contributor of ignite and answering here on invitation of @williamFalcon .

I strongly agree with @hadim and no matter which system you opt for, you should definitely choose one IMO. This way you can increase your flexibility for the future a lot. Personally, I disagree with the argument, that this decreases readability. Some parts that currently have to be done in the LightningModule don't actually belong there, if you really think, that only the research stuff should be in there. But since currently there is not callback system for the trainer, users don't have another choice.

And if you want to include a callback system, I recommend one that somehow includes what @ethanwharris called mega_object, since without this, the flexibility will be limited and not as attractive for some users as it would be with that object. Along with proper tutorials and examples the learning curve should be okay for the users actually needing the callback.

Of course I really like the ignite like system and would also recommend that one, but I might be a bit biased, since I'm also contributing there.

jeremyjordan commented 4 years ago

I like @williamFalcon 's proposal to focus the discussion on the problems at hand. To bring back @hadim 's example, it seems like that source of dissatisfaction is (1) lack of separation of concerns and (2) inability for reuse across models. One thing that is interesting to note is that the desired functionality is possible today through model hooks - many people have mentioned Lightning is flexible in this manner.

Perhaps the problem could be resolved through more clear documentation and recommendation on design practices? For example, Lightning already leverages Mixins to compose a series of separate concerns. Can we suggest the same thing here?

Define your reusable logic as part of a model hook mix-in.

class CustomHooksMixin(pl.ModelHooks):
    def on_train_start(self):
        self.logger.log_hyperparams(self.hparams)     # by default, use root logger
        self.logger['FileLogger'].save(x)             # get logger by name
    def on_epoch_end(self):
        self.logger['MatplotlibLogger'].save(x)
        self.logger['CSVLogger'].log_metrics(metrics)

And keep the LightningModule focused on model and training definition.

class ExampleModel(CustomHooksMixin, pl.LightningModule):
    def __init__(self, hparams):
        self.hparams = hparams
    def forward(self, data):
        pass
    def training_step(self, batch, batch_idx):
        pass
    def validation_step(self, batch, batch_idx):
        pass
    def validation_end(self, outputs):
        pass
    def configure_optimizers(self):
        pass

hadim commented 4 years ago

Thanks for the proposal that looks interesting but raises multiple issues:

Using mixins in Python is sometimes mandatory as with pl.Trainer given the code complexity but I'd like to avoid to use them when it's not needed. Because you have to use variables and methods that are not instantiated in a given mixin and can create some confusion and could potentially lead to unwanted behavior without rigorous testing as it is done with pl.Trainer. I am not sure Python is the best language for the mixin pattern but as I said in certain situations such as pl.Trainer it's probably the best option.
Since the mixins are part of your pl.LightningModule, pl.Trainer will consider all things happening in your mixins being part of your training loop. For example, in ddp mode you might not want your callbacks to be pickled and sent to multiple nodes but would rather keep that in the main process of the trainer.
What about callback's configuration? Using the mixin pattern you would have to pass the argument of your callbacks and of your pl.LightningModule at the same time. I don't think this pattern is desirable. Same for reproducibility, since IMO callbacks should not be needed for reproducibility and so needed when re-instantiating a pl.LightningModule.

Overall I am not sure the mixin pattern can replace a callback system.

Mixin pattern in pl.LightningModule looks like a nice pattern when you have a complex training loop or/and you want to reuse blocks of logic in-between models. In that situation, it's probably a good solution.

jeremyjordan commented 4 years ago

Because you have to use variables and methods that are not instantiated in a given mixin and can create some confusion and could potentially lead to unwanted behavior

I agree this is awkward.

Overall I am not sure the mixin pattern can replace a callback system.

I should clarify, I'm not proposing that it should 😄 My goal was to explore an alternative design pattern (outside of callbacks) that addresses the use case that you presented for custom reporting (ie. "reuse blocks of logic in-between models").

Same for reproducibility, since IMO callbacks should not be needed for reproducibility and so needed when re-instantiating a pl.LightningModule.

This could be dangerous if a callback does alter the training procedure as making it optional would hinder reproducibility.

williamFalcon commented 4 years ago

i like the brainstorm for different approaches. Can we take a second to list out 3 different use cases we want to solve?

xxxx
yyyy
zzzzz

then any brainstorm solution can be sanity checked to see if it solves the 3 cases. I’m sure we’ll have many solutions that do, but then at that point we can focus on making those solutions more usable.

Rodolphe2005 commented 4 years ago

I'm a user of PL and I felt the need of callbacks for my usecases. As it seems you need some usecases, mine is: During training, I save intermediate plots to better understand how the training is going, I compute metrics, I save embeddings, etc.

Currently, I edited my LightningModule but I don't like much that way of doing it because it adds a lot of logic that is not related to the model (saving plots and defining a forward step are quite different stuffs) so it's harder and longer to read my code, I'd prefer to have the logic separated.

I also sometimes change my mind and would like to remove one of those added functionalities and to do that I need to remove the code or I found a slightly better way : I created a class "instructions" which stores how much I want to do this additional stuffs (how often in terms of epochs for example) and this object instructions is used to give instructions on when to do things or not. It adds more code though.

I feel like callbacks could help me better separate logics and also I can easily remove a callback instead of removing the code or using an object to deactivate the callback.

williamFalcon commented 4 years ago

thanks for the use case! I’m wondering why you can’t do the logging and embedding plotting from the on_epoch_end?

Borda commented 4 years ago

feel like I have accidentally opened a Pandora box even at the beginning I wanted to merge the two parameters in Trainer init... :robot:

Rodolphe2005 commented 4 years ago

thanks for the use case! I’m wondering why you can’t do the logging and embedding plotting from the on_epoch_end?

I definitely can. But I still have the problem that if I want to disable this code, I have to delete the related code in the on_epoch_end on the LightningModule instead of "just" removing the callback function from the callbacks list in the trainer.

This kind of extended functionalities that callbacks are should not appear in the LightningModule (in my opinion). It also help for sharing the model, I have some colleagues that don't care about my callbacks but are interested in the model. They could remove the on_epoch_end code but it's not a good practice to activate/deactive something by adding/removing code I think.

Rodolphe2005 commented 4 years ago

For example, suppose that the checkpointing logic would not be a parameter in the Trainer but some code in on_epoch_end. It would be a pain to remove or add this code whenever I want to activate/deactivate checkpointing. Instead, it's in the Trainer call so it doesn't disturb the readability of the LightningModule

williamFalcon commented 4 years ago

got it. i see what you mean now. main takeaways from your example:

this code is not essential to training so its existence in the lightningModule muddles the core research code.
you want an easy way to enable, disable it.

I guess the key is point 1. It sounds like ideally we want to make sure that your model can still run with or without these callbacks.

So we want to factor DL code into three components?

A. hardware/engineering code (trainer) B. core research logic (LightningModule) C. non-essential callbacks.

This means your model should work with or without C?

hadim commented 4 years ago

My use-cases look exactly the same as @Rodolphe2005 and @williamFalcon you just summarize my main point.

C is non-essential.

Ideally, you would prevent callbacks to modify essential things in A and B but I am not sure this is easily feasible.

williamFalcon commented 4 years ago

awesome. So, I think flavors of this have been proposed in this thread by (@jeremyjordan, @ethanwharris, @hadim). Here's another proposal.

MyCallback(pl.Callback):
    def __init__(self, hook_name):
        self.hook_name = hook_name

    def __call__(self, trainer, pl_module):
         # my custom code I want to execute

callback = Callback('on_epoch_end')
Trainer(callbacks=[callback])

Something like this?

hadim commented 4 years ago

What if a callback needs to do things on multiple events such as on_epoch_end and on_test_end for example?

hadim commented 4 years ago

Here is a more concrete example:

import datetime
import pathlib

import matplotlib.pyplot as plt
import pandas as pd

from pytorch_lightning.callbacks import Callback

class HistoryLogger(Callback):
    def __init__(self, dir_path, plot_history_interval=None):
        super().__init__()

        self.dir_path = pathlib.Path(dir_path)
        self.history_path = self.dir_path / "history.csv"
        self.test_path = self.dir_path / "test.csv"

        self.plot_history_interval = plot_history_interval
        self.history_plot_path = self.dir_path / "history.png"

    def on_epoch_end(self):

        logs = self._trainer.model.logs

        if self.history_path.is_file():
            history = pd.read_csv(self.history_path)
        else:
            history = pd.DataFrame()

        history = history.append(pd.DataFrame([logs]), ignore_index=True)
        history.to_csv(self.history_path, index=None)

        epoch = logs["epoch"]
        if self.plot_history_interval and (epoch % self.plot_history_interval) == 0:
            self._plot_history(history, self.history_plot_path)

    def on_test_end(self):
        metrics = {}
        metrics["date"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        metrics["lr"] = self._trainer.model.get_lr()
        metrics.update(self._trainer.model.epoch_metrics["test"])
        metrics = pd.DataFrame([metrics])
        metrics.to_csv(self.test_path, index=None)

    def _plot_history(self, history, plot_history_path):

        # Guess metric names
        metric_names = history.columns
        filter_names_fn = lambda x: x.startswith("val_") and x not in [
            "date",
            "epoch",
        ]
        metric_names = list(filter(filter_names_fn, metric_names))
        metric_names = [name.replace("val_", "") for name in metric_names]

        # Plot parameters
        basesize = 6
        ratio = 0.7
        ncols = 3

        n_plots = len(metric_names)

        nrows = n_plots // ncols
        if n_plots % ncols > 0:
            nrows += 1
        figsize = int(basesize * ncols), int(basesize * ratio * nrows)
        fig, axs = plt.subplots(ncols=ncols, nrows=nrows, figsize=figsize, constrained_layout=True)  # type: ignore
        axs = axs.flatten()

        for metric_name, ax in zip(metric_names, axs):

            train_metric_name = f"train_{metric_name}"
            if train_metric_name in history.columns:
                ax.plot(history["epoch"], history[train_metric_name], label="train")

            val_metric_name = f"val_{metric_name}"
            if val_metric_name in history.columns:
                ax.plot(history["epoch"], history[val_metric_name], label="val")

            ax.set_xlabel("Epoch")
            ax.set_ylabel(metric_name)

            title = f"Metric: {metric_name}"
            ax.set_title(title, fontsize=18)
            ax.legend()

        fig.savefig(plot_history_path, dpi=200)
        plt.close(fig)

williamFalcon commented 4 years ago

Yeah good point. What if a callback has all the hooks as interface and for each callback you can override the hooks.

MyCallback(pl.Callback):
    def on_test_end(self, trainer, pl_module):
         # other stuff

    def on_epoch_end(self, trainer, pl_module):
         # my custom code I want to execute

hadim commented 4 years ago

Good for me.

The question is do we want to pass arguments on those methods or init the callback with something like set_trainer() as it is done here?

I don't have strong feelings about this.

Borda commented 4 years ago

Yeah good point. What if a callback has all the hooks as interface and for each callback you can override the hooks.

i f you derive it from Callback as it is in #849 it shall have all events...

ethanwharris commented 4 years ago

Haven't figured out if it's possible yet but what about something like this

MyCallback(pl.Callback):
    @on_test_end
    def my_function_name(self, trainer, pl_module):
         # other stuff

    @on_epoch_end
    def my_other_function_name(self, trainer, pl_module):
         # my custom code I want to execute

    @on_epoch_end
    def other_thing_to_happen_at_end(self, trainer, pl_module)
        # do something else

This would solve some readability issues since the functions can have informative names. Also potentially more flexible if we can support mutliple functions binding to the same callback hook. Although I'm not sure how possible it is haha

williamFalcon commented 4 years ago

it’s equivalent to what i wrote. the decorators are unnecessary.

def on_epoch_end: fx1() fx2()

ethanwharris commented 4 years ago

it’s equivalent to what i wrote. the decorators are unnecessary.

def on_epoch_end: fx1() fx2()

Good point! I hadn't thought of that :)

williamFalcon commented 4 years ago

Ok, anyone want to take a stab at doing this?

@ethanwharris @hadim @jeremyjordan ?

Lightning-AI / pytorch-lightning