ai4co / rl4co

A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
https://rl4.co
MIT License

Major modeling refactoring #165

Closed fedebotu closed 4 months ago

fedebotu commented 4 months ago

Description

This PR is for a major, long-due refactoring to the RL4CO codebase :smile:

Motivation and Context

So far, we have mostly overfitted RL4CO to the autoregressive Attention Model structure (encoder-decoder). However, several models do not necessarily follow this structure, such as DeepACO. Implementing such a model requires changes to the structure, which then stops being standardized, and it could be hard for newcomers to implement a different model type. For this reason, some rethinking of the library on the modeling side is necessary!

[!TIP] Note that in RL4CO we refer to the model as the RL algorithm and the policy as the neural network that, given an instance, returns a sequence of actions $\pi_0, \pi_1, \dots, \pi_N$, i.e. the solution. In other words: the model is a LightningModule that trains the policy, which is a nn.Module.
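For context, here is a minimal sketch of that split (placeholder classes and a placeholder batch/output format, not the actual RL4CO interfaces):

```python
import torch
import torch.nn as nn
from lightning import LightningModule


class MyPolicy(nn.Module):
    """Hypothetical policy: maps a batch of instances to actions and their log-likelihood."""

    def forward(self, batch):
        # encode the instances, decode actions pi_0, ..., pi_N, return rewards / log-probs
        raise NotImplementedError


class MyModel(LightningModule):
    """Hypothetical model: the RL algorithm (e.g. REINFORCE) that trains the policy."""

    def __init__(self, policy: nn.Module):
        super().__init__()
        self.policy = policy

    def training_step(self, batch, batch_idx):
        out = self.policy(batch)  # roll out the policy on a batch of instances
        # a REINFORCE-style loss: maximize reward-weighted log-likelihood
        loss = -(out["reward"] * out["log_likelihood"]).mean()
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.policy.parameters(), lr=1e-4)
```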

New structure

With the new structure, the aim is to categorize NCO approaches (which are not necessarily trained with RL!) into the following: 1) constructive, 2) improvement, 3) transductive.


1) Constructive (policy)

1a) Autoregressive (AR)

Autoregressive approaches use a decoder that outputs log probabilities over the next action given the current partial solution. These approaches generate a solution step by step, similarly to e.g. LLMs. They typically have an encoder-decoder structure (e.g. AM). Some models may not have an encoder at all and simply re-encode at each step (e.g. BQ-NCO).
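To make the step-by-step construction concrete, here is a rough sketch of an AR rollout (hypothetical `encoder`, `decoder`, and environment interface; not the RL4CO API):

```python
import torch


def autoregressive_rollout(encoder, decoder, instance, env):
    """Sketch of AR construction: choose one action per step until the solution is complete."""
    embeddings = encoder(instance)  # encode the instance once (models like BQ-NCO skip this
                                    # and re-encode the partial solution at every step)
    state = env.reset(instance)
    actions, log_probs = [], []
    while not state.done:
        logits = decoder(embeddings, state)               # scores over the next action
        log_p = torch.log_softmax(logits, dim=-1)         # log probabilities
        action = log_p.exp().multinomial(1).squeeze(-1)   # sample (or argmax for greedy)
        state = env.step(state, action)
        actions.append(action)
        log_probs.append(log_p.gather(-1, action.unsqueeze(-1)).squeeze(-1))
    # solution pi_0, ..., pi_N and its total log-likelihood
    return torch.stack(actions, -1), torch.stack(log_probs, -1).sum(-1)
```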

1b) NonAutoregressive (NAR)

The difference between AR and NAR approaches is that NAR approaches only use an encoder (they encode in one shot) and generate, for example, a heatmap, which can then be decoded either simply by using it as a probability distribution or by running some search method on top of it (e.g. DeepACO).
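For intuition only, a sketch of the NAR idea (hypothetical names; here the heatmap is decoded by sampling, whereas e.g. DeepACO would instead run an ant-colony search on top of it):

```python
import torch


def nonautoregressive_rollout(encoder, instance, env):
    """Sketch of NAR construction: encode once into a heatmap, then decode without the network."""
    heatmap = encoder(instance)  # e.g. (num_nodes, num_nodes) edge scores, produced in one shot
    state = env.reset(instance)
    actions = []
    while not state.done:
        scores = heatmap[state.current_node]                        # scores of candidate next nodes
        scores = scores.masked_fill(~state.action_mask, float("-inf"))
        probs = torch.softmax(scores, dim=-1)
        action = probs.multinomial(1).squeeze(-1)                   # or feed the heatmap to a search method
        state = env.step(state, action)
        actions.append(action)
    return torch.stack(actions, -1)
```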


2) Improvement (policy)

These methods differ from constructive NCO in that they can improve solutions over time, similarly to how local search algorithms work. This is different from decoding strategies or similar techniques in constructive methods, since these policies are explicitly trained to perform improvement operations.
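As a rough sketch of how such a policy could be used at inference time (hypothetical interface, not the final RL4CO improvement API):

```python
def improvement_rollout(policy, instance, initial_solution, env, n_steps=100):
    """Sketch of improvement NCO: start from a solution and let the policy refine it."""
    solution = initial_solution
    best_solution, best_cost = solution, env.cost(instance, solution)
    for _ in range(n_steps):
        # the policy looks at the instance *and* the current solution, then
        # proposes an improvement operation (e.g. a 2-opt or swap move)
        move = policy(instance, solution)
        solution = env.apply_move(solution, move)
        cost = env.cost(instance, solution)
        if cost < best_cost:
            best_solution, best_cost = solution, cost
    return best_solution, best_cost
```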

Note: You may have a look here for the basic constructive NCO policy structure! ;)


3) Transductive (model)

[!TIP] Read the definition of inductive vs transductive RL. In inductive RL, we train to generalize to new instances. In transductive RL, we train (or fine-tune) to solve only specific ones.

Transductive models are learning algorithms that optimize on a specific instance: they improve solutions by updating the policy parameters $\theta$, which means that we run optimization (backprop) during online testing. Transductive learning can be performed with different policies: for example, EAS updates (a part of) the AR policy parameters to obtain better solutions, and there may be other ways (or papers out there I don't know of) to optimize at test time.
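Conceptually, this amounts to running a small training loop on the test instance itself (a sketch with placeholder names, not the actual EAS implementation; EAS in particular only updates a small set of added parameters rather than the whole policy):

```python
import torch


def transductive_search(policy, instance, env, n_iters=50, lr=1e-3):
    """Sketch of transductive NCO: optimize (a subset of) policy parameters on one instance."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)  # backprop happens at test time
    best_reward = float("-inf")
    for _ in range(n_iters):
        actions, log_likelihood = policy.rollout(instance)  # sample candidate solutions
        reward = env.get_reward(instance, actions)
        loss = -(reward * log_likelihood).mean()            # REINFORCE-style objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        best_reward = max(best_reward, reward.max().item())
    return best_reward
```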


| Category | Input | Output | Description |
| --- | --- | --- | --- |
| Constructive | Instance | Solution | Amortized policy generates solutions from scratch. Can be categorized into Autoregressive (AR) and NonAutoregressive (NAR) approaches. |
| Improvement | Instance, Current Solution | Improved Solution | Policies trained to improve existing solutions iteratively, akin to local search algorithms. Different from constructive methods as they focus on refining solutions rather than generating them from scratch. |
| Transductive | Instance, (Parameters) | Solution, (Updated Parameters) | Updates policy parameters during online testing to improve solutions. Can use various policies for optimization, such as EAS updates for AR policies. |

In practice, here is what the structure looks like right now:

rl4co/
└── models/
    ├── common/
    │   ├── constructive/
    │   │   ├── base.py 
    │   │   ├── autoregressive/
    │   │   │   ├── encoder.py
    │   │   │   ├── decoder.py
    │   │   │   └── policy.py
    │   │   └── nonautoregressive/
    │   │       ├── encoder.py
    │   │       ├── decoder.py
    │   │       └── policy.py
    │   ├── improvement/
    │   │   └── base.py # TBD
    │   └── transductive/
    │       └── base.py
    ├── nn # generic neural network
    ├── rl # generic RL models
    └── zoo # literature

Changelog

Types of changes

TODO

Extra


Special thanks to @LTluttmann for your help and feedback~

Do you have some ideas / feedback on the above PR? CC: @Furffico @henry-yeh @ahottung @bokveizen Also tagging @yining043 for the coming improvement methods

fedebotu commented 4 months ago

Talking to a few people, it seems that the naming "Transductive" is preferred over "Search", since "search" is too broad in scope and the line between what each algorithm specifically does is a bit blurred. "Transductive" means "directly optimize the parameters for a specific instance", which conveys the meaning more easily!

bokveizen commented 4 months ago

> Talking to a few people, it seems that the naming "Transductive" is preferred over "Search", since "search" is too broad in scope and the line between what each algorithm specifically does is a bit blurred. "Transductive" means "directly optimize the parameters for a specific instance", which conveys the meaning more easily!

Yep! I remember you mentioned this before, and that was what I used :-)

fedebotu commented 4 months ago

While working on the metaclasses, I noticed that the NonAutoregressive[...] classes are directly callable. We should modify this so that the GNN model belongs to the zoo and is called from there.

cbhua commented 4 months ago

A quick abstract look at the current RL4CO structure (diagram: rl4co_quick_look).

fedebotu commented 4 months ago

> A quick abstract look at the current RL4CO structure (diagram: rl4co_quick_look).

Nice! Careful though, because "Transductive" models are RL algorithms that "fine-tune" policies on specific instances, like EAS.

fedebotu commented 4 months ago

[!IMPORTANT] Thanks for your revisions! We are planning to merge the PR into main tomorrow - if you have any additional comments / modifications / bugfixes, please let us know!