We are now developing transformer-based algorithms for offline agents.
Such algorithms often require a large amount of expert data as demonstrations for training.
To cope with large-scale data, a common solution is to employ the Distributed Data Parallel (DDP) mode to accelerate training.
As discussed here, after we wrap the model with the DDP wrapper, the parameters saved by calling `state_dict()` cannot be directly loaded by the original model.
The official torch developers suggest fetching the underlying module (`model.module`) and saving its parameters.
Examples can be found here.
The PARL agents must support this feature in both the `save` and `restore` functions.
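As a minimal sketch of the problem (not PARL's actual implementation; the helper name below is hypothetical), the mismatch can be reproduced without a GPU: DDP prefixes every parameter key with `module.`, so a checkpoint saved from the wrapped model no longer matches the unwrapped model's keys. The torch-recommended fix is to save `model.module.state_dict()` in the first place; stripping the prefix at load time achieves the same effect:

```python
# Hypothetical sketch of the DDP checkpoint key mismatch.
# DDP wraps the model, so every key in its state dict gains a
# "module." prefix that the original (unwrapped) model does not expect.

def strip_ddp_prefix(state_dict, prefix="module."):
    """Drop the DDP wrapper prefix so keys match the unwrapped model.

    Equivalent in effect to the torch-recommended approach of saving
    model.module.state_dict() in the first place.
    """
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state_dict.items()}

# A state dict as saved from a DDP-wrapped model (tensor values elided).
ddp_state = {"module.fc.weight": [0.1, 0.2], "module.fc.bias": [0.0]}

clean_state = strip_ddp_prefix(ddp_state)
print(sorted(clean_state))  # ['fc.bias', 'fc.weight']
```

In a real agent, `save` would unwrap the module before serializing, and `restore` would load the cleaned state dict into the unwrapped model.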