Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

model reference after ddp does not have trained parameters #747

Closed sneiman closed 4 years ago

sneiman commented 4 years ago

The model reference after ddp training does not appear to be the trained model, but the unaltered model as initialized. The expected behavior is that the model reference after trainer.fit() returns is the trained model.

I am using lightning 0.6, pytorch 1.4, CUDA 10.2, python 3.6.8, Ubuntu 18.04.2 LTS, on a standalone node with 14 cores and 7 gpus, no virtual env, no SLURM.

I train the model as always with the following code snippet:

        # thread / affinity tuning and the ddp rendezvous port
        os.environ['OMP_NUM_THREADS']   = '5'
        os.environ['KMP_AFFINITY']      = "granularity=fine,compact,1,0"
        os.environ['KMP_BLOCKTIME']     = '1'
        os.environ['MASTER_PORT']       = str(args.port+int(log_dec))

        # gaze1 is my LightningModule
        model                           = gaze1(netname, log_path, args.gpus, args.batch_size, args.report_flags, args.log_flags)
        model.desc                      = net_desc
        trainer                         = Trainer(max_epochs=args.epochs, gpus=args.gpus, show_progress_bar=False, gradient_clip_val=5, default_save_path=log_path, num_sanity_val_steps=0, early_stop_callback=None, distributed_backend='ddp', weights_summary=None)
        model.trainer                   = trainer
        trainer.fit(model)
        # trainer.test()                # fails: the trainer thinks the model is None
        trainer.test(model)             # resubmits the job; results look untrained

The first problem I ran into is that a call to trainer.test() fails, as noted elsewhere, so I used trainer.test(model) instead. This produced the results you would expect from an untrained model, even though the validation statistics during training are good and prior experience shows that the test and validation statistics are normally very similar.

I did a few obvious tests, and it is clear that the parameters of the model after trainer.fit() have not been altered, though the parameters on each gpu have been.

Note that trainer.test(model) reloads each gpu as if from scratch. Also note that, for some reason, no checkpoints are being created - just FYI, I have not investigated that yet.
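
For concreteness, this is the kind of check that shows the problem - a sketch only, assuming model and trainer are built as in the snippet above but before fit() is called:

    import copy
    import torch

    # snapshot the parameters held by the spawning process before training
    params_before = copy.deepcopy(model.state_dict())

    trainer.fit(model)

    # compare against the reference the spawning process still holds; with ddp
    # every tensor comes back identical, i.e. the reference was never updated
    unchanged = all(torch.equal(params_before[k].cpu(), v.cpu())
                    for k, v in model.state_dict().items())
    print('parameters unchanged after fit():', unchanged)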

Borda commented 4 years ago

Hello, could you please edit the description - what is the problem, what is happening, and what is the expected behaviour - rather than just pasting a link to a script... Thx

sneiman commented 4 years ago

Sorry - I did fill this out properly and did not paste a link - not sure what happened. Will rewrite now.

sneiman commented 4 years ago

I simplified the code that produces this - I wanted to eliminate some things I was only doing in the model on gpu 0, which cluttered the code.

I ran this both in dp and ddp.

dp works fine. ddp trains fine - including ongoing statistics - but it cannot run trainer.test(), as it thinks the model is None, and it cannot run trainer.test(model) either, as that resubmits the entire job and apparently loses the trained parameters in the restart - the test results look like a random network.

I have attached the model for your perusal. I have included a utility module 'sn_utils.py' which has some screen messaging functions, and I did not include the dataset. If you want to try to run it, happy to send it also.

@williamFalcon mentioned in 746 changing the backend to dp and running test that way. Happy to try it - how do I change the backend for test only? trainer.use_dp=True?

gaze_03c_ddp.py.txt sn_utils.py.txt

sneiman commented 4 years ago

I did try changing the backend directly after training with

trainer.use_ddp = False
trainer.use_dp = True

trainer.test() fails as originally described. trainer.test(model) fails with:

    train_sampler       = torch.utils.data.distributed.DistributedSampler(self.data.trn_dataset, num_replicas=self.num_gpus)
  File "/home/seth/.local/lib/python3.6/site-packages/torch/utils/data/distributed.py", line 34, in __init__
    rank = dist.get_rank()
  File "/home/seth/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 564, in get_rank
    _check_default_pg()
  File "/home/seth/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 193, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

I did try not using a DistributedSampler with the test data set - but as the traceback shows, the failure is on the training sampler. Again, it seems like the whole job is being restarted.
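
For reference, guarding the sampler so it is only used when a process group actually exists avoids that particular assertion, though not the restart. A sketch - the helper name and arguments here are only illustrative, not from the attached script:

    import torch.distributed as dist
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def make_train_loader(dataset, batch_size, num_gpus):
        # only build a DistributedSampler inside the ddp-spawned workers,
        # where a process group has been initialized; otherwise just shuffle
        if dist.is_available() and dist.is_initialized():
            sampler = DistributedSampler(dataset, num_replicas=num_gpus)
            return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
        return DataLoader(dataset, batch_size=batch_size, shuffle=True)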

Happy to try anything else you wish.

sneiman commented 4 years ago

I don't think the ddp model is saving checkpoints - which would explain the reloading issue, I expect. I have no indication of why checkpoints are not being saved - there is nothing in the default log path. I will continue to investigate.

sneiman commented 4 years ago

Pretty sure I've identified the cause of this.

I have been using the default checkpoint_callback, which fails as noted above. If I specify a model checkpoint explicitly -

        checkpoint_callback = ModelCheckpoint(filepath=log_path, verbose=True, monitor='val_loss', mode='min', prefix='gaze_03c_ddp_')

... essentially identical to the default as detailed in the docs - then things work as expected. trainer.test() still has no reference to the model, but trainer.test(model) successfully loads the checkpoint from the exact location passed to ModelCheckpoint() (not from deeper in the log structure, as is typical) and runs test() on that trained model. Yay!
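
For anyone following along, the working setup looks roughly like this - a sketch from my run, so the path, prefix, and the other Trainer arguments are just my own settings:

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint

    # explicit checkpoint callback, essentially identical to the documented default
    checkpoint_callback = ModelCheckpoint(filepath=log_path, verbose=True,
                                          monitor='val_loss', mode='min',
                                          prefix='gaze_03c_ddp_')

    trainer = Trainer(max_epochs=args.epochs, gpus=args.gpus,
                      distributed_backend='ddp', default_save_path=log_path,
                      checkpoint_callback=checkpoint_callback)

    trainer.fit(model)
    # trainer.test() still cannot find the model, but trainer.test(model)
    # reloads the checkpoint written by the explicit callback and runs test on it
    trainer.test(model)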

I suspect the root of this has been identified in https://github.com/PyTorchLightning/pytorch-lightning/issues/734.

Regarding @williamFalcon's question about testing - it seems like something is being missed with the default checkpoint_callback.

I think this should stay open until 734 is resolved and the tests cover the default checkpoint_callback. Not sure if this affects more than ddp.

williamFalcon commented 4 years ago

@sneiman good catch. Mind submitting a PR? Would love the fix

sneiman commented 4 years ago

Happy to dig in. I have a ton of experience building good code - but essentially none working on an open source project. So I will need a little guidance on how this is supposed to go ...

Do I simply work on the master, and then do a PR? Or do a PR and work on the fork?

And then what? I'm a noob in the protocol for this. A grey haired, long time programming noob! ;)

sneiman commented 4 years ago

This issue combines 2 items. The first is that the trainer declared by the user appears to lose track of the model when using ddp. The second is in 734 - losing track of where checkpoints go.

These combine into 2 kinds of 'failures' when calling trainer.test() under ddp: one where trainer.test() fails, and one where trainer.test(model) fails when using the default checkpoint_callback.

I have been working on the first. In essence the call sequence can't find a model that has test_step and test_end defined. I believe I know what's happening and would like some feedback before I submit a PR.

When using ddp, the useful versions of both the trainer and the model live in the spawned gpu processes. Nothing is gathered back to the original instantiations in the master process. They could be gathered back - but that essentially means copying the trainer and model from device_ids[0] back to the master process, so that trainer.test() behaves as expected.

It seems to me that any realistic version of this is effectively identical to simply calling trainer.test(model), which already works fine.

Of course, this is a fairly expensive proposition - basically everything gets flushed off the gpus and the job is restarted.

For my purposes, the only useful thing would be to avoid this restarting - which is easily accomplished by adding a flag to Trainer() to run test when training exits.

I have tried this - and it works fine. It uses the parameters of the last epoch - not necessarily the optimum.

So ... I don't have a suggestion for getting the right trainer and model back to the master process that is better than simply calling trainer.test(model), and I don't think we need one. I personally could use a flag to run trainer.test() after trainer.fit(model) to avoid all the restarting.
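
To make that concrete, the usage I have in mind would look something like this - note that run_test_after_fit is a hypothetical argument, not part of the current Trainer API:

    # run_test_after_fit is hypothetical, shown only to illustrate the proposal
    trainer = Trainer(max_epochs=args.epochs, gpus=args.gpus,
                      distributed_backend='ddp',
                      run_test_after_fit=True)   # would run test() inside the ddp
                                                 # processes when fit() finishes,
                                                 # avoiding a full job restart
    trainer.fit(model)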

Thoughts and criticisms welcome.

seth

PS I will look into the checkpoint location issue separately

Borda commented 4 years ago

Btw, please avoid using #number - it is automatically linked to the issue/PR of the same number, so it may make your comment misleading...

sneiman commented 4 years ago

got it. thx

sneiman commented 4 years ago

I have not been able to find a way to get a trained model reference back into the trainer.model property of the spawning process when using ddp. I have found a way to get a model with the right parameters back to the spawning process, but no way to get it into trainer.model without altering the user's code - and avoiding that is, I believe, the objective.
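
For anyone who needs the trained weights back in the spawning process, one workaround along these lines is to reload the saved checkpoint into the original model object. A sketch - best_ckpt_path stands in for whatever file your ModelCheckpoint wrote, and I am assuming the checkpoint stores the weights under 'state_dict', as it does on my install:

    import torch

    trainer.fit(model)

    # pull the weights saved by the checkpoint callback back into the
    # original model object held by the spawning process
    checkpoint = torch.load(best_ckpt_path, map_location='cpu')
    model.load_state_dict(checkpoint['state_dict'])
    model.eval()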

So closing this - if anyone has a good idea on how to do this, I will be happy to pursue it.