configuration of GPU machine for training?

weicheng113 commented 2 years ago

Hi Emma,

Thanks for sharing the detailed code implementation. I am studying your paper, which looks very interesting. May I ask what GPU machine configuration you used for training, and roughly how long it took to train the best TPC model on the eICU dataset? I am trying to train the model on the eICU dataset using an AWS ml.p3.2xlarge instance (NVIDIA V100, 16GB GPU memory). I noticed that GPU utilization is pretty low when I inspect it with 'nvidia-smi' (I set batch_size to 64, which occupies about 11GB of GPU memory). The GPU usage percentage fluctuates a lot, often dropping back to 0%, and most of the time it stays below 80%.

Thanks, Cheng

EmmaRocheteau commented 2 years ago

Hello! Thank you for your question :) I've just checked a couple of my experiment logs, and for the best_tpc model for the LoS task alone on eICU it took just under 12 hours to train. This was on one NVIDIA TITAN Xp GPU. I think the reason why it fluctuates on memory use is that things get transferred to the CPU from time to time for metric processing etc., but my memory is a little hazy. It is very possible (probable in fact!) that it is not the most efficient implementation it could be in terms of computation time! I really hope that's helpful. Let me know if there is anything else I can do to help you.

weicheng113 commented 2 years ago

Thanks Emma for the quick response. I am going to study your source code in detail. I can see there are a lot of comments, which are very helpful. So far my learning has gone quite smoothly.

The following training run used ml.g4dn.xlarge with 16GB. For 14 epochs, I guess it took less than 8 hours. But I got the error below in the final epoch. Do you know the reason for the error (I added some comments to tpc_model.py, so the line numbers will not match your original code)? If not, I will try to debug it myself. Thanks.

Experiment started.
Done epoch 0
Done epoch 1
Done epoch 2
Done epoch 3
Done epoch 5
Done epoch 7
Done epoch 8
Done epoch 9
Done epoch 10
Done epoch 11
Done epoch 13
Done epoch 14
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/trixi/experiment/experiment.py", line 90, in run
    self.validate(epoch=self._epoch_idx)

  File "/home/ec2-user/SageMaker/DL4H-Project/TPC-LoS-prediction/models/experiment_template.py", line 221, in validate
    self.test()

  File "/home/ec2-user/SageMaker/DL4H-Project/TPC-LoS-prediction/models/experiment_template.py", line 254, in test
    y_hat_los, y_hat_mort = self.model(padded, diagnoses, flat)

  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/ec2-user/SageMaker/DL4H-Project/TPC-LoS-prediction/models/tpc_model.py", line 626, in forward
    diagnoses_enc = self.relu(self.main_dropout(self.bn_diagnosis_encoder(self.diagnosis_encoder(diagnoses))))  # B * diagnosis_size

  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/ec2-user/SageMaker/DL4H-Project/TPC-LoS-prediction/models/tpc_model.py", line 78, in forward
    training=True, momentum=exponential_average_factor, eps=self.eps)  # set training to True so it calculates the norm of the batch

  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 2280, in batch_norm
    _verify_batch_size(input.size())

  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 2248, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))

ValueError('Expected more than 1 value per channel when training, got input size torch.Size([1, 64])',)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-f696031b3a88> in <module>
----> 1 run_best_tpc()

~/SageMaker/DL4H-Project/TPC-LoS-prediction/main_tpc.py in run_best_tpc()
     52               base_dir=log_folder_path,
     53               explogger_kwargs={'folder_format': '%Y-%m-%d_%H%M%S{run_number}'})
---> 54     tpc.run()
     55 
     56 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/trixi/experiment/experiment.py in run(self, setup)
    106             self._exp_state = "Error"
    107             self._time_end = time.strftime("%y-%m-%d_%H:%M:%S", time.localtime(time.time()))
--> 108             self.process_err(e)
    109 
    110     def run_test(self, setup=True):

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/trixi/experiment/pytorchexperiment.py in process_err(self, e)
    389             self.elog.text_logger.log_to("\n".join(traceback.format_tb(e.__traceback__)), "err")
    390             self.elog.text_logger.log_to(repr(e), "err")
--> 391         raise e
    392 
    393     def update_attributes(self, var_dict, ignore=()):

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/trixi/experiment/experiment.py in run(self, setup)
     88             while self._epoch_idx < self.n_epochs and not self.__stop:
     89                 self.train(epoch=self._epoch_idx)
---> 90                 self.validate(epoch=self._epoch_idx)
     91                 self._end_epoch_internal(epoch=self._epoch_idx)
     92                 self._epoch_idx += 1

~/SageMaker/DL4H-Project/TPC-LoS-prediction/models/experiment_template.py in validate(self, epoch, mort_pred_time)
    219 
    220         elif self.config.mode == 'test' and epoch == self.n_epochs - 1:
--> 221             self.test()
    222 
    223         if epoch == self.n_epochs - 1 and self.config.save_results_csv:

~/SageMaker/DL4H-Project/TPC-LoS-prediction/models/experiment_template.py in test(self, mort_pred_time)
    252                 padded, mask, diagnoses, flat, los_labels, mort_labels, seq_lengths = batch
    253 
--> 254             y_hat_los, y_hat_mort = self.model(padded, diagnoses, flat)
    255             loss = self.model.loss(y_hat_los, y_hat_mort, los_labels, mort_labels, mask, seq_lengths, self.device,
    256                                    self.config.sum_losses, self.config.loss)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/SageMaker/DL4H-Project/TPC-LoS-prediction/models/tpc_model.py in forward(self, X, diagnoses, flat, time_before_pred)
    624                                      next_X[:, :, time_before_pred:].permute(0, 2, 1).contiguous().view(B * (T - time_before_pred), -1)), dim=1)  # (B * (T - time_before_pred)) * (((F + Zt) * (1 + Y)) + no_flat_features) for tpc
    625         else:
--> 626             diagnoses_enc = self.relu(self.main_dropout(self.bn_diagnosis_encoder(self.diagnosis_encoder(diagnoses))))  # B * diagnosis_size
    627             combined_features = cat((flat.repeat_interleave(T - time_before_pred, dim=0),  # (B * (T - time_before_pred)) * no_flat_features
    628                                      diagnoses_enc.repeat_interleave(T - time_before_pred, dim=0),  # (B * (T - time_before_pred)) * diagnosis_size

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~/SageMaker/DL4H-Project/TPC-LoS-prediction/models/tpc_model.py in forward(self, input)
     76         return F.batch_norm(
     77             input, self.running_mean, self.running_var, self.weight, self.bias,
---> 78             training=True, momentum=exponential_average_factor, eps=self.eps)  # set training to True so it calculates the norm of the batch
     79 
     80 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/functional.py in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
   2278         )
   2279     if training:
-> 2280         _verify_batch_size(input.size())
   2281 
   2282     return torch.batch_norm(

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/functional.py in _verify_batch_size(size)
   2246         size_prods *= size[i + 2]
   2247     if size_prods == 1:
-> 2248         raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
   2249 
   2250 

ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 64])

weicheng113 commented 2 years ago

By the way, I guess the experiment does not support multiple GPUs, right? I don't see any distributed-training code or use of pytorch-lightning. Thanks.

EmmaRocheteau commented 2 years ago

Hi! Yeah sorry I didn't work multiple GPUs into the code. I was mainly focused on proving whether the architecture was useful :') hopefully with reliable enough code to make it through however many runs I needed to get the results. I'm sorry I'm not familiar with that error, although I feel like I've seen something similar while I've worked on other projects since. Is the version of pytorch the same as the one I used (1.5)?

Also note that you've trained 15 epochs, not 14, since Python counts from zero (epochs 0 to 14)!

weicheng113 commented 2 years ago

Thanks a lot for your time and reply. It is possible that I used a different version of pytorch, as I was trying to use the latest versions of the libraries. I will have a look.

weicheng113 commented 2 years ago

Hi Emma,

Just to let you know, the above error occurred because I used a different (bigger) batch_size, so the last batch may contain only one sample, on which batch normalization cannot be performed. I have fixed it.

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/functional.py in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
   2278         )
   2279     if training:
-> 2280         _verify_batch_size(input.size())
   2281 
   2282     return torch.batch_norm(

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/nn/functional.py in _verify_batch_size(size)
   2246         size_prods *= size[i + 2]
   2247     if size_prods == 1:
-> 2248         raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
   2249 
   2250 

ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 64])
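
For readers who hit the same message: in training mode, batch normalization normalizes each channel using the mean and variance computed across the batch, so a batch holding a single sample has no spread to normalize by, and PyTorch refuses it. The traceback above also shows that this model's batch norm layer passes training=True unconditionally, which would explain why the check fires during test() as well. The sketch below is plain Python rather than PyTorch, and `batch_norm_train` is a hypothetical name; it only illustrates the per-channel computation and the guard, without the learned scale and shift.

```python
from typing import List


def batch_norm_train(batch: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize each channel with statistics taken across the batch.

    `batch` has shape [batch_size][num_channels]. This mirrors the
    training-mode computation only (no running averages, no affine terms).
    """
    n = len(batch)
    if n < 2:
        # Same situation as the ValueError in the traceback: one sample
        # gives a variance of zero for every channel.
        raise ValueError("Expected more than 1 value per channel when training")

    channels = list(zip(*batch))  # regroup values by channel
    means = [sum(c) / n for c in channels]
    variances = [
        sum((v - m) ** 2 for v in c) / n  # biased variance across the batch
        for c, m in zip(channels, means)
    ]
    return [
        [(v - m) / (var + eps) ** 0.5 for v, m, var in zip(row, means, variances)]
        for row in batch
    ]
```

With two or more samples each channel ends up roughly zero-mean and unit-variance; with one sample the function raises instead of dividing by a near-zero standard deviation.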

I also modified the data generator to use the pytorch dataloader, so that data preprocessing runs in separate processes. This allows multiple workers for data preprocessing and gives better GPU utilization. With this change, I reduced eICU training of the best TPC model for 15 epochs to less than 5 hours.
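
The speed-up described here comes from overlapping batch preparation with the model's forward and backward passes, so the GPU is not left idle while the next batch is built. In PyTorch this is what `torch.utils.data.DataLoader` offers through its `num_workers` argument, using worker processes. The sketch below is not Cheng's code and does not use PyTorch: it is a minimal standard-library illustration of the same producer/consumer idea with a single background thread, and `prefetch` is a hypothetical helper name.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

_SENTINEL = object()  # marks the end of the stream


def prefetch(batches: Iterable[T], buffer_size: int = 4) -> Iterator[T]:
    """Yield items from `batches` while a background thread prepares the next ones.

    Order is preserved. This is a sketch: if the consumer stops early, the
    daemon worker simply stays blocked until the program exits.
    """
    buffer: "queue.Queue" = queue.Queue(maxsize=buffer_size)

    def worker() -> None:
        try:
            for item in batches:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(_SENTINEL)  # always signal completion

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item
```

A training loop would iterate over `prefetch(batch_generator)` in place of the raw generator. Threads only help when preprocessing releases the GIL or waits on I/O, which is one reason the DataLoader uses processes instead.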

weicheng113 commented 2 years ago

By the way, the reason for using groups in the 1D convolution is to convolve each field together with its corresponding mask field, right? This is my first time seeing groups used, and it looks interesting. Thanks.

EmmaRocheteau commented 2 years ago

Ah brilliant! If someone else has this issue I will be sure to let them know. In response to your second question - yes, it's to pair the variables with their corresponding mask. It took quite a bit of fiddling about with and checking to ensure the pairing was correct. Thank you for your interest in the code and the project. I'm very happy you are using it!
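
To make the pairing concrete: the `groups` argument of PyTorch's `nn.Conv1d` splits the input channels into independent groups, and each output channel only sees the inputs of its own group. The sketch below is a plain-Python reimplementation of that behaviour (stride 1, no padding, no bias), not the TPC code. The interleaved channel order [feature0, mask0, feature1, mask1] is an assumption made for illustration; the thread does not say how the repository orders its channels.

```python
from typing import List

Signal = List[float]


def grouped_conv1d(x: List[Signal], weight: List[List[Signal]], groups: int) -> List[Signal]:
    """Grouped 1D cross-correlation.

    x:      [C_in][T] input channels
    weight: [C_out][C_in // groups][K] filters, same layout as nn.Conv1d
    Returns [C_out][T - K + 1].
    """
    c_in, c_out = len(x), len(weight)
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    k = len(weight[0][0])
    t_out = len(x[0]) - k + 1

    out = []
    for o in range(c_out):
        g = o // out_per_group  # group this output channel belongs to
        group_inputs = x[g * in_per_group:(g + 1) * in_per_group]
        row = []
        for t in range(t_out):
            acc = 0.0
            for channel, kernel in zip(group_inputs, weight[o]):
                for j in range(k):
                    acc += channel[t + j] * kernel[j]
            row.append(acc)
        out.append(row)
    return out


# Two features, each followed by its mask (assumed ordering).
x = [
    [1.0, 2.0, 3.0],     # feature 0
    [1.0, 1.0, 0.0],     # mask 0
    [10.0, 20.0, 30.0],  # feature 1
    [0.0, 1.0, 1.0],     # mask 1
]
# One output channel per (feature, mask) pair, kernel size 2.
weight = [
    [[1.0, 1.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
paired = grouped_conv1d(x, weight, groups=2)
```

With `groups=2`, output channel 0 depends only on feature 0 and mask 0, so changing feature 1 leaves it untouched. That isolation is the property that has to be checked when verifying that the pairing is correct.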

weicheng113 commented 2 years ago

Thanks for your confirmation and explanation, Emma. It is very interesting that the model performs better than the transformer. I will do a detailed study in the following weeks.

EmmaRocheteau commented 2 years ago

Yes in theory the transformer should have all the tools to do very well! And I think in the limit of infinite data it would outperform TPC because there is theoretically more capacity in the model. However specifically in the niche of big (but limited) EHR data the TPC model does very well :) Looking forward to seeing your results in the coming weeks!

weicheng113 commented 2 years ago

Hi Emma,

Sorry to bother you. Did you get the trixi browser to work? I got the following error and tried to find information in the trixi GitHub issues, but no one has mentioned it before.

  File "C:\Users\weich\Anaconda3\envs\dl4h-project\lib\site-packages\flask\templating.py", line 149, in render_template
    ctx.app.jinja_env.get_or_select_template(template_name_or_list),
  File "C:\Users\weich\Anaconda3\envs\dl4h-project\lib\site-packages\jinja2\environment.py", line 1071, in get_or_select_template
    return self.get_template(template_name_or_list, parent, globals)
  File "C:\Users\weich\Anaconda3\envs\dl4h-project\lib\site-packages\jinja2\environment.py", line 1000, in get_template
    return self._load_template(name, globals)
  File "C:\Users\weich\Anaconda3\envs\dl4h-project\lib\site-packages\jinja2\environment.py", line 959, in _load_template
    template = self.loader.load(self, name, self.make_globals(globals))
  File "C:\Users\weich\Anaconda3\envs\dl4h-project\lib\site-packages\jinja2\loaders.py", line 126, in load
    source, filename, uptodate = self.get_source(environment, name)
  File "C:\Users\weich\Anaconda3\envs\dl4h-project\lib\site-packages\flask\templating.py", line 59, in get_source
    return self._get_source_fast(environment, template)
  File "C:\Users\weich\Anaconda3\envs\dl4h-project\lib\site-packages\flask\templating.py", line 95, in _get_source_fast
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: overview.html
127.0.0.1 - - [17/Apr/2022 11:11:09] "GET / HTTP/1.1" 500 -

I tried to look into the related trixi source code. It seems to require some HTML pages (e.g., overview.html), but how do I get these HTML files?

# trixi.experiment_browser.browser.py

def register_url_routes(app, base_dir):
    app.add_url_rule("/", "overview", lambda: overview(base_dir), methods=["GET"])
    app.add_url_rule("/overview", "overview_", lambda: overview_(base_dir), methods=["GET"])
    app.add_url_rule('/experiment', "experiment", lambda: experiment(base_dir), methods=['GET'])
    app.add_url_rule('/combine', "combine", lambda: combine(base_dir), methods=['GET'])
    app.add_url_rule('/experiment_log', "experiment_log", lambda: experiment_log(base_dir), methods=['GET'])
    app.add_url_rule('/experiment_plots', "experiment_plots", lambda: experiment_plots(base_dir), methods=['GET'])
    app.add_url_rule('/experiment_remove', "experiment_remove", lambda: experiment_remove(base_dir), methods=['GET'])
    app.add_url_rule('/experiment_star', "experiment_star", lambda: experiment_star(base_dir), methods=['GET'])
    app.add_url_rule('/experiment_rename', "experiment_rename", lambda: experiment_rename(base_dir), methods=['GET'])

def overview(base_dir):
    try:
        base_info = process_base_dir(base_dir, ignore_keys=IGNORE_KEYS)
        base_info["title"] = base_dir
        return render_template("overview.html", **base_info)
    except Exception as e:
        print(e.__repr__())
        raise e
        abort(500)

Thanks

EmmaRocheteau commented 2 years ago

Sorry for the delay! There was a long weekend holiday in the UK and I went away. I checked my trixi installation and I definitely have html files there. I don't know if this is helpful but I will attach my trixi: trixi.zip

You can find them in experiment_browser/templates/

weicheng113 commented 2 years ago

Thank you very much, Emma. OK, I think I know the problem. I can now see the html files sitting under templates. I had modified the trixi browser to work on a Microsoft Windows machine, which is why it cannot find these template html files. I think trixi expects Linux and macOS users. I had to modify line 68 in browser.py to make it work on Windows.

def create_flask_app(base_dir):
  ...
  blueprint = Blueprint("data", __name__, static_url_path='/' + base_dir, static_folder=base_dir)
  # original code below:
  #  blueprint = Blueprint("data", __name__, static_url_path=base_dir, static_folder=base_dir)
  ...

With your hint, I think I should somehow be able to make it work. Thanks.
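
A note on the `static_url_path` change above: one likely reason it is needed (an assumption, not confirmed in this thread) is that URL rules have to start with `/`. An absolute Linux or macOS path already does, but a Windows path such as `C:\...` does not. The helper below is a hypothetical sketch of a more general normalization using only the standard library; `to_url_prefix` is not part of trixi or Flask.

```python
from pathlib import PureWindowsPath


def to_url_prefix(base_dir: str) -> str:
    """Turn a filesystem path into a URL prefix that starts with '/'.

    PureWindowsPath accepts both '\\' and '/' separators, so it handles
    Windows and POSIX-style input on any operating system.
    """
    posix_style = PureWindowsPath(base_dir).as_posix()  # backslashes -> slashes
    return posix_style if posix_style.startswith("/") else "/" + posix_style
```

For example, `to_url_prefix(r"C:\logs\exp")` gives `/C:/logs/exp`, while a path that already starts with `/` is returned unchanged.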

EmmaRocheteau commented 2 years ago

Sorry about all the teething issues you are having! I will try to be very responsive to any further issues

weicheng113 commented 2 years ago

Thanks Emma. Appreciate your time. There is a lot to learn from your paper. You can reply when you have time.

Zetmas commented 2 years ago

> Just let you know the above error was because I had different batch_size(bigger batch_size) and the last batch may only contains one sample, which cannot perform batch normalization. I have fixed it.

Hello Cheng @weicheng113, would you mind sharing how you fixed this problem? I ran into the same batch size issue. Thanks!

weicheng113 commented 2 years ago

Hi @Zetmas, yes, you just drop the last batch, as it may not be a full batch and there is a chance that it contains only 1 sample. I used the pytorch dataloader to drop the last batch, but you can also drop it manually.
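
With the PyTorch `DataLoader`, dropping the incomplete final batch is the `drop_last=True` argument. For the manual route mentioned above, a minimal sketch in plain Python could look like the following; `iter_batches` is a hypothetical helper and not part of the repository's data generator.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(samples: Sequence[T], batch_size: int, drop_last: bool = True) -> Iterator[List[T]]:
    """Yield consecutive batches; optionally skip a final batch that is not full."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        batch = list(samples[start:start + batch_size])
        if drop_last and len(batch) < batch_size:
            return  # only the final batch can be short
        yield batch


# 129 samples with batch_size 64 leave a single leftover sample.
sizes_kept = [len(b) for b in iter_batches(list(range(129)), 64, drop_last=False)]
sizes_dropped = [len(b) for b in iter_batches(list(range(129)), 64)]
# sizes_kept    -> [64, 64, 1]
# sizes_dropped -> [64, 64]
```

Keep in mind that dropping a batch during validation or testing excludes those samples from the reported metrics, so choosing a batch size that does not leave one leftover sample is an alternative.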

EmmaRocheteau / TPC-LoS-prediction

configuration of GPU machine for training? #4