esl-epfl / TEE4EHR


[Bug] No runs in TEE4EHR_unsupervised wandb projects #1

Closed tunajaw closed 5 months ago

tunajaw commented 6 months ago

Hi, I'm currently trying to train on the P12 dataset with the raindrop setting by following the instructions in the readme file:

python Main.py  -data ./dataset/p12/ -setting raindrop -split 0 -demo -data_label multilabel -wandb -wandb_project TEEDAM_supervised -event_enc 1 -state -mod ml -next_mark 1 -mark_detach 1 -sample_label 1 -user_prefix [H70--TEDA__pp_ml-concat] -time_enc concat -wandb_tag RD75

with my own wandb account. However, I got an error:

Traceback (most recent call last):                                                                                                                        
  File "/home/tunajaw/TEE4EHR/Main.py", line 1493, in <module>                                                                                            
    main()                                                                                                                                                
  File "/home/tunajaw/TEE4EHR/Main.py", line 1427, in main                                                                                                
    q = (df['dataset'] == opt.dataset) & (df['setting'] == opt.setting) & (df['INPUT'] == opt.INPUT) & (                                                  
  File "/home/tunajaw/anaconda3/envs/TEE4EHR/lib/python3.9/site-packages/pandas/core/frame.py", line 3807, in __getitem__                                 
    indexer = self.columns.get_loc(key)
  File "/home/tunajaw/anaconda3/envs/TEE4EHR/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'dataset'

After checking, I found a run in my TEE4EHR_supervised project, while TEE4EHR_unsupervised had none.

I also tried changing -wandb_project to TEEDAM_unspervised. This time there was indeed a run in TEEDAM_unspervised, but I still got the same error. The generated project.csv showed that the dataframe was empty, that is,

,summary,config,name,path

(When keeping the original setting, project.csv contained the same empty dataframe.)
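The KeyError follows directly from that empty project.csv: a header-only CSV yields a zero-row DataFrame whose columns do not include 'dataset', so the boolean filter in Main.py fails. A minimal sketch of the failure and a hypothetical guard (not the repo's actual code):

```python
import io
import pandas as pd

# Simulate a project.csv containing only the header row shown above,
# i.e. no wandb runs were exported.
csv_text = ",summary,config,name,path\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.empty)                 # True: header only, zero rows
print("dataset" in df.columns)  # False: the column the filter needs is missing

# Hypothetical defensive check before building the mask used in Main.py;
# df["dataset"] on this frame would raise KeyError: 'dataset'.
if not df.empty and "dataset" in df.columns:
    q = df["dataset"] == "p12"
else:
    q = None  # e.g. treat as "no previous runs found"
```

This suggests the script assumes the wandb project already contains at least one run with a 'dataset' config key.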

Do you know how I can solve this issue? Thanks a lot!

Some modifications so far in Main.py:

tunajaw commented 6 months ago

Hi, after checking the code, I found it runs when modified as follows:

I'm not sure whether these modifications break anything; if possible, could you take a quick look at them? Thanks a lot!

hojjatkarami commented 5 months ago

Hi @tunajaw.

Thank you for debugging our code. I have fixed the bugs and updated the repo. Could you explain the second bug in valid_epoch()?

Did you manage to train on the P12 dataset?

python Main.py -batch_size 128 -lr 0.01 -weight_decay 0.1 -w_pos_label 0.5 -w_sample_label 100 -w_time 1 -w_event 1 -data /mlodata1/hokarami/tedam/p12/ -setting raindrop -split 0 -demo -data_label multilabel -epoch 50 -per 100 -ES_pat 100 -wandb -wandb_project TEEDAM_supervised -event_enc 0 -state -mod none -next_mark 1 -mark_detach 1 -sample_label 1 -user_prefix [H70--DA__base-concat] -time_enc concat -wandb_tag RD75

This works for me without any problem. Let me know if it works for you.

Kind Regards,

tunajaw commented 5 months ago

Hi @hojjatkarami ,

The code hits the bug below when it does not skip the calculations after the CIF decoder for batches where enc_out.shape[1] < 2:

  - (Testing)   :   0%|          | 0/299 [00:00<?, ?it/s]

  - (Testing)   :  51%|█████     | 152/299 [00:02<00:01, 75.95it/s]

Traceback (most recent call last):
  File "/home/tunajaw/TEE4EHR/Main.py", line 1510, in <module>
    main()
  File "/home/tunajaw/TEE4EHR/Main.py", line 1484, in main
    train(model, opt.trainloader, opt.validloader,
  File "/home/tunajaw/TEE4EHR/Main.py", line 725, in train
    valid_event, valid_type, valid_time, dict_metrics_valid = valid_epoch(
  File "/home/tunajaw/TEE4EHR/Main.py", line 443, in valid_epoch
    log_sum, integral_ = model.event_decoder(
  File "/home/tunajaw/anaconda3/envs/TEE4EHR/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tunajaw/anaconda3/envs/TEE4EHR/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tunajaw/TEE4EHR/transformer/Modules.py", line 248, in forward
    if p.max() > 0.999:
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
wandb: - 0.022 MB of 0.022 MB uploaded
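The crash can be reproduced in isolation: calling max() with no dim argument on a tensor with zero elements is undefined in PyTorch. A minimal sketch, assuming p collapses to shape [4, 0, 25] as described later in this thread (the guard is a workaround suggestion, not the authors' fix):

```python
import torch

# p ends up empty along dim 1 when a batch contains sequences of length 1,
# so there are zero inter-event intervals for the CIF decoder to score.
p = torch.rand(4, 0, 25)

try:
    p.max()  # overall max of an empty tensor is undefined
except RuntimeError as e:
    print(e)  # max(): Expected reduction dim to be specified for input.numel() == 0. ...

# Hypothetical guard: only run the check when p is non-empty.
if p.numel() > 0 and p.max() > 0.999:
    ...
```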

And here is my bash script to run the code (based on the Quick Start part of the readme file):

data_path=./dataset/p12/
wandb_project=TEEDAM_supervised
wandb_tag=RD70
user_prefix=original
supervised_tag=RD75

python Main.py  -data $data_path -setting raindrop -split 0 -demo -data_label multilabel -wandb -wandb_project $wandb_project -event_enc 1 -state -mod ml -next_mark 1 -mark_detach 1 -sample_label 1 -user_prefix $user_prefix -time_enc concat -wandb_tag $wandb_tag > run.log 2>&1

While the training and testing flow for both the unsupervised and supervised tasks now runs without bugs, I'm still a little confused about how the Quick Start part is meant to be used.

hojjatkarami commented 5 months ago

Thank you @tunajaw for being attentive. I have updated the code.

mshavliuk commented 3 months ago

I still get the exact same error as in https://github.com/esl-epfl/TEE4EHR/issues/1#issuecomment-2096241268 even though I'm on eac26d73ba00dc96ab4cae7714e36fadac8d3778 (which I guess should have fixed some errors)

My command:

python Main.py  -data ./data/p19/ -setting raindrop -split 0 -demo -data_label multilabel -wandb -wandb_project TEEDAM_supervised -event_enc 1 -state -mod ml -next_mark 1 -mark_detach 1 -sample_label 1 -user_prefix [H70--TEDA__pp_ml-concat] -time_enc concat -wandb_tag RD75

Error:

Traceback (most recent call last):
  File "/home/user/projects/tee4ehr/Main.py", line 1493, in <module>
    main()
  File "/home/user/projects/tee4ehr/Main.py", line 1467, in main
    train(model, opt.trainloader, opt.validloader,
  File "/home/user/projects/tee4ehr/Main.py", line 691, in train
    train_event, train_type, train_time, dict_metrics_train = train_epoch(
  File "/home/user/projects/tee4ehr/Main.py", line 297, in train_epoch
    log_sum, integral_ = model.event_decoder(
  File "/home/user/projects/tee4ehr/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/projects/tee4ehr/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/projects/tee4ehr/transformer/Modules.py", line 250, in forward
    if torch.max(p) > 0.999:
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

And here is the code fragment causing the error: https://github.com/esl-epfl/TEE4EHR/blob/eac26d73ba00dc96ab4cae7714e36fadac8d3778/transformer/Modules.py#L243-L259

To be honest, I don't understand the meaning of these operations, but under certain conditions p gets size torch.Size([4, 0, 25]). This happens when the second dimension of seq_times and seq_types has size 1: dt_seq = (seq_times[:, 1:] - seq_times[:, :-1]) * non_pad_mask[:, 1:] then produces an empty tensor, which in turn causes the error in torch.max.

So what does it mean when seq_times (aka event_time in train_epoch) has a second dimension of size 1? Shouldn't such sequences be filtered out of the training dataset?
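The shape collapse described above can be sketched directly; the length filter at the end is one possible mitigation (an assumption on my part, not the repo's actual behaviour):

```python
import torch

# With a sequence of length 1 there are no inter-event intervals,
# so slicing off the first/last step leaves dim 1 with size 0.
batch, seq_len = 4, 1
seq_times = torch.rand(batch, seq_len)
non_pad_mask = torch.ones(batch, seq_len)

dt_seq = (seq_times[:, 1:] - seq_times[:, :-1]) * non_pad_mask[:, 1:]
print(dt_seq.shape)  # torch.Size([4, 0]) -> downstream p becomes empty too

# Possible mitigation: drop length-1 sequences before batching, since
# they carry no interval information for the point-process loss.
lengths = torch.tensor([1, 3, 5, 1])
keep = lengths > 1
print(keep)  # tensor([False,  True,  True, False])
```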