bowang-lab / Graph-Mamba

Graph-Mamba: Towards Long-Range Graph Sequence Modelling with Selective State Spaces

No further output after `Start from epoch 0` #5

Closed BatchClayderman closed 1 month ago

BatchClayderman commented 4 months ago

Hi, guys. When I run `python main.py` on my Windows machine, it works well and exits normally. However, when I then run `python main.py --cfg configs/Mamba/peptides-func-EX.yaml wandb.use False`, it prints `Num parameters: 373018` and `Start from epoch 0` and then produces no further output for 3 hours. After debugging, I know it is executing `train_dict[cfg.train.mode](loggers, loaders, model, optimizer, scheduler)`. I am confused by this behaviour, and I have four guesses:

1) The training itself produces no output, so nothing more will appear on the console until it finishes.
2) The training does produce output, but one epoch takes so long that it looks stuck.
3) The training is running on the CPU.
4) The training is broken in some other way.

However, I cannot tell what is actually happening or how to handle this issue. Do you have any ideas? Thank you very much.
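To narrow down guesses 2) and 3), a minimal check like the following might help (the timing wrapper and its names are only illustrative, not part of the Graph-Mamba code):

```python
import time
import torch

# Check whether CUDA is visible; if it is not, training silently falls back to the CPU.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
# Inside main.py one could also print next(model.parameters()).device
# to see where the model actually lives.

# Hypothetical timing wrapper: wrap a single epoch call to see whether it
# finishes at all and how long it takes.
def time_one_epoch(run_epoch_fn, *args, **kwargs):
    start = time.time()
    result = run_epoch_fn(*args, **kwargs)
    print(f"One epoch took {time.time() - start:.1f} s")
    return result
```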

scatyf3 commented 1 month ago

Here is my output after "Num parameters" and "Start from epoch 0"; unfortunately, there is an error in the source code:

Num parameters: 373018
Start from epoch 0
From v0.10 an `'binary_*'`, `'multiclass_*'`, `'multilabel_*'` version now exist of each classification metric. Moving forward we recommend using these versions. This base metric will still work as it did prior to v0.10 until v0.11. From v0.11 the `task` argument introduced in this metric will be required and the general order of arguments may change, such that this metric will just function as an single entrypoint to calling the three specialized versions.
[... the above torchmetrics warning is repeated ten times in total ...]
Traceback (most recent call last):
  File "/path/to/Graph-Mamba/main.py", line 176, in <module>
    train_dict[cfg.train.mode](loggers, loaders, model, optimizer,
  File "/path/to/Graph-Mamba/graphgps/train/custom_train.py", line 223, in custom_train
    perf[0].append(loggers[0].write_epoch(cur_epoch))
  File "/path/to/Graph-Mamba/graphgps/logger.py", line 245, in write_epoch
    task_stats = self.classification_multilabel()
  File "/path/to/Graph-Mamba/graphgps/logger.py", line 146, in classification_multilabel
    'accuracy': reformat(acc(pred_score, true)),
  File "/path/to/Graph-Mamba/graphgps/metric_wrapper.py", line 324, in __call__
    return self.compute(preds, target)
  File "/path/to/Graph-Mamba/graphgps/metric_wrapper.py", line 312, in compute
    x = torch.stack(metric_val)  # PyTorch<=1.9
RuntimeError: stack expects a non-empty TensorList

Here is the code in `metric_wrapper.py` after line 309:

            # Average the metric
            # metric_val = torch.nanmean(torch.stack(metric_val))  # PyTorch1.10
            x = torch.stack(metric_val)  # PyTorch<=1.9
            metric_val = torch.div(torch.nansum(x),
                                   (~torch.isnan(x)).count_nonzero())

According to the comments, this code path only supports PyTorch <= 1.9, with the PyTorch 1.10 `torch.nanmean` variant commented out. But in requirements_conda.txt, torch's version is 1.13...
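For what it's worth, a minimal sketch of that nan-aware averaging, including the commented-out `torch.nanmean` path for PyTorch >= 1.10 and a guard for the empty list that triggers the `stack expects a non-empty TensorList` error (the guard and its NaN fallback are my own assumption, not a fix from the repository):

```python
import torch

def nan_mean_of_metrics(metric_val):
    """Average a list of per-target metric tensors, ignoring NaNs.

    Mirrors the averaging in graphgps/metric_wrapper.py; the empty-list
    guard is an assumption added here, not part of the original code.
    """
    if len(metric_val) == 0:
        # This is the case behind "stack expects a non-empty TensorList":
        # every per-target metric was filtered out, so there is nothing to stack.
        return torch.tensor(float("nan"))

    x = torch.stack(metric_val)
    if hasattr(torch, "nanmean"):  # PyTorch >= 1.10, e.g. the 1.13 in requirements_conda.txt
        return torch.nanmean(x)
    # PyTorch <= 1.9 fallback, as in the snippet above
    return torch.div(torch.nansum(x), (~torch.isnan(x)).count_nonzero())
```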

scatyf3 commented 1 month ago

UPDATE: this issue was helpful to me; maybe try changing torchmetrics to 0.9.3.
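For context, a quick check of which torchmetrics version is actually installed (the 0.9.3 pin comes from the linked issue; the version comparison below is only illustrative):

```python
import torchmetrics
from packaging import version

print("torchmetrics", torchmetrics.__version__)
# The repeated warnings in the log above come from the classification-metric
# changes that started in torchmetrics 0.10; 0.9.3 predates them.
if version.parse(torchmetrics.__version__) >= version.parse("0.10"):
    print("Installed version is >= 0.10; the suggestion above is to pin torchmetrics==0.9.3.")
```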

BatchClayderman commented 1 month ago

> UPDATE: this issue was helpful to me; maybe try changing torchmetrics to 0.9.3.

Thank you for your support. There is still no output with that change applied. :-( It is still confusing: my friends and many other people have followed my tutorial on configuring Windows for this project and have trained the model successfully, while I still fail, even though I wrote that tutorial and it received many stars. Anyway, I think I should give up trying to run this model on my Windows machine; there are probably some underlying problems with my setup. Since my friends and others have succeeded, the code here must be correct. Thank you again. I will close this issue. :-)