cloning repo and running TrAISformer.py in environment leads to error

flxshk commented 2 years ago

@dnguyengithub

I have cloned this repo and created the environment from the requirements.yml file. Now I would like to verify that the model runs with the data stored in this repo. As suggested in the README, I ran python trAISformer.py, but that leads to the following error:

python trAISformer.py
======= Directory to store trained models: ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/
Loading ./data/ct_dma/ct_dma_train.pkl...
10605 9144
Length: 9144
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_valid.pkl...
1481 1291
Length: 1291
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_test.pkl...
1593 1453
Length: 1453
Creating pytorch dataset...
2022-01-07 11:25:50,442 - models - number of parameters: 5.742055e+07
  0%|                                                                                                                                                                                       | 0/286 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "trAISformer.py", line 122, in <module>
    trainer.train()
  File "/home/jupyter/TrAISformer/trainers.py", line 255, in train
    run_epoch('Training',epoch=epoch)
  File "/home/jupyter/TrAISformer/trainers.py", line 169, in run_epoch
    masks = masks[:,:-1].to(cf.device)
NameError: name 'cf' is not defined

When adding the following code snippet from trAISformer.py to trainers.py, the model starts training.

import models, trainers, datasets, utils
from config_trAISformer import Config
cf = Config()
TB_LOG = cf.tb_log
if TB_LOG:
    from torch.utils.tensorboard import SummaryWriter
    tb = SummaryWriter()

However, after the first epoch I get the following error, which I don't know how to resolve. Are there any suggestions on how to overcome this so that I can train the model just as it is in the GitHub repo?

python trAISformer.py
======= Directory to store trained models: ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/
Loading ./data/ct_dma/ct_dma_train.pkl...
10605 9144
Length: 9144
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_valid.pkl...
1481 1291
Length: 1291
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_test.pkl...
1593 1453
Length: 1453
Creating pytorch dataset...
2022-01-07 11:34:27,677 - models - number of parameters: 5.742055e+07
epoch 1 iter 285: loss 4.37507. lr 5.993205e-04: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 286/286 [02:17<00:00,  2.08it/s]
2022-01-07 11:36:47,478 - root - Training, epoch 1, loss 9.08754, lr 5.993205e-04.
2022-01-07 11:36:54,077 - root - Valid, epoch 1, loss 4.33956.
2022-01-07 11:36:54,078 - root - Best epoch: 001, saving model to ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/model.pt
Traceback (most recent call last):
  File "trAISformer.py", line 122, in <module>
    trainer.train()
  File "/home/jupyter/TrAISformer/trainers.py", line 280, in train
    seqs, masks, seqlens, mmsis, time_starts =  iter(aisdls["test"]).next()
NameError: name 'aisdls' is not defined

Thanks!

dnguyengithub commented 2 years ago

Hello,

It seems like you got a variable scope problem. I've updated the code, could you try the latest version?
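
For readers following along, here is a minimal sketch of how this kind of NameError can arise. It is not the repository's code: the module name trainers_demo, the run_epoch body and the tiny Config class are hypothetical stand-ins. It only illustrates that Python globals are per module, so a variable created in the main script is not visible inside functions defined in another module.

```python
import types

# Hypothetical stand-in for a module like trainers.py: its function reads
# a global named `cf` that the module itself never defines or imports.
trainers_src = '''
def run_epoch():
    return cf.device
'''

trainers = types.ModuleType("trainers_demo")
exec(trainers_src, trainers.__dict__)


# The "main script" defines cf, but only in its own namespace.
class Config:
    device = "cpu"


cf = Config()

try:
    trainers.run_epoch()
    error_message = None
except NameError as exc:
    error_message = str(exc)

print(error_message)  # name 'cf' is not defined
```

The function looks up cf in the globals of the module where it was defined, not in the namespace of the script that calls it, which is why defining cf in the main script alone does not help.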

flxshk commented 2 years ago

Hi, thanks for taking swift action. The initial problem was solved. There is a new problem:

python trAISformer.py
======= Directory to store trained models: ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/
Loading ./data/ct_dma/ct_dma_train.pkl...
10605 9144
Length: 9144
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_valid.pkl...
1481 1291
Length: 1291
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_test.pkl...
1593 1453
Length: 1453
Creating pytorch dataset...
2022-01-07 13:50:55,638 - models - number of parameters: 5.742055e+07
epoch 1 iter 0: loss 19.52241. lr 5.999915e-04:   0%|                                                                                                                                       | 0/286 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "trAISformer.py", line 122, in <module>
    trainer.train()
  File "/home/jupyter/TrAISformer/trainers.py", line 255, in train
    run_epoch('Training',epoch=epoch)
  File "/home/jupyter/TrAISformer/trainers.py", line 218, in run_epoch
    if TB_LOG:
NameError: name 'TB_LOG' is not defined

This may be fixed by adding the following section to the trainers.py file, but that's just me guessing:

from config_trAISformer import Config
cf = Config()
TB_LOG = cf.tb_log
if TB_LOG:
    from torch.utils.tensorboard import SummaryWriter
    tb = SummaryWriter()

Then, if I run the program again, I get an error in the plotting section. It looks like a naming error, since I cannot find where the variable is defined.

python trAISformer.py
======= Directory to store trained models: ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/
Loading ./data/ct_dma/ct_dma_train.pkl...
10605 9144
Length: 9144
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_valid.pkl...
1481 1291
Length: 1291
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_test.pkl...
1593 1453
Length: 1453
Creating pytorch dataset...
2022-01-07 13:52:44,800 - models - number of parameters: 5.742055e+07
epoch 1 iter 285: loss 4.37507. lr 5.993205e-04: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 286/286 [02:13<00:00,  2.14it/s]
2022-01-07 13:55:00,555 - root - Training, epoch 1, loss 9.08754, lr 5.993205e-04.
2022-01-07 13:55:07,031 - root - Valid, epoch 1, loss 4.33956.
2022-01-07 13:55:07,032 - root - Best epoch: 001, saving model to ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/model.pt
Traceback (most recent call last):
  File "trAISformer.py", line 122, in <module>
    trainer.train()
  File "/home/jupyter/TrAISformer/trainers.py", line 277, in train
    seqs, masks, seqlens, mmsis, time_starts =  iter(aisdls["test"]).next()
NameError: name 'aisdls' is not defined

Thanks!

dnguyengithub commented 2 years ago

Apparently, it's a variable scope problem and not related to the code itself. Which OS, env and compiler are you using? We tested the code on an Ubuntu 18.04 LTS machine, using Anaconda 4.9.1 and we cannot reproduce your error. I'm afraid that there is very little that we can do for you.

flxshk commented 2 years ago

Thanks, I will have a closer look by myself. Appreciate your support.

I am using a Jupyter instance on Google Cloud. OS: Debian GNU/Linux 10.

gcc -v
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu

Batene commented 2 years ago

I have encountered the same problems.

I fixed the aisdls problem by passing aisdls to Trainer:

in trAISformer.py:

    ## Trainer
    #===============================
    trainer = trainers.Trainer(
        model, aisdatasets["train"], aisdatasets["valid"], aisdls, cf, savedir=cf.savedir, device=cf.device)

in trainers.py:

class Trainer:

    def __init__(self, model, train_dataset, test_dataset, aisdls, config, savedir=None, device=torch.device("cpu")):
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.config = config
        self.savedir = savedir

        self.device = device
        self.model = model.to(device)
        self.aisdls = aisdls

Furthermore, I had to add the following two imports in trainers.py:

import matplotlib.pyplot as plt
import os
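
To illustrate the idea behind this fix, here is a small self-contained sketch. It is an assumption-laden toy, not the repository's Trainer: the Config fields, the first_test_batch and logging_enabled helpers, and the list-based "dataloader" are all hypothetical. The point is that the trainer reads everything it needs from self.config and self.aisdls instead of relying on globals such as cf, TB_LOG or aisdls from the main script.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Hypothetical stand-in for config_trAISformer.Config.
    device: str = "cpu"
    tb_log: bool = False


class Trainer:
    def __init__(self, model, train_dataset, test_dataset, aisdls, config, savedir=None):
        self.model = model
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.aisdls = aisdls      # dict of dataloaders, passed in explicitly
        self.config = config      # replaces the global `cf`
        self.savedir = savedir

    def first_test_batch(self):
        # The built-in next() works on any iterator, so this does not
        # depend on the iterator exposing a .next() method.
        return next(iter(self.aisdls["test"]))

    def logging_enabled(self):
        # Replaces the global TB_LOG flag.
        return self.config.tb_log


# Plain lists stand in for torch DataLoader objects in this sketch.
aisdls = {"test": [("seqs0", "masks0"), ("seqs1", "masks1")]}
trainer = Trainer(model=None, train_dataset=[], test_dataset=[],
                  aisdls=aisdls, config=Config())
batch = trainer.first_test_batch()
print(batch)
```

With this layout, trainers.py no longer needs to create its own Config instance, and the same Trainer works whether the script is run from a shell or from a notebook.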
