flxshk closed 2 years ago
Hello,
It seems you have a variable scope problem. I've updated the code; could you try the latest version?
Hi, thanks for taking swift action. The initial problem was solved. There is a new problem:
python trAISformer.py
======= Directory to store trained models: ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/
Loading ./data/ct_dma/ct_dma_train.pkl...
10605 9144
Length: 9144
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_valid.pkl...
1481 1291
Length: 1291
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_test.pkl...
1593 1453
Length: 1453
Creating pytorch dataset...
2022-01-07 13:50:55,638 - models - number of parameters: 5.742055e+07
epoch 1 iter 0: loss 19.52241. lr 5.999915e-04: 0%| | 0/286 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "trAISformer.py", line 122, in <module>
    trainer.train()
  File "/home/jupyter/TrAISformer/trainers.py", line 255, in train
    run_epoch('Training',epoch=epoch)
  File "/home/jupyter/TrAISformer/trainers.py", line 218, in run_epoch
    if TB_LOG:
NameError: name 'TB_LOG' is not defined
This may be fixed by adding the following section to the trainers.py file, but that's just a guess on my part:
from config_trAISformer import Config
cf = Config()
TB_LOG = cf.tb_log
if TB_LOG:
    from torch.utils.tensorboard import SummaryWriter
    tb = SummaryWriter()
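A variant of the same idea, as a minimal sketch only: build the writer from a config object inside a helper, so nothing depends on a module-level `TB_LOG`. `StubConfig` and `make_writer` are hypothetical names that do not exist in the repo; the only attribute assumed is `tb_log`, as used in the snippet above.

```python
from dataclasses import dataclass


@dataclass
class StubConfig:
    # Hypothetical stand-in for Config(); only the tb_log flag is modelled.
    tb_log: bool = False


def make_writer(cf):
    """Return a SummaryWriter when cf.tb_log is set, otherwise None."""
    if not cf.tb_log:
        return None
    # Imported lazily so TensorBoard is only required when logging is enabled.
    from torch.utils.tensorboard import SummaryWriter
    return SummaryWriter()


# With logging disabled no writer is created and torch is never imported.
writer = make_writer(StubConfig(tb_log=False))
```

Callers would then check `if writer is not None:` before logging, instead of testing a global flag.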
Then, if I run the program again, I get an error in the plotting section, which seems to be a naming error since I cannot find a reference to the variable name.
python trAISformer.py
======= Directory to store trained models: ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/
Loading ./data/ct_dma/ct_dma_train.pkl...
10605 9144
Length: 9144
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_valid.pkl...
1481 1291
Length: 1291
Creating pytorch dataset...
Loading ./data/ct_dma/ct_dma_test.pkl...
1593 1453
Length: 1453
Creating pytorch dataset...
2022-01-07 13:52:44,800 - models - number of parameters: 5.742055e+07
epoch 1 iter 285: loss 4.37507. lr 5.993205e-04: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 286/286 [02:13<00:00, 2.14it/s]
2022-01-07 13:55:00,555 - root - Training, epoch 1, loss 9.08754, lr 5.993205e-04.
2022-01-07 13:55:07,031 - root - Valid, epoch 1, loss 4.33956.
2022-01-07 13:55:07,032 - root - Best epoch: 001, saving model to ./results/ct_dma-pos-pos_vicinity-10-40-blur-True-False-2-1.0-data_size-250-270-30-72-embd_size-256-256-128-128-head-8-8-bs-32-lr-0.0006-seqlen-18-120/model.pt
Traceback (most recent call last):
  File "trAISformer.py", line 122, in <module>
    trainer.train()
  File "/home/jupyter/TrAISformer/trainers.py", line 277, in train
    seqs, masks, seqlens, mmsis, time_starts = iter(aisdls["test"]).next()
NameError: name 'aisdls' is not defined
Thanks!
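For context on both NameErrors, here is a small self-contained sketch of how Python resolves global names: a function looks them up in the module where it was defined, not in the script that calls it. `helper_module` is a hypothetical stand-in for an imported module such as trainers.py, built in memory so the example runs on its own.

```python
import types

# Hypothetical stand-in for an imported module such as trainers.py.
helper = types.ModuleType("helper_module")
exec(
    "def run_epoch():\n"
    "    # TB_LOG is looked up in helper_module's globals, not the caller's.\n"
    "    return TB_LOG\n",
    helper.__dict__,
)

# Defined only in the namespace of the calling script.
TB_LOG = True

try:
    helper.run_epoch()
    outcome = "no error"
except NameError as err:
    outcome = f"NameError: {err}"

# Defining the name inside the module that uses it resolves the lookup.
helper.TB_LOG = False
fixed_value = helper.run_epoch()

print(outcome)
print(fixed_value)
```

A name such as `TB_LOG` or `aisdls` that exists only in the entry script therefore has to be defined in, imported into, or passed to the module that uses it.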
Apparently, it's a variable scope problem and not related to the code itself. Which OS, env and compiler are you using? We tested the code on an Ubuntu 18.04 LTS machine, using Anaconda 4.9.1 and we cannot reproduce your error. I'm afraid that there is very little that we can do for you.
Thanks, I will have a closer look by myself. Appreciate your support.
I am using a Jupyter instance on Google cloud. OS Debian GNU/Linux 10
gcc -v COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu
I have encountered the same problems.
I fixed the aisdls problem by passing aisdls to Trainer.
In trAISformer.py:
## Trainer
#===============================
trainer = trainers.Trainer(
    model, aisdatasets["train"], aisdatasets["valid"], aisdls, cf, savedir=cf.savedir, device=cf.device)
In trainers.py:
class Trainer:
    def __init__(self, model, train_dataset, test_dataset, aisdls, config, savedir=None, device=torch.device("cpu")):
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.config = config
        self.savedir = savedir
        self.device = device
        self.model = model.to(device)
        self.aisdls = aisdls
Furthermore, I had to add the following two imports in trainers.py:
import matplotlib.pyplot as plt
import os
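The pattern above can be sketched in isolation as follows. This is a simplified stand-in and not the repo's actual Trainer: plain lists take the place of the torch data loaders, and `first_test_batch` is a hypothetical method name. The point is only that the loaders are stored on the instance and read back through `self.aisdls`, using the built-in `next()`, which works on any Python iterator.

```python
class Trainer:
    """Minimal stand-in; the real Trainer in trainers.py takes more arguments."""

    def __init__(self, model, aisdls):
        self.model = model
        # Keep the loaders on the instance instead of relying on a global.
        self.aisdls = aisdls

    def first_test_batch(self):
        # Read from self.aisdls rather than a module-level name.
        return next(iter(self.aisdls["test"]))


# Plain lists stand in for torch DataLoader objects here.
aisdls = {"train": [("seq_a", "mask_a")], "test": [("seq_b", "mask_b")]}
trainer = Trainer(model=None, aisdls=aisdls)
batch = trainer.first_test_batch()
print(batch)
```

With this change, any line inside the class that refers to the bare name `aisdls` would need to become `self.aisdls` as well.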
@dnguyengithub
I have cloned this repo and created the environment from the requirements.yml file. Now I would like to verify that the model runs with the data stored in this repo. As suggested in the README file, I ran
python trAISformer.py
but that leads to the following error:
When adding the following code snippet from trAISformer.py to trainers.py, the model starts training.
But after the first epoch, I get the following error, which I don't know how to resolve. Are there any suggestions on how to overcome this and train the model just as it is in the GitHub repo?
Thanks!