bzhangGo / sltunet

SLTUNET: A Simple Unified Model for Sign Language Translation (ICLR 2023)

Loading pretrained model error #5

Open EyjafjalIa opened 7 months ago

EyjafjalIa commented 7 months ago

Hey, I'm back again! For Step 3 (Train SLTUnet Model), I moved the required files into the two folders referenced in train.sh and ran train.sh. When the code reached the point of loading the pretrained model, I got the warnings below:

INFO:tensorflow:Trying restore pretrained parameters
WARNING:tensorflow:No Existing Model detected
INFO:tensorflow:Trying restore existing parameters
WARNING:tensorflow:No Existing Model detected

How can I load the pretrained model? Is the pretrained model the one trained in Step 2? Thanks! This is my train.sh:

data=preprocessed-corpus/
feature=smkd-sign-features/

python3 run.py --mode train --parameters=\
hidden_size=256,embed_size=256,filter_size=4096,\
sep_layer=0,num_encoder_layer=6,num_decoder_layer=6,\
ctc_enable=True,ctc_alpha=0.3,ctc_repeated=True,\
src_bpe_dropout=0.2,tgt_bpe_dropout=0.2,bpe_dropout_stochastic_rate=0.6,\
initializer="uniform_unit_scaling",initializer_gain=0.5,\
dropout=0.3,label_smooth=0.1,attention_dropout=0.3,relu_dropout=0.5,residual_dropout=0.4,\
max_len=256,max_img_len=512,batch_size=80,eval_batch_size=32,\
token_size=1600,batch_or_token='token',beam_size=8,remove_bpe=True,decode_alpha=1.0,\
scope_name="transformer",buffer_size=50000,data_leak_ratio=0.1,\
img_feature_size=1024,img_aug_size=11,\
clip_grad_norm=0.0,\
num_heads=4,\
process_num=2,\
lrate=1.0,\
estop_patience=100,\
warmup_steps=4000,\
epoches=5000,\
update_cycle=16,\
gpus=[0],\
disp_freq=1,\
eval_freq=500,\
sample_freq=100,\
checkpoints=5,\
best_checkpoints=10,\
max_training_steps=30000,\
nthreads=8,\
beta1=0.9,\
beta2=0.998,\
random_seed=1234,\
src_codes="$data/ende.bpe",tgt_codes="$data/ende.bpe",\
src_vocab_file="$data/vocab.zero.drop",\
tgt_vocab_file="$data/vocab.zero.drop",\
img_train_file="$feature/train.h5",\
src_train_file="$data/train.bpe.en.shuf",\
tgt_train_file="$data/train.bpe.de.shuf",\
img_dev_file="$feature/dev.h5",\
src_dev_file="$data/dev.bpe.en",\
tgt_dev_file="$data/dev.bpe.de",\
img_test_file="$feature/test.h5",\
src_test_file="$data/test.bpe.en",\
tgt_test_file="$data/test.bpe.de",\
output_dir="train",\
test_output="",\
shared_source_target_embedding=True,\

bzhangGo commented 7 months ago

Hey, the logging information is a little confusing here.

The "pretrained model" here doesn't mean the pretrained sign embeddings; it means a pretrained SLT model. So these warnings are normal and not a problem. More details below:

INFO:tensorflow:Trying restore pretrained parameters
WARNING:tensorflow:No Existing Model detected

It tries to restore a separately pretrained SLT model, e.g. pretrained encoders or decoders, which we never used.

INFO:tensorflow:Trying restore existing parameters
WARNING:tensorflow:No Existing Model detected

It tries to restore from the existing working directory. If your job gets interrupted, it should resume training from the working directory, i.e. output_dir.

EyjafjalIa commented 7 months ago

Oh, I'm sorry! Loading the pretrained model may not be the real problem. The original error seems to be related to the h5 file.

Traceback (most recent call last):
  File "/data1/wanjiarui/anaconda3/envs/slt/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/data1/wanjiarui/sltunet/utils/queuer.py", line 125, in run
    for data_chunk in self._data_chunk_iterable:
  File "/data1/wanjiarui/sltunet/data.py", line 201, in batcher
    for data in _handle_buffer(buffer):
  File "/data1/wanjiarui/sltunet/data.py", line 184, in _handle_buffer
    x, s, t, m, mask, spar, img_idx = self.to_matrix(batch, train)
  File "/data1/wanjiarui/sltunet/data.py", line 136, in to_matrix
    new_image = self.img_reader[img_key][()]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/data1/wanjiarui/anaconda3/envs/slt/lib/python3.6/site-packages/h5py/_hl/group.py", line 264, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '5142_8' doesn't exist)"

When I ran Step 2, command 4, I got dev.h5, test.h5, train.h5 and train_(0-9).h5 in smkd/features. I then combined the different training features and moved dev.h5, test.h5 and train.h5 to smkd-sign-features/, the path written in train.sh. Finally, I ran train.sh with the command below from the root path sltunet/ and got the KeyError above. I have run Step 2, command 4 and combined train.h5 twice, and both attempts ended with the KeyError. How can I track down the problem? I'm really confused. Thanks!

sh example/train.sh

bzhangGo commented 7 months ago

This can be checked by inspecting the source train file (src_train_file) and the resulting train.h5.

Could you please show a few lines of your train file, and also read train.h5 with h5py and check its keys? There might be a mismatch.

EyjafjalIa commented 7 months ago

After running the command below, I got dev.h5, test.h5, train.h5 and train_(0-9).h5 in sltunet/smkd/features:

python main.py --load-weights avg/average.pt --phase features --device 0 --num-feature-aug 10 --work-dir exp/resnet34 --config baseline.yaml

This is my sign_feature_cmb.py file. Should I combine train.h5 and train_(0-9).h5 into a new h5 file, or only combine train_(0-9).h5? I guess the line writer = h5py.File('train.h5', 'w') may overwrite train.h5, because I ran this Python script in the same directory as those h5 files, so I ended up losing some data.

import sys
import glob
import h5py

files = glob.glob(sys.argv[1])
print(files)
writer = h5py.File('train.h5', 'w')

for i, f in enumerate(files):
    reader = h5py.File(f, 'r')
    for key in list(reader.keys()):
        writer.create_dataset("%s_%s" % (key, i), data=reader[key][()])
    reader.close()

writer.close()

bzhangGo commented 7 months ago

Could you please list some keys from your train.h5? For example, 5142_8 is missing according to the error, so could you check which keys for 5142 are contained in your training data?
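
A minimal sketch of that check, assuming the combined file stores one top-level dataset per key of the form `<id>_<suffix>`, as the key 5142_8 in the error suggests. The helper names `group_keys` and `list_h5_keys` are hypothetical and not part of the repository:

```python
from collections import defaultdict


def group_keys(keys):
    """Group keys such as '5142_8' by the part before the last underscore."""
    groups = defaultdict(list)
    for key in keys:
        base, sep, suffix = key.rpartition("_")
        if not sep:  # key without an underscore: keep it whole
            base, suffix = key, ""
        groups[base].append(suffix)
    return {base: sorted(suffixes) for base, suffixes in groups.items()}


def list_h5_keys(path):
    """Return the top-level keys of an HDF5 file (requires h5py)."""
    import h5py  # imported lazily so group_keys can be used without h5py

    with h5py.File(path, "r") as reader:
        return list(reader.keys())


# Example with made-up keys; with a real file, pass list_h5_keys("train.h5") instead.
example = group_keys(["5142_0", "5142_3", "17_0"])
print(example.get("5142", []))  # ['0', '3']
```

If the list printed for 5142 lacks the suffix 8, the combined file is missing features that the data loader expects.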

EyjafjalIa commented 7 months ago

I have solved this error. It happens when I run sign_feature_cmb.py in the same directory as train.h5 and train_(0-9).h5; my directory layout is shown below. When the script reaches writer = h5py.File('train.h5', 'w'), it opens train.h5 in write mode, which truncates any existing train.h5 in the current directory before the new content is written. I changed the line to writer = h5py.File('train123.h5', 'w'). After the script finished, I moved the file to the right path and renamed it to train.h5. My script path is:

smkd/features
├── dev.h5
├── test.h5
└── train
    ├── sign_feature_cmb.py
    ├── train_0.h5
    ├── train_1.h5
    ├── train_2.h5
    ├── train_3.h5
    ├── train_4.h5
    ├── train_5.h5
    ├── train_6.h5
    ├── train_7.h5
    ├── train_8.h5
    ├── train_9.h5
    └── train.h5

1 directory, 12 files
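
The workaround above can also be written into the script itself. The following is only a sketch under assumptions, not the repository's script: the helper names `select_inputs` and `combine` are hypothetical, sorting the inputs is an addition for reproducible ordering, and whether the unaugmented train.h5 belongs among the inputs is left open, as it is in this thread. It keeps the `<key>_<file index>` naming of the script quoted earlier, but refuses to overwrite an existing output file and never reads the output file as an input.

```python
import glob
import os


def select_inputs(paths, output_path):
    """Drop the output file from the inputs so it is never read while being written."""
    out = os.path.abspath(output_path)
    return sorted(p for p in paths if os.path.abspath(p) != out)


def combine(pattern, output_path):
    """Merge every file matching `pattern` into `output_path` (requires h5py)."""
    import h5py  # imported lazily so select_inputs can be used without h5py

    if os.path.exists(output_path):
        raise FileExistsError("refusing to overwrite %s" % output_path)
    files = select_inputs(glob.glob(pattern), output_path)
    with h5py.File(output_path, "w") as writer:
        for i, f in enumerate(files):
            with h5py.File(f, "r") as reader:
                for key in reader.keys():
                    # Same "<key>_<file index>" naming as the original script.
                    writer.create_dataset("%s_%s" % (key, i), data=reader[key][()])
    return files
```

A call such as `combine("train*h5", "combined/train.h5")` would then write into a separate, already existing directory (the `combined/` name is just an example), so the extracted train.h5 is left untouched.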

EyjafjalIa commented 7 months ago

When I follow the instruction below from sltunet/example, I can't get a correctly combined train.h5 file, because after running Step 2 (extract sign features) I get the directory shown below.

python sign_feature_cmb.py train\*h5 

Directory after extraction:

smkd/features
├── dev.h5
├── test.h5
├── train_0.h5
├── train_1.h5
├── train_2.h5
├── train_3.h5
├── train_4.h5
├── train_5.h5
├── train_6.h5
├── train_7.h5
├── train_8.h5
├── train_9.h5
└── train.h5
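
To see why this layout is a problem, here is a small sketch of which file names the pattern train*h5 selects. It uses `fnmatch`, the same matching rules that `glob.glob` applies to file names. The narrower pattern `train_*.h5` is only a hypothetical alternative for comparison, not something the instructions prescribe.

```python
import fnmatch

# File names taken from the directory listing above.
names = ["dev.h5", "test.h5", "train.h5"] + ["train_%d.h5" % i for i in range(10)]


def matching(pattern, candidates=names):
    """Return the candidates a glob-style pattern would select."""
    return fnmatch.filter(candidates, pattern)


all_train = matching("train*h5")    # pattern from the instruction
augmented = matching("train_*.h5")  # hypothetical narrower pattern

print(len(all_train), "train.h5" in all_train)  # 11 True
print(len(augmented), "train.h5" in augmented)  # 10 False
```

So with the instruction's pattern, the extracted train.h5 is both one of the inputs and the file the script opens for writing.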