KaijuML / rotowire-rg-metric

Code for the RG metric of Challenges in Data-to-Document Generation (Wiseman, Shieber, Rush; EMNLP 2017)

Embedding/Vocab size error when using pretrained models #2

Closed · yunhaoli1995 closed this issue 3 years ago

yunhaoli1995 commented 3 years ago

Hi, in order to evaluate the generated text with run.py, I had to create the sub-directory output, but I ran into a problem when running the following command:

python data_utils.py \
       -mode make_ie_data \
       -input_path rotowire/json \
       -output_fi rotowire/output/training-data.h5

An error occurs as follows:

Parsing train:   5%|██████▏                                                                                                         | 186/3398 [00:00<00:10, 316.21it/s]
['The', '76ers', '(', '0', '-', '2', ')', 'were', 'unable', 'to', 'claw', 'their', 'way', 'back', 'into', 'the', 'game', 'once', 'the', 'Hawks', 'started', 'putting', 'together', '100', 'scoring', 'runs', ',', 'they', 'were', 'also', 'unable', 'find', 'a', 'field', 'goal', 'themselves', ',', 'as', 'the', 'brutal', 'period', 'saw', 'two', 'four', 'minutes', 'field', 'goal', 'droughts', 'to', 'open', 'and', 'close', 'the', 'second', '.']
['two', 'four']
Parsing train:   6%|██████▎                                                                                                         | 192/3398 [00:00<00:10, 299.24it/s]
Traceback (most recent call last):
  File "/home/liyunhao.19950730/src/rotowire-rg-metric/data_utils.py", line 153, in extract_numbers
    sent_nums.append((i, i+j, text2num(" ".join(sent[i:i+j]))))
  File "/home/liyunhao.19950730/src/rotowire-rg-metric/text2num.py", line 380, in text2num
    raise NumberException("{!r} may not proceed "
text2num.NumberException: 'four' may not proceed 'two'

I downloaded the json files into the json sub-directory from the original GitHub repo of Challenges in Data-to-Document Generation. Do you have any idea what is going on?

Thanks

KaijuML commented 3 years ago

Hi,

Thanks for reporting this issue! Let me know if I'm mistaken, but it seems that the actual error is at line 157 due to the assert False statement?

In that case, I will push a fix that simply removes the assert statement causing the trouble. The line in question comes directly from the original repo, so I cannot be 100% sure why it's included. I can only assume it was either debugging code or simply never triggered (which is plausible, since none of the generated texts I've evaluated myself with this repo has ever hit it).

Note that this fix is also used by Ratish Puduppully's code in his fork, so I'm confident in applying this change here.
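For reference, the change roughly amounts to catching the NumberException and skipping the offending span instead of asserting. Here is a sketch of the idea (not the exact diff; the helper name is made up):

from text2num import text2num, NumberException  # modules from this repo

def append_number_span(sent, i, j, sent_nums):
    """Hypothetical helper: record the value of sent[i:i+j] if it parses as a
    number, and silently skip spans like ['two', 'four'] that text2num rejects."""
    try:
        sent_nums.append((i, i + j, text2num(" ".join(sent[i:i + j]))))
    except NumberException:
        # This branch used to print the sentence and hit `assert False`;
        # now the un-parseable span is simply ignored.
        pass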

Can you pull the updated code and let me know if this issue is resolved?

Thanks, Clément

yunhaoli1995 commented 3 years ago

Thanks for your reply, the issue no longer occurs after removing the assert statement. However, another issue comes up when I compute RG scores and generate the list of extracted records with the following command:

python run.py \
       --just-eval \
       --datafile $ROTOWIRE/output/training-data.h5 \
       --preddata $ROTOWIRE/output/prep_predictions.h5 \
       --eval-models $ROTOWIRE/models \
       --gpu 0 \
       --test \
       --ignore-idx 15 \
       --vocab-prefix $ROTOWIRE/output/training-data

The error is as follows:

Exception has occurred: IndexError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
index out of range in self

I went into debug mode and found that the embedding size of the model is smaller than some of the token ids: the embedding size of the model is 4934, while the vocab size is 5395. I think there is something wrong with the downloaded dataset. Can you tell me how you made the json directory?
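For context, this is the kind of mismatch behind the IndexError (a standalone PyTorch illustration using the sizes above, not the repo's actual model code):

import torch
import torch.nn as nn

# The pretrained extractor's word embedding covers 4934 tokens, but the newly
# built vocabulary has 5395 entries, so any id >= 4934 falls outside the table.
emb = nn.Embedding(num_embeddings=4934, embedding_dim=200)
emb(torch.tensor([4933]))   # highest id the embedding knows about: fine
emb(torch.tensor([5394]))   # id produced with the newer vocabulary:
                            # raises IndexError: index out of range in self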

KaijuML commented 3 years ago

I think I know what's happening and it is not due to a mistake on your end.

The models I shared were trained with an earlier version of the code; the code has since changed slightly, so the extracted vocabulary is no longer exactly the same. When I run the step that builds the training-data.h5 file, I get the same vocabulary size as you (5395), which differs from the older files I have locally.

I don't have time right now to re-train models with the updated version. I will train them as soon as possible and let you know once they are available. In the meantime, you can follow the instructions to train your own models.

If you can wait, you can expect the models to be available for download in the coming days.

Best, Clément

yunhaoli1995 commented 3 years ago

Thanks for your reply! Now I'm trying to train my own models.

yunhaoli1995 commented 3 years ago

When I evaluate with my own model, the same error occurs again:

Exception has occurred: IndexError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
index out of range in self

After debugging, I found that this time the error is related to the embedding size for entdist: the max entdist in the train set is 191, while the max entdist in the test set is 195. The max numdist of the test set is also larger than that of the train set. For now, my workaround is to manually add 10 to the embedding sizes for entdist and numdist before training the model:

nlabels = train['labels'].max().item() + 1
ent_dist_pad = train['entdists'].max() + 10
num_dist_pad = train['numdists'].max() + 10
word_pad = train['sents'].max() + 1
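# (the +10 gives headroom for entdist/numdist values that only show up in the
#  test split, so the distance embeddings are large enough at evaluation time)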

This code is from data.py.

KaijuML commented 3 years ago

I am seeing the same error. I need to find out which commit introduced the bug and fix it. It will take some time; I will let you know once everything is back to normal.

Thanks for your help in this!

KaijuML commented 3 years ago

I have found the origin of the bug; it is entirely my fault and was not introduced by any commit! It's fixed now, and you should be able to train and use your models without changing anything in the code.

If you are curious, refer to line 81 of the original code, where the order of operations is: clamp the test distances, then shift train/val/test; whereas I previously shifted train before clamping test. This resulted in a slight mismatch that was not always problematic, which is why I never noticed it before now.
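In pseudo-code, the correct order looks roughly like this (toy tensors and made-up variable names, not the actual code in data.py):

import torch

# Toy distances; the real values come from the entdists/numdists fields.
train_dists = torch.tensor([-5, 0, 191])
test_dists = torch.tensor([-7, 3, 195])

# 1) Clamp test to the range observed in train.
test_dists = test_dists.clamp(train_dists.min().item(), train_dists.max().item())

# 2) Shift every split by the same offset so indices start at 0; the embedding
#    size derived from the shifted train distances now also covers test.
offset = -train_dists.min().item()
train_ids = train_dists + offset
test_ids = test_dists + offset
assert test_ids.max() <= train_ids.max()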

Note that I have also made things more consistent across runs: previously, running data_utils.py to create the training data was not deterministic, and the labels in training-data.labels came out in a random order. Now, they should always be in the same order!

As a sanity check, you can verify that the first line of training-data.labels is None 1, that training-data.dict has 5395 lines, and that its line 5394 is Celtics 5394 (the last line should be UNK).
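For instance, something like this (assuming the files sit next to training-data.h5 in rotowire/output; adjust the paths to wherever data_utils.py wrote its output):

# Hypothetical sanity check of the generated vocabulary and label files.
with open("rotowire/output/training-data.labels") as f:
    print(f.readline().strip())   # expect: None 1

with open("rotowire/output/training-data.dict") as f:
    lines = f.read().splitlines()
print(len(lines))     # expect: 5395
print(lines[5393])    # line 5394 (1-indexed), expect: Celtics 5394
print(lines[-1])      # last line, expect the UNK entry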

The issue should be resolved, but I'm leaving this open until I find time to train models and upload them. Let me know if there are other issues (if there is an issue unrelated to this one, please open a new issue).

Thanks again, Clément

KaijuML commented 3 years ago

Hi,

I have trained 6 new models, and everything seems to be working fine on my end.

I am closing this issue, feel free to reopen if needed.

Have a nice day, Clément

yunhaoli1995 commented 3 years ago

Thank you so much!