KaijuML / dtt-multi-branch

Code for Controlling Hallucinations at Word Level in Data-to-Text Generation (C. Rebuffel, M. Roberti, L. Soulier, G. Scoutheeten, R. Cancelliere, P. Gallinari)
https://arxiv.org/abs/2102.02810

Error when running format weights script | broken download link | assertion error during the preprocessing step #5

Closed: juelap closed this issue 3 years ago

juelap commented 3 years ago

I am trying to run the format weights script and I get the following errors.

Initially, I can't download the file provided via the wget https://datacloud.di.unito.it/index.php/s/KPr9HnbMyNWqRdj/download command, because I get a 404 Not Found error.

Moreover, I changed the next command slightly so that it uses the preprocessed files in data/wikibio instead of data/download. The changed command looks like this: python3 data/format_weights.py --orig data/wikibio --dest train_weights.txt --strategy thresholds --thresholds 0.4 --normalize --weight_regularization 1 --eos_weights 1 0, and when I execute it, I get the error below:

Writting formatted wieghts to: /workspace/src/repos/dtt-multi-branch/train_weights.txt
Reading orig file. Can take up to a minute.
WARNING: path is data/wikibio but format is txt (by default).
Traceback (most recent call last):
  File "data/format_weights.py", line 240, in <module>
    args.orig, func=lambda x,y: (x, float(y)))
  File "data/format_weights.py", line 239, in <listcomp>
    sent for sent in TaggedFileIterable.from_filename(
  File "/workspace/src/repos/dtt-multi-branch/data/utils.py", line 124, in __getitem__
    return next(self)
  File "/workspace/src/repos/dtt-multi-branch/data/utils.py", line 134, in __next__
    return next(self._iterable)
  File "/opt/conda/lib/python3.7/site-packages/more_itertools/more.py", line 2670, in __next__
    item = next(self._source)
  File "/workspace/src/repos/dtt-multi-branch/data/utils.py", line 173, in read_file
    with open(path, mode='r', encoding='utf8') as f:
IsADirectoryError: [Errno 21] Is a directory: 'data/wikibio'

I am not sure what the next course of action should be.

KaijuML commented 3 years ago

Hi Juela,

Apologies for the broken link; we are investigating and will fix it ASAP.

Regarding your next issue: did you run the full list of commands in data/README.md? (You can ignore the last command, which optionally removes words based on their hallucination scores. It's only used to reproduce the experiment called stnd_filtered in the paper).

If you've run all the data commands, the last script, data/co_occurrence.py, should create a file called data/wikibio/train_h.txt, which contains the entire training set, one word per line, with an empty line separating training instances. Each word also has a hallucination score. The first few lines should look something like this:

walter  0
extra   0
is  0
a   0
german  0
award-winning   0.99803374201573
aerobatic   1
pilot   0.9795918367346939
,   0.9957763625735621
chief   0.9957763625735621
aircraft    0
designer    0

If this is the case, then you can run the following command:

python3 data/format_weights.py --orig data/wikibio/train_h.txt \
                               --dest train_weights.txt \
                               --strategy thresholds \
                               --thresholds 0.4 \
                               --normalize \
                               --weight_regularization 1 \
                               --eos_weights 1 0

You were right to replace the --orig value; you simply didn't point it at the file.
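
In case it helps to see what this step produces, here is a minimal sketch of the idea behind the thresholds strategy, written against the word/score format shown above. This is not the actual format_weights.py code: the real script also handles --normalize, --weight_regularization and --eos_weights, and its exact output format may differ; the function name and the one-line-of-weights-per-instance output are illustrative assumptions only.

# Illustrative sketch only, NOT the actual format_weights.py implementation:
# it ignores --normalize, --weight_regularization and --eos_weights, and the
# real output format may differ.

def sketch_threshold_weights(scores_path, weights_path, threshold=0.4):
    """Write one line of space-separated 0/1 weights per training instance:
    1 if the token's hallucination score is below the threshold, else 0."""
    with open(scores_path, encoding='utf8') as fin, \
         open(weights_path, mode='w', encoding='utf8') as fout:
        weights = []
        for line in fin:
            line = line.strip()
            if not line:  # blank line = end of a training instance
                fout.write(' '.join(weights) + '\n')
                weights = []
                continue
            _token, score = line.rsplit(maxsplit=1)
            weights.append('1' if float(score) < threshold else '0')
        if weights:  # last instance, in case the file has no trailing blank line
            fout.write(' '.join(weights) + '\n')

# e.g. sketch_threshold_weights('data/wikibio/train_h.txt', 'train_weights_sketch.txt')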

Hope this helps, I'm leaving the issue open until all is resolved (including the broken link).

Let me know, Clément

juelap commented 3 years ago

Hi Clément,

You are right. When I used the updated command python3 data/format_weights.py --orig data/wikibio/train_h.txt --dest train_weights.txt --strategy thresholds --thresholds 0.4 --normalize --weight_regularization 1 --eos_weights 1 0, I had no more issues. Thanks!

juelap commented 3 years ago

I now get an assertion error when running the preprocessing step before training through the python3 run_onmt.py --preprocess --config preprocess.cfg command. The error looks like this:

[2021-07-11 14:20:16,606 INFO] Extracting features...
[2021-07-11 14:20:16,606 INFO]  * number of source features: 3.
[2021-07-11 14:20:16,607 INFO]  * number of target features: 0.
[2021-07-11 14:20:16,607 INFO] Building `Fields` object...
[2021-07-11 14:20:16,607 INFO] Building & saving training data...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/workspace/src/repos/dtt-multi-branch/onmt/bin/preprocess.py", line 52, in process_one_shard
    assert len(src_shard) == len(tgt_shard)
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_onmt.py", line 22, in <module>
    preprocess(remaining_args)
  File "/workspace/src/repos/dtt-multi-branch/onmt/bin/preprocess.py", line 293, in main
    preprocess(opt)
  File "/workspace/src/repos/dtt-multi-branch/onmt/bin/preprocess.py", line 273, in preprocess
    'train', fields, src_reader, tgt_reader, align_reader, opt)
  File "/workspace/src/repos/dtt-multi-branch/onmt/bin/preprocess.py", line 200, in build_save_dataset
    for sub_counter in p.imap(func, shard_iter):
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
AssertionError

Do you know what the problem might be? Also, shall I open a new issue for this? Thanks :)

KaijuML commented 3 years ago

This error is most likely due to a difference between the number of input tables and output sentences.

The first two lines of preprocess.cfg are:

train_src: "data/wikibio/train_input.txt"
train_tgt: "data/wikibio/train_output.txt"

Can you check that those two files have the same number of lines? Ideally, the train_weights.txt file should also have the same number of lines.
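
For example, a quick way to check (paths taken from preprocess.cfg above and from the earlier format_weights step):

# Count the lines in the three files; they should all match.
for path in ('data/wikibio/train_input.txt',
             'data/wikibio/train_output.txt',
             'train_weights.txt'):
    with open(path, encoding='utf8') as f:
        print(path, sum(1 for _ in f))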

marco-roberti commented 3 years ago

Hi @juelap ,

The download link was not valid anymore, as I mistakenly set up an expiration date. I just updated the README.md file with a working (and unexpiring) link.

Best, Marco

juelap commented 3 years ago

This error is most likely due to a difference between the number of input tables and output sentences.

The first two lines of preprocess.cfg are:

train_src: "data/wikibio/train_input.txt"
train_tgt: "data/wikibio/train_output.txt"

Can you check that those two files have the same number of lines? Ideally, the train_weights.txt file should also have the same number of lines.

I checked, and it looks like train_output.txt and train_weights.txt have the same number of lines: 526,575. On the other hand, train_input.txt has one more line, 526,576, which is causing the error. I don't know why this is happening; it might be because I am using Python 3.7 and not 3.8. Anyway, I will fix it by removing/adding one line manually. Thanks for the help! :)

juelap commented 3 years ago

Hi @juelap ,

The download link was not valid anymore, as I mistakenly set up an expiration date. I just updated the README.md file with a working (and unexpiring) link.

Best, Marco

Thanks Marco :)

juelap commented 3 years ago

Hi guys,

After checking one more time, it looks like all the input files in wikibio (train_input.txt, valid_input.txt, and test_input.txt) have one more line than their respective output files. I resolved the assertion error by simply removing the last line from each input file, since it has no counterpart in the output files. I guess this happens because I am using Python 3.7, and it might also happen on lower versions. Since data/format_wikibio.py creates these files, something there probably needs to be changed to accommodate Python <3.8. Anyway, it is not a big deal :)

For anyone using Python 3.7 or lower: you can run this snippet, filling in the filename whose last line should be removed:

import os

file_path = '<filename>'  # write filename/filepath here
# Remove the last line in place (relies on sed, e.g. GNU sed on Linux)
os.system('sed -i "$ d" {0}'.format(file_path))
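
If sed is not available (or behaves differently, as on macOS), a pure-Python alternative along the same lines could be:

# Pure-Python alternative: rewrite the file in place without its last line.
file_path = '<filename>'  # write filename/filepath here

with open(file_path, mode='r+', encoding='utf8') as f:
    lines = f.readlines()
    f.seek(0)
    f.writelines(lines[:-1])
    f.truncate()
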
KaijuML commented 3 years ago

Hi Juela,

I have run all the commands once more on my side and did not encounter this extra-line issue. I actually don't see how it could be related to the Python version, but if you managed to make it work, all the better.

To be extra sure, I have the following size for each set:

I'm closing this issue now; let us know if (when?) another issue arises!