Closed juelap closed 3 years ago
Hi Juela,
Apologies for the broken link; we are investigating and will fix it ASAP.
Regarding your next issue: did you run the full list of commands in data/README.md? (You can ignore the last command, which optionally removes words based on their hallucination scores; it is only used to reproduce the experiment called stnd_filtered in the paper.)
If you've run all the data commands, the last script, data/co_occurrence.py, should create a file called data/wikibio/train_h.txt containing the entire training set, one word per line, with empty lines separating training instances. Each word also has a hallucination score. The first few lines should look something like this:
walter 0
extra 0
is 0
a 0
german 0
award-winning 0.99803374201573
aerobatic 1
pilot 0.9795918367346939
, 0.9957763625735621
chief 0.9957763625735621
aircraft 0
designer 0
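As a sanity check, a file in this shape can be parsed with a short throwaway sketch like the one below (a hypothetical helper, not part of the repo; it assumes the format above: one "word score" pair per line, blank lines between training instances, score as the last token so punctuation words still parse).

```python
def read_scored_instances(path):
    """Parse a train_h.txt-style file into a list of instances,
    each a list of (word, hallucination_score) tuples."""
    instances, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                       # blank line ends an instance
                if current:
                    instances.append(current)
                    current = []
                continue
            word, score = line.rsplit(" ", 1)  # score is the last token
            current.append((word, float(score)))
    if current:                                # file may not end with a blank line
        instances.append(current)
    return instances
```

Printing `len(read_scored_instances("data/wikibio/train_h.txt"))` should then match the number of training instances.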
If this is the case, then you can run the following command:
python3 data/format_weights.py --orig data/wikibio/train_h.txt \
--dest train_weights.txt \
--strategy thresholds \
--thresholds 0.4 \
--normalize \
--weight_regularization 1 \
--eos_weights 1 0
You were right to try replacing the --orig value; you simply didn't point it at the right file.
Hope this helps, I'm leaving the issue open until all is resolved (including the broken link).
Let me know, Clément
Hi Clément,
You are right. When I used the updated command python3 data/format_weights.py --orig data/wikibio/train_h.txt --dest train_weights.txt --strategy thresholds --thresholds 0.4 --normalize --weight_regularization 1 --eos_weights 1 0, I had no more issues. Thanks!
I now get an assertion error when running the preprocessing step before training, via the python3 run_onmt.py --preprocess --config preprocess.cfg command. The error looks like this:
[2021-07-11 14:20:16,606 INFO] Extracting features...
[2021-07-11 14:20:16,606 INFO] * number of source features: 3.
[2021-07-11 14:20:16,607 INFO] * number of target features: 0.
[2021-07-11 14:20:16,607 INFO] Building `Fields` object...
[2021-07-11 14:20:16,607 INFO] Building & saving training data...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/workspace/src/repos/dtt-multi-branch/onmt/bin/preprocess.py", line 52, in process_one_shard
assert len(src_shard) == len(tgt_shard)
AssertionError
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run_onmt.py", line 22, in <module>
preprocess(remaining_args)
File "/workspace/src/repos/dtt-multi-branch/onmt/bin/preprocess.py", line 293, in main
preprocess(opt)
File "/workspace/src/repos/dtt-multi-branch/onmt/bin/preprocess.py", line 273, in preprocess
'train', fields, src_reader, tgt_reader, align_reader, opt)
File "/workspace/src/repos/dtt-multi-branch/onmt/bin/preprocess.py", line 200, in build_save_dataset
for sub_counter in p.imap(func, shard_iter):
File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
AssertionError
Do you know what the problem might be? And, shall I open a new issue for this? Thanks :)
This error is most likely due to a difference between the number of input tables and output sentences.
The first two lines of preprocess.cfg are:
train_src: "data/wikibio/train_input.txt"
train_tgt: "data/wikibio/train_output.txt"
Can you check that those two files have the same number of lines? Ideally, the train_weights.txt file should also have the same number of lines.
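A quick throwaway check for this could look like the sketch below (the paths are taken from the preprocess.cfg snippet above; the train_weights.txt location is an assumption based on the format_weights.py command earlier in the thread).

```python
import os

def count_lines(path):
    """Count the lines in a text file without loading it into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# Paths from preprocess.cfg, plus the weights file produced by format_weights.py.
for name in ("data/wikibio/train_input.txt",
             "data/wikibio/train_output.txt",
             "train_weights.txt"):
    if os.path.exists(name):
        print(name, count_lines(name))
```

If the printed counts differ, that mismatch is what trips the `assert len(src_shard) == len(tgt_shard)` in preprocess.py.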
Hi @juelap ,
The download link was not valid anymore, as I mistakenly set up an expiration date. I just updated the README.md
file with a working (and unexpiring) link.
Best, Marco
I checked, and it looks like train_output.txt and train_weights.txt have the same number of lines: 526,575. On the other hand, train_input.txt has one more, 526,576, which is causing the error. I don't know why this is happening. It might be because I am using Python 3.7, and not 3.8. Anyway, I will fix it by removing/adding one line manually. Thanks for the help! :)
Thanks Marco :)
Hi guys,
After checking one more time, it looks like all the input files in wikibio (train_input.txt, valid_input.txt, and test_input.txt) have one more line than the respective output files. I resolved the assertion error by simply removing the last line from the input files, since it was not present in the output files. I guess this error happens because I am using Python 3.7, and it might also happen for lower versions. Since the format_wikibio.py script in data creates these files, something there probably needs to change to accommodate Python <3.8. Anyway, it is not a big deal :)
For anyone who is using Python 3.7 or lower, you can execute this script, filling in the filename whose last line should be removed:
import os
file_path = '<filename>'  # write the filename/filepath here
os.system('sed -i "$ d" {0}'.format(file_path))  # delete the file's last line in place
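The sed invocation above assumes GNU sed (on macOS/BSD, `sed -i` needs a backup-suffix argument). A portable pure-Python alternative might look like this sketch (a hypothetical helper, not part of the repo; it loads the whole file into memory, which is fine for these line-per-record files):

```python
def drop_last_line(path):
    """Portable alternative to `sed -i "$ d"`: rewrite the file
    in place without its last line."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines[:-1])
```

Calling `drop_last_line("data/wikibio/train_input.txt")` would then have the same effect as the sed command.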
Hi Juela,
I have run all the commands once more on my side and did not encounter this additional-line issue. I actually don't see how it could be related to the Python version, but if you managed to make it work, all the better.
To be extra sure, I have the following size for each set:
train=582647
valid=72830
test=72831
I'm closing this issue now; let us know if (when?) another issue arises!
I am trying to run the format weights script and I get the following errors.
Initially, I can't download the file given through the wget https://datacloud.di.unito.it/index.php/s/KPr9HnbMyNWqRdj/download command, because I get a 404 Not Found error.
Moreover, I thought of changing the next command a little so that it uses the preprocessed files in data/wikibio and not data/download. The changed command looks like this: python3 data/format_weights.py --orig data/wikibio --dest train_weights.txt --strategy thresholds --thresholds 0.4 --normalize --weight_regularization 1 --eos_weights 1 0 and when I execute it, I get the error below. I am not sure what the next course of action should be.