Thank you for using the code! I haven't tested particle filtering for a while, so I may have introduced some bugs while customizing the default beam search. Let me check and I'll update in a few days.
For the input format of valid.tokens, that is correct. I was using the dump_tokens.py script to get the token files for evaluation:
python ./scripts/dump_tokens.py val.trees > val.tokens
Replacement of rare tokens with unks (or handling of subwords) is done internally during preprocessing in beam_search.py.
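For reference, here is a minimal sketch of what that unk-replacement step typically looks like; the function, vocab, and "<unk>" symbol are assumptions for illustration, not the actual code in beam_search.py:

```python
# Hypothetical illustration of unk replacement during preprocessing;
# names and the unk symbol are assumed, not taken from beam_search.py.
def replace_rare_tokens(sentence, vocab, unk_symbol="<unk>"):
    """Map any token not in the training vocabulary to the unk symbol."""
    return [tok if tok in vocab else unk_symbol for tok in sentence.split()]

# Tokens unseen at training time become <unk> before beam search.
vocab = {"the", "cat", "sat"}
print(replace_rare_tokens("the cat sat on the mat", vocab))
# ['the', 'cat', 'sat', '<unk>', 'the', '<unk>']
```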
Sorry for the lack of updates on this for a long time. The problem was due to a bug in data loading, which has now been fixed. We didn't notice it because the problem does not occur when block_size is set larger than the total number of sentences (say, 100000), which is what we commonly use in our experiments.
In practice, inference is fastest when block_size is set to such a large value (exceeding the number of sentences). block_size determines the interval at which parsed sentences are dumped; when block_size = num_sentences, dumping occurs only once, at the end.
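To illustrate the role of block_size, here is a sketch of the general block-wise dumping pattern; the function and variable names are assumptions, not the actual implementation:

```python
# Sketch of block-wise dumping (illustrative only; names are assumed).
def parse_and_dump(sentences, block_size, parse_fn, dump_fn):
    buffer = []
    for sent in sentences:
        buffer.append(parse_fn(sent))
        if len(buffer) == block_size:   # a full block is ready -> write it out
            dump_fn(buffer)
            buffer = []
    if buffer:                          # flush any final partial block
        dump_fn(buffer)

# With block_size >= len(sentences), dump_fn is called exactly once, at the end.
```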
Thank you for the fix and the explanation about block_size -- I'll set it high in future experiments.
First of all, thank you for making this repo public! It is so clear and easy to get running, and I'm excited to experiment with it.
I have trained a model on PTB and am now trying to run the particle-filtering-based beam search on the validation set, using the command given in the README, just with a batch size of 10 to accommodate memory limitations. I consistently get an assertion error at line 201 in beam_search.py:
assert cur_block_size == args.block_size
The problem seems to be that the code divides some batches into sizes smaller than the batch size, so sometimes the cumulative block size exceeds args.block_size. I am not sure whether these small batches are intended behavior, or whether this is where the error is creeping in. I am also not entirely sure of the expected input format of valid.tokens -- I assumed it should be a text file with one sentence per line, whitespace-tokenized -- but if I'm wrong about that, it could conceivably be a source of errors. Thanks!
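To make the suspected failure mode concrete, here is a toy sketch with made-up batch sizes (not the actual beam_search.py code) showing how a cumulative counter can overshoot the block size when batches are uneven, so an equality assertion like the one at line 201 never holds:

```python
# Toy illustration with assumed numbers; not the actual beam_search.py logic.
block_size = 10
batch_sizes = [10, 6, 6, 8]   # suppose the loader splits some batches unevenly

cur_block_size = 0
for bs in batch_sizes:
    cur_block_size += bs
    if cur_block_size >= block_size:
        # With uneven batches the counter can land on 12 rather than exactly 10,
        # so `assert cur_block_size == block_size` would fail here.
        print(cur_block_size, cur_block_size == block_size)
        cur_block_size = 0
# Output: 10 True / 12 False
```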