aistairc / rnng-pytorch


Assert in beam_search.py always fails #2

Closed ekayen closed 3 years ago

ekayen commented 3 years ago

First of all, thank you for making this repo public! It is so clear and easy to get running, and I'm excited to experiment with it.

I have trained a model on PTB and am now trying to run the particle filtering-based beam search on the validation set, using the command given in the README, just with a batch size of 10 to accommodate memory limitations. I consistently get an assertion error at line 201 in beam_search.py:

assert cur_block_size == args.block_size

The problem seems to be that the code divides some batches into sizes smaller than the batch size, so the cumulative block size sometimes exceeds args.block_size. I'm not sure whether these smaller batches are intended behavior, or whether this is where the error is creeping in.
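To illustrate my hypothesis (a toy sketch, not the actual beam_search.py code):

# Toy sketch of the failure mode I suspect (not the actual code):
# if the loader yields a batch smaller than batch_size, the running
# count can step over block_size without ever equalling it exactly.
block_size = 100
batch_sizes = [10, 10, 7, 10, 10, 10, 10, 10, 10, 10, 10]  # one short batch

cur_block_size = 0
for size in batch_sizes:
    cur_block_size += size
    if cur_block_size >= block_size:
        assert cur_block_size == block_size  # fails here: 107 != 100
        cur_block_size = 0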

I'm also not entirely sure of the expected input format of valid.tokens -- I assumed it should be a text file with one whitespace-tokenized sentence per line -- but if that assumption is wrong, it could conceivably be the source of the errors.

Thanks!

hiroshinoji commented 3 years ago

Thank you for using the code! I haven't tested particle filtering for a while, so I may have introduced some bugs while customizing the default beam search. Let me check; I'll update in a few days.

For the input format of valid.tokens, your assumption is correct. I was using the dump_tokens script to get the token files for evaluation:

python ./scripts/dump_tokens.py val.trees > val.tokens

Replacement of rare tokens with unks (and handling of subwords) is done internally during preprocessing in beam_search.py.
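Just to illustrate the expected format, here is a rough sketch of what dump_tokens.py does (the actual script may handle more cases):

import re
import sys

# Rough sketch (not the actual dump_tokens.py): read bracketed trees,
# one per line, and print the whitespace-joined terminals, e.g.
#   (S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))  ->  The cat sleeps
with open(sys.argv[1]) as f:
    for line in f:
        tokens = re.findall(r'([^\s()]+)\)', line)
        if tokens:
            print(' '.join(tokens))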

hiroshinoji commented 3 years ago

Sorry for the long delay in updating this. The problem was due to a bug in data loading and has now been fixed. We didn't notice it because the problem does not occur when block_size is set larger than the total number of sentences (say, 100000), which is what we commonly use in our experiments.

In practice, inference is fastest when block_size is set to such a large value (exceeding the number of sentences). block_size determines how often parsed sentences are dumped to the output file. When block_size = num_sentences, dumping occurs only once, at the end.
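Schematically, the role of block_size is something like this (a simplified sketch; parse_batch and dump_trees are placeholders, not the actual functions in beam_search.py):

# Simplified sketch of how block_size is used: parses are buffered and
# written out each time the buffer reaches block_size, so
# block_size >= num_sentences means a single dump at the very end.
def run_inference(batches, block_size):
    block = []
    for batch in batches:
        block.extend(parse_batch(batch))  # placeholder for beam search / particle filtering
        if len(block) >= block_size:
            dump_trees(block)             # placeholder for writing parses to the output file
            block = []
    if block:
        dump_trees(block)                 # dump any remainder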

ekayen commented 3 years ago

Thank you for the fix and for the explanation about block_size -- I'll set it high in future experiments.