We need a way to limit how much memory the toolkit will use.
The only operation that uses a lot of memory is 'sort', and I believe it is only called in two places: when generating counts, and when generating the ARPA file.
In both cases we can control the memory usage by passing the --buffer-size=X option to 'sort', e.g. --buffer-size=10G.
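For concreteness, here is a minimal sketch of what a memory-capped 'sort' invocation might look like from Python (--buffer-size is a real GNU sort option; the file names here are made up for the example):

```python
import subprocess

# Minimal sketch: cap sort's memory with --buffer-size.
# 'counts.txt' and 'counts.sorted' are placeholder names.
subprocess.check_call(['sort', '--buffer-size=10G',
                       '-o', 'counts.sorted', 'counts.txt'])
```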
The tricky part is that we'd like to be able to pass a --max-memory=X option to the top-level scripts, such as train_lm.py, and have them just do the right thing, bearing in mind that some of the scripts may invoke 'sort' multiple times in parallel. In those cases the memory limit would have to be divided by the number of parallel jobs, e.g. changing 100G to 25G. [You can treat any letter at the end as an arbitrary string, but please don't assume there is one: a plain numeric argument should be treated as a number of bytes.]
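A sketch of how the division might work, under the assumptions in the bracketed note above (divide_memory is a hypothetical name, not an existing function in the toolkit):

```python
import re

def divide_memory(max_memory, num_jobs):
    """Divide a --max-memory value among num_jobs parallel 'sort'
    invocations, returning a string usable as sort's --buffer-size.
    Any single trailing letter is kept as an opaque suffix; a plain
    number is treated as a count of bytes.  (Hypothetical helper.)"""
    m = re.match(r'^(\d+)([a-zA-Z]?)$', max_memory)
    if m is None:
        raise ValueError('invalid --max-memory value: ' + max_memory)
    amount, suffix = int(m.group(1)), m.group(2)
    # Round down; using slightly less memory than requested is safe.
    return str(amount // num_jobs) + suffix

# e.g. divide_memory('100G', 4)     -> '25G'
#      divide_memory('1000000', 3)  -> '333333'  (plain bytes)
```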
@keli
Dan