danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

Max memory #39

Open danpovey opened 8 years ago

danpovey commented 8 years ago

We need a way to limit how much memory the toolkit will use. The only operation that uses a lot of memory is 'sort', and I believe it is only called in two places: when generating counts, and in ARPA generation. In both cases we can control it with the --buffer-size=X option to 'sort', e.g. --buffer-size=10G. The tricky part is that we'd like to be able to pass a --max-memory=X option to the top-level scripts, such as train_lm.py, and have them just do the right thing, bearing in mind that some of the scripts may invoke 'sort' multiple times in parallel. So in some cases this would involve dividing the memory limit by a certain number, e.g. changing 100G to 25G. [You can just treat any letter at the end as an arbitrary string. Please don't assume there is a letter, as a plain numeric argument can be treated as a number of bytes.]
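A minimal sketch of the division step described above, not the toolkit's actual implementation. The helper name `divide_memory` and its `num_parallel` parameter are hypothetical. It assumes the value is an integer optionally followed by a suffix, keeps the suffix as an opaque string, and uses integer division. It also assumes GNU sort, where an unsuffixed --buffer-size is read in units of 1024 bytes, so a plain number gets an explicit `b` (bytes) suffix to preserve the "number of bytes" meaning.

```python
import re


def divide_memory(max_memory, num_parallel):
    """Split a size such as '100G' evenly across num_parallel 'sort' jobs.

    Hypothetical helper: the trailing suffix is treated as an opaque string,
    and a value with no suffix is treated as a number of bytes.
    """
    if num_parallel < 1:
        raise ValueError("num_parallel must be at least 1")
    # Leading digits, then an optional non-numeric suffix (e.g. 'G', 'M', '%').
    match = re.fullmatch(r"(\d+)(\D*)", max_memory.strip())
    if match is None:
        raise ValueError("cannot parse memory size: {!r}".format(max_memory))
    amount = int(match.group(1))
    # Assumption: GNU sort reads an unsuffixed size as KiB, so use 'b' for bytes.
    suffix = match.group(2) or "b"
    per_job = amount // num_parallel
    if per_job == 0:
        raise ValueError("memory limit too small to split; use a smaller unit")
    return "{}{}".format(per_job, suffix)


# Build (but do not run) a sort command for one of four parallel jobs.
sort_cmd = ["sort", "--buffer-size=" + divide_memory("100G", 4)]
```

Here `sort_cmd` ends up as `['sort', '--buffer-size=25G']`. Because the division is integer division, a limit like 1G split four ways is rejected rather than rounded to zero; the caller would need to pass it in a smaller unit, e.g. 1024M.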

@keli

Dan