danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

train_lm.py usage #65

Open danpovey opened 8 years ago

danpovey commented 8 years ago

The usage message of train_lm.py (see below) does not agree with what the program actually does. The usage message suggests the output goes to lm_dir, but it goes to a subdirectory. I think you should rename lm_dir in the args to work_dir, and the usage message should explain where the actual lm_dir output will be located. There should also be an "epilog" provided to the usage message with an example usage, preferably a couple of example usages: one with a vocab and one with num-words specified. Also, you are using the 'basename' of the wordlist as part of the name of the lm_dir. What if the wordlist has a suffix, like foo.txt? Then foo.txt will become part of that name, which does not seem ideal to me. Maybe strip any final suffix.

-------
usage: train_lm.py [-h] [--wordlist WORDLIST] [--num-words NUM_WORDS] [--num-splits NUM_SPLITS] [--warm-start-ratio WARM_START_RATIO]
                   [--min-counts MIN_COUNTS] [--limit-unk-history {true,false}] [--fold-dev-into FOLD_DEV_INTO]
                   [--bypass-metaparameter-optimization BYPASS_METAPARAMETER_OPTIMIZATION] [--verbose {true,false}] [--cleanup {true,false}]
                   [--keep-int-data {true,false}] [--max-memory MAX_MEMORY]
                   text_dir order lm_dir

This script trains an n-gram language model with <order> from <text-dir> and writes out the model to <lm-dir>. The output model dir is in pocolm-
format, user can call format_arpa_lm.py with <lm-dir> to get a ARPA-format model. Pruning a model could be achieve by call prune_lm_dir.py with
<lm-dir>.
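
For illustration, an epilog of the kind requested above could be attached roughly as follows. This is only a sketch and not the actual train_lm.py code: the help strings, the example paths and the values in the epilog are made up, and only a few of the real options are reproduced.

```python
import argparse

# Hypothetical example commands; the paths and values are illustrative only.
EPILOG = """\
examples:
  train_lm.py --wordlist=data/wordlist data/text 3 data/work
  train_lm.py --num-words=20000 data/text 3 data/work
"""


def build_parser():
    parser = argparse.ArgumentParser(
        prog="train_lm.py",
        description="Trains an n-gram language model of the given order.",
        epilog=EPILOG,
        # Keep the line breaks of the epilog instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    # Help texts below are assumptions made for this sketch.
    parser.add_argument("--wordlist", help="file containing the vocabulary")
    parser.add_argument("--num-words", type=int,
                        help="number of words to select for the vocabulary")
    parser.add_argument("text_dir")
    parser.add_argument("order", type=int)
    parser.add_argument("work_dir")
    return parser


if __name__ == "__main__":
    build_parser().print_help()
```

With RawDescriptionHelpFormatter the examples are printed after the argument list exactly as written, which is usually what one wants for command lines.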

danpovey commented 8 years ago

Also, 'cleanuped' -> 'cleaned up'.

danpovey commented 8 years ago

How about adding a final optional 4th argument called lm_dir (the 3rd argument being 'work_dir'), so the user can specify where they want the final LM to be written? This will make life easier for callers, as they won't have to figure out where pocolm would put their stuff.
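
A minimal sketch of how such an optional trailing argument might look with argparse, assuming the positional names discussed here. The function name parse_dirs and the fallback subdirectory name "lm" are placeholders for this example and not what pocolm does.

```python
import argparse
import os


def parse_dirs(argv):
    parser = argparse.ArgumentParser(prog="train_lm.py")
    parser.add_argument("text_dir")
    parser.add_argument("order", type=int)
    parser.add_argument("work_dir")
    # nargs="?" makes this last positional optional; it is None when omitted.
    parser.add_argument("lm_dir", nargs="?", default=None)
    args = parser.parse_args(argv)
    if args.lm_dir is None:
        # Placeholder default: the real script would derive its own name here.
        args.lm_dir = os.path.join(args.work_dir, "lm")
    return args


if __name__ == "__main__":
    print(parse_dirs(["data/text", "3", "data/work"]).lm_dir)
    print(parse_dirs(["data/text", "3", "data/work", "data/lm_final"]).lm_dir)
```

Callers that pass the 4th argument always know where the final LM ends up; callers that omit it get whatever default the script chooses.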

wantee commented 8 years ago

OK, I will add the work_dir and lm_dir arguments. Regarding the wordlist name, I know it is not ideal, but I think we should not remove the final suffix if it is meaningful. For example, if we have 3 different wordlists named 'vocab.1', 'vocab.2' and 'vocab.3', they can't be distinguished once we remove the suffix.
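
The collision can be seen with a small sketch. The helper strip_final_suffix is hypothetical and simply assumes the suffix would be stripped with os.path.splitext.

```python
import os


def strip_final_suffix(path):
    """Return the basename of path with its last '.suffix' removed."""
    base = os.path.basename(path)
    stem, _suffix = os.path.splitext(base)
    return stem


names = ["vocab.1", "vocab.2", "vocab.3"]
stripped = [strip_final_suffix(n) for n in names]

# All three wordlists collapse to the same name once the suffix is dropped.
print(stripped)                            # ['vocab', 'vocab', 'vocab']
print(strip_final_suffix("data/foo.txt"))  # foo
```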

danpovey commented 8 years ago

OK, don't remove the suffix then. If people want control, they can add the lm_dir argument. Also, you won't have to create a subdirectory 'work' once you add the work_dir argument. Dan

wantee commented 8 years ago

Yes, of course.