The goal is to allow true cross-validation (see also #44, #54, #55), where learning is frozen by the time the algorithm is tested.
ag.py contains the following parameter:
```python
utils.Argument(
    short_name='-u', name='--test-file', type='file', default=None,
    help=('test strings to be parsed (but not trained on) '
          'every eval-every iterations, default is to test on input')),
```
Notice also the following commented section:
```python
# We ignore the following options because they conflict with the
# wordseg workflow (stdin > wordseg-cmd > stdout). In this AG
# wrapper the test2 file is ignored and the test1 is the input
# text sent to stdout.
#
# utils.Argument(
#     short_name='-X', name='--eval-parses-cmd', type='file',
#     help=('pipe each run\'s parses into this command '
#           '(empty line separates runs)')),
# utils.Argument(
#     short_name='-Y', name='--eval-grammar-cmd', type='file',
#     help=('pipe each run\'s grammar-rules into this command '
#           '(empty line separates runs)')),
# utils.Argument(
#     short_name='-U', name='--test1-eval', type='file',
#     help='parses of test1-file are piped into this command'),
# utils.Argument(
#     short_name='-v', name='--test2-file', type='file',
#     help=('test strings to be parsed (but not trained on) '
#           'every eval-every iterations')),
# utils.Argument(
#     short_name='-V', name='--test2-eval', type='file',
#     help='parses of test2-file are piped into this command')
```
Further below, in the wrapper of the C++ program, we have:
```python
def _segment_single(parse_counter, train_text, grammar_file,
                    category, ignore_first_parses, args,
                    test_text=None, tempdir=tempfile.gettempdir(),
                    log_level=logging.ERROR, log_name='wordseg-ag'):
    """Executes a single run of the AG program and postprocessing
```
For which the parameter:
```
test_text : sequence, optional
    If not None, the test text contains the list of utterances to
    segment on the model learned from train_text
```
And even more useful:
```python
def segment(train_text, grammar_file=None, category='Colloc0',
            args=DEFAULT_ARGS, test_text=None,
            save_grammar_to=None, ignore_first_parses=0,
            nruns=8, njobs=1, tempdir=tempfile.gettempdir(),
            log=utils.null_logger()):
    """Segment a text using the Adaptor Grammar algorithm
```
For which the parameter:
```
test_text : sequence, optional
    If not None, the list of utterances to segment using the model
    learned from text
```
What we should do then is:
- [ ] add an optional parameter for a test file distinct from the train file
- [ ] if the user passes only one file, use that file for both stages
- [ ] if the user passes both, use the train file for training and the test file for testing
- [ ] to this end, check that both files exist; if either does not, return an error
- [ ] make sure the `test_text` parameter is actually used in the call
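The checklist above could be sketched as follows. This is only an illustration: the helper name `resolve_train_test` and the overall structure are hypothetical, but the fallback behaviour (single file used for both stages) matches the current `--test-file` default described earlier.

```python
import os


def resolve_train_test(train_file, test_file=None):
    """Return (train_utterances, test_utterances) for the two AG stages.

    If test_file is None, the train file is reused for testing (the
    current default). Every given file must exist, otherwise we error
    out before launching the (expensive) AG run.
    """
    for path in (train_file, test_file):
        if path is not None and not os.path.isfile(path):
            raise RuntimeError('file not found: {}'.format(path))

    with open(train_file) as fin:
        train_text = [line.strip() for line in fin if line.strip()]

    if test_file is None:
        # single file given: use it for both training and testing
        return train_text, train_text

    with open(test_file) as fin:
        test_text = [line.strip() for line in fin if line.strip()]
    return train_text, test_text
```

The second list would then be forwarded as `segment(train_text, ..., test_text=test_text)`, so learning is frozen before the test utterances are parsed.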
Noted here for the record, though not immediately useful:
AG's README mentions the following potentially useful options:
```
-X eval-cmd   -- pipe each run's parses into this command (empty line separates runs)
-Y eval-cmd   -- pipe each run's grammar-rules into this command (empty line separates runs)
-x eval-every -- pipe trees into the eval-cmd every eval-every iterations
-u test1.yld  -- test strings to be parsed (but not trained on) every eval-every iterations
-U eval-cmd   -- parses of test1.yld are piped into this command
-v test2.yld  -- test strings to be parsed (but not trained on) every eval-every iterations
-V eval-cmd   -- parses of test2.yld are piped into this command
```
In the context of:
```
py-cfg [-d debug] [-A parses-file] [-C] [-D] [-E] [-F trace-file] [-G grammar-file]
       [-H] [-I] [-P] [-R nr] [-r rand-init] [-n niterations] [-N nanal-its]
       [-a a] [-b b] [-w weight] [-e pya-beta-a] [-f pya-beta-b]
       [-g pyb-gamma-s] [-h pyb-gamma-c] [-s train_frac] -S
       [-T anneal-temp-start] [-t anneal-temp-stop] [-m anneal-its]
       [-Z ztemp] [-z zits] [-x eval-every] [-X eval-cmd] [-Y eval-cmd]
       [-u test1.yld] [-U eval-cmd] [-v test2.yld] [-V eval-cmd]
       grammar.lt < train.yld
```
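Given that usage, the wrapper only needs to append `-u` when a separate test file is given. A minimal sketch of building the py-cfg argument list (the function name `build_pycfg_args` is hypothetical; only the `-u` flag and the positional grammar file come from the usage above):

```python
def build_pycfg_args(grammar_file, extra_args='', test_file=None):
    """Build the py-cfg command-line argument list.

    When test_file is given it is passed through -u, so its utterances
    are parsed every eval-every iterations but never trained on.
    """
    args = extra_args.split()
    if test_file is not None:
        args += ['-u', test_file]
    # the grammar file is the single positional argument of py-cfg
    args.append(grammar_file)
    return args
```

For example, `build_pycfg_args('grammar.lt', '-n 100', 'test.yld')` yields `['-n', '100', '-u', 'test.yld', 'grammar.lt']`; the train text keeps flowing through stdin as in the existing wordseg workflow.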