bootphon / wordseg

A Python toolbox for text based word segmentation
https://docs.cognitive-ml.fr/wordseg
GNU General Public License v3.0

allow different corpora for train and test: AG #56

alecristia closed 3 years ago

alecristia commented 4 years ago

The goal is to allow true cross-validation (see also #44, #54, #55), where learning is frozen by the time the algorithm is tested.
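To make the goal concrete, here is a minimal sketch of the kind of split this implies, in plain Python and independent of wordseg. The helper name `kfold_splits` is hypothetical; the only point is that each test fold is held out from the utterances used for training.

```python
def kfold_splits(utterances, k):
    """Yield (train, test) lists of utterances for each of k folds.

    Hypothetical helper, not part of wordseg: fold i holds out every
    k-th utterance starting at index i, the rest is used for training.
    """
    if not 1 < k <= len(utterances):
        raise ValueError('k must be in [2, len(utterances)]')
    for i in range(k):
        # held-out utterances for this fold
        test = utterances[i::k]
        # everything else is available for training
        train = [u for j, u in enumerate(utterances) if j % k != i]
        yield train, test
```

Each `(train, test)` pair would then be one train/test run, with the model learned on `train` only.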

ag.py contains the following parameter:

utils.Argument(
    short_name='-u', name='--test-file', type='file', default=None,
    help=('test strings to be parsed (but not trained on) '
          'every eval-every iterations, default is to test on input')),

Notice also the following commented-out section:

# We ignore the following options because they conflict with the
# wordseg workflow (stdin > wordseg-cmd > stdout). In this AG
# wrapper the test2 file is ignored and the test1 is the input
# text sent to stdout.
#
# utils.Argument(
#     short_name='-X', name='--eval-parses-cmd', type='file',
#     help=('pipe each run\'s parses into this command '
#           '(empty line separates runs)')),
# utils.Argument(
#     short_name='-Y', name='--eval-grammar-cmd', type='file',
#     help=('pipe each run\'s grammar-rules into this command '
#           '(empty line separates runs)')),
# utils.Argument(
#     short_name='-U', name='--test1-eval', type='file',
#     help='parses of test1-file are piped into this command'),
# utils.Argument(
#     short_name='-v', name='--test2-file', type='file',
#     help=('test strings to be parsed (but not trained on) '
#           'every eval-every iterations')),
# utils.Argument(
#     short_name='-V', name='--test2-eval', type='file',
#     help='parses of test2-file are piped into this command')

Further below, under the wrapper of the C++ program, we have:

def _segment_single(parse_counter, train_text, grammar_file, category,
                    ignore_first_parses, args, test_text=None,
                    tempdir=tempfile.gettempdir(), log_level=logging.ERROR,
                    log_name='wordseg-ag'):
    """Executes a single run of the AG program and postprocessing

With the following parameter:

test_text : sequence, optional
    If not None, the test text contains the list of utterances to
    segment on the model learned from train_text

And even more useful:

def segment(train_text, grammar_file=None, category='Colloc0',
            args=DEFAULT_ARGS, test_text=None, save_grammar_to=None,
            ignore_first_parses=0, nruns=8, njobs=1,
            tempdir=tempfile.gettempdir(), log=utils.null_logger()):
    """Segment a text using the Adaptor Grammar algorithm

With the following parameter:

test_text : sequence, optional
    If not None, the list of utterances to segment using the model
    learned from text

What we should do then is:

[ ] add an optional parameter for a test file different from the train file
[ ] if the user passes only one file, use that file for both stages
[ ] if the user passes both, use train for training and test for testing
[ ] to this end, check that both files exist, and return an error if not
[ ] make sure to use the parameter test_text in the call
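The checklist above could be sketched roughly as follows. This is not the wordseg implementation: the helper name `load_corpora` and the one-utterance-per-line file layout are assumptions made for illustration.

```python
import os


def load_corpora(train_file, test_file=None):
    """Return (train_text, test_text) as lists of utterances.

    Hypothetical helper illustrating the checklist; the name and the
    one-utterance-per-line file layout are assumptions.
    """
    # only one file given: use it for both stages
    if test_file is None:
        test_file = train_file

    # fail early if either file is missing
    for path in (train_file, test_file):
        if not os.path.isfile(path):
            raise FileNotFoundError(
                'corpus file not found: {}'.format(path))

    def read(path):
        # one utterance per line, empty lines ignored
        with open(path, encoding='utf-8') as f:
            return [line.strip() for line in f if line.strip()]

    return read(train_file), read(test_file)
```

The two lists would then be handed to the wrapper as `segment(train_text, test_text=test_text)`, relying on the `test_text` parameter quoted earlier.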

LOGGING HERE BUT NOT USEFUL NOW

AG's README mentions the following potentially useful options:

-X eval-cmd   -- pipe each run's parses into this command (empty line separates runs)
-Y eval-cmd   -- pipe each run's grammar-rules into this command (empty line separates runs)
-x eval-every -- pipe trees into the eval-cmd every eval-every iterations
-u test1.yld  -- test strings to be parsed (but not trained on) every eval-every iterations
-U eval-cmd   -- parses of test1.yld are piped into this command
-v test2.yld  -- test strings to be parsed (but not trained on) every eval-every iterations
-V eval-cmd   -- parses of test2.yld are piped into this command

In the context of:

py-cfg [-d debug] [-A parses-file] [-C] [-D] [-E] [-F trace-file]
       [-G grammar-file] [-H] [-I] [-P] [-R nr] [-r rand-init]
       [-n niterations] [-N nanal-its] [-a a] [-b b] [-w weight]
       [-e pya-beta-a] [-f pya-beta-b] [-g pyb-gamma-s] [-h pyb-gamma-c]
       [-s train_frac] -S [-T anneal-temp-start] [-t anneal-temp-stop]
       [-m anneal-its] [-Z ztemp] [-z zits] [-x eval-every] [-X eval-cmd]
       [-Y eval-cmd] [-u test1.yld] [-U eval-cmd] [-v test2.yld]
       [-V eval-cmd] grammar.lt < train.yld
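For illustration only, the relevant part of that command line could be assembled like this. The helper `build_ag_command` is hypothetical and not taken from ag.py; it uses only flags that appear in the usage string (`-n`, `-x`, `-u`) and assumes the training text is supplied on stdin, as the `< train.yld` redirection suggests.

```python
def build_ag_command(grammar_file, test_file=None, eval_every=None,
                     niterations=None, program='py-cfg'):
    """Return the py-cfg argument list; training text goes on stdin.

    Hypothetical helper: only flags listed in the usage string are used.
    """
    command = [program]
    if niterations is not None:
        command += ['-n', str(niterations)]
    if eval_every is not None:
        command += ['-x', str(eval_every)]
    if test_file is not None:
        # strings to be parsed but not trained on
        command += ['-u', test_file]
    # the grammar file is the last positional argument
    command.append(grammar_file)
    return command
```

Leaving `test_file` unset simply omits `-u`, which matches the documented default of testing on the input.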