bootphon / wordseg

A Python toolbox for text based word segmentation
https://docs.cognitive-ml.fr/wordseg
GNU General Public License v3.0

allow different corpora for train and test: DiBS (see also #44) #54

Closed alecristia closed 3 years ago

alecristia commented 4 years ago

The goal is to allow true cross validation (see also #44), where learning is frozen by the time the algo is tested.

For DiBS, this was contemplated originally but I suspect the option was deleted when simplifying. Here is the current code:

    # ensure the train file exists
    if not os.path.isfile(args.train_file):
        raise ValueError(
                'train file does not exist: {}'.format(args.train_file))

    # load train and test texts, ignore empty lines
    train_text = codecs.open(args.train_file, 'r', encoding='utf8')
    train_text = (line for line in train_text if line)
    test_text = (line for line in streamin if line)

    # train the model (learn diphone statistics)
    dibs_summary = CorpusSummary(
        train_text, separator=separator, level=args.unit, log=log)

    # segment the test text on the trained model
    output = segment(
        test_text,
        dibs_summary,
        type=args.type,
        threshold=args.threshold,
        pwb=args.pboundary,
        log=log)

    # output the segmented text
    streamout.write('\n'.join(output) + '\n')

And the old code is: original_dibs.zip

What we should do then is:

- [ ] add an optional parameter for test file different from train file
- [ ] if user passes only one file, then use that file for both stages
- [ ] if user passes both, then use train for train, test for test
- [ ] to this end, check that they both exist, if not return error
- [ ] and change these lines:

        train_text = codecs.open(args.train_file, 'r', encoding='utf8')
        train_text = (line for line in train_text if line)
        test_text = (line for line in streamin if line)

mmmaat commented 4 years ago

Hi Alex, actually for DiBS the train text must be in phonologized form (i.e. with the tags), whereas the test file must be in prepared form (without tags). So we cannot do what you specify: "if user passes only one file, then use that file for both stages".

Is it actually sufficient? If not, we can do something like "if user passes only train file, remove tags and use it for testing", but we cannot do the reverse.
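To make the "remove tags and use it for testing" idea concrete, here is a minimal sketch. It assumes the tags are the `;esyll` and `;eword` boundary markers. The helper names `strip_tags` and `derive_test_text` are hypothetical and not part of wordseg.

```python
import re

# Assumption: the train file marks boundaries with these tag strings.
# Adjust them if the corpus was phonologized with other separators.
SYLLABLE_TAG = ';esyll'
WORD_TAG = ';eword'


def strip_tags(line, tags=(WORD_TAG, SYLLABLE_TAG)):
    """Return `line` with boundary tags removed and whitespace normalized."""
    for tag in tags:
        line = line.replace(tag, ' ')
    # collapse runs of whitespace left behind by the removed tags
    return re.sub(r'\s+', ' ', line).strip()


def derive_test_text(train_lines):
    """Yield untagged, non-empty utterances derived from tagged train lines."""
    for line in train_lines:
        stripped = strip_tags(line)
        if stripped:
            yield stripped
```

For example, `strip_tags('hh ih r ;eword w iy ;eword')` gives `'hh ih r w iy'`. The reverse direction is not possible, since the word boundaries cannot be recovered once the tags are gone.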

alecristia commented 4 years ago

Yes, I understand the problem: DiBS has an optimal version for which it needs the train file with tags. Hence their use of train and test -- which differ in the presence of tags.

"if user passes only train file, remove tags and use it for testing" is the right thing to do -- what I said is incorrect.

Would this work:

- [ ] add an optional parameter that indicates that train and test files are not related
- [ ] if user passes only train file, remove tags and use the version without tags for testing
- [ ] if user passes two files and the parameter indicating that train and test are not related, then use the first file for train, the second for test

to this end,

- [ ] check that both files exist, if not return error
- [ ] check that the second file does NOT contain tags, if it does return error
- [ ] always check that the first or only file contains tags, if not return error
- [ ] and change these lines:

        train_text = codecs.open(args.train_file, 'r', encoding='utf8')
        train_text = (line for line in train_text if line)
        test_text = (line for line in streamin if line)
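The checks in this list could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual wordseg code: `resolve_texts`, `load_lines` and `has_tags` are hypothetical names, the word tag is assumed to be `;eword`, and the "files are not related" parameter is simplified to "a second file was given". The real command reads the test text from `streamin`, which this sketch does not model.

```python
import codecs
import os

WORD_TAG = ';eword'  # assumption: word-boundary tag used in the train file


def load_lines(path):
    """Read a UTF-8 text file, dropping empty lines and trailing newlines."""
    if not os.path.isfile(path):
        raise ValueError('file does not exist: {}'.format(path))
    with codecs.open(path, 'r', encoding='utf8') as stream:
        return [line.strip() for line in stream if line.strip()]


def has_tags(lines, tag=WORD_TAG):
    """True if at least one line contains the boundary tag."""
    return any(tag in line for line in lines)


def resolve_texts(train_file, test_file=None, tag=WORD_TAG):
    """Return (train_text, test_text) following the checklist above."""
    train_text = load_lines(train_file)
    if not has_tags(train_text, tag):
        raise ValueError('train file must contain tags: {}'.format(train_file))

    if test_file is None:
        # only one file: strip tags from the train text and test on that
        untagged = (' '.join(line.replace(tag, ' ').split())
                    for line in train_text)
        test_text = [line for line in untagged if line]
    else:
        # two unrelated files: the second one must already be untagged
        test_text = load_lines(test_file)
        if has_tags(test_text, tag):
            raise ValueError(
                'test file must not contain tags: {}'.format(test_file))

    return train_text, test_text
```

Note that `line.strip()` is used when filtering, because a line read from a file still carries its newline and so `if line` alone would not discard blank lines.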