bootphon / wordseg

A Python toolbox for text based word segmentation
https://docs.cognitive-ml.fr/wordseg
GNU General Public License v3.0

allow different corpora for train and test: TPs #55

Closed alecristia closed 3 years ago

alecristia commented 4 years ago

The goal is to allow true cross-validation (see also #44 and #54), where learning is frozen by the time the algorithm is tested.

For TP, note that the current code already uses signatures such as def _threshold_relative(units, tps), so nothing forces the TPs to be computed on the same text that is parsed. This is the relevant section of the current code:

    # join all the utterances together, separated by ' UB '
    units = [unit for unit in ' UB '.join(
        line.strip() for line in text).split()]

    # compute and count all the unigrams and bigrams (two successive units)
    unigrams = collections.Counter(units)
    bigrams = collections.Counter(zip(units[0:-1], units[1:]))

    # compute the transitional probabilities according to the given
    # dependency measure
    if dependency == 'ftp':
        tps = {bigram: float(freq) / unigrams[bigram[0]]
               for bigram, freq in bigrams.items()}
    elif dependency == 'btp':
        tps = {bigram: float(freq) / unigrams[bigram[1]]
               for bigram, freq in bigrams.items()}
    else:  # dependency == 'mi'
        tps = {bigram: math.log(float(freq) / (
            unigrams[bigram[0]] * unigrams[bigram[1]]), 2)
               for bigram, freq in bigrams.items()}

    # segment the input given the transition probabilities
    cwords = (_threshold_relative(units, tps) if threshold == 'relative'
              else _threshold_absolute(units, tps))

The cleanest is to:

[ ] add an optional parameter for a test file different from the train file
[ ] if the user passes only one file, use that file for both stages
[ ] if the user passes both, use train for training and test for testing
[ ] to this end, check that both files exist; if not, return an error
[ ] extract these lines into a function extract_TPs that takes the train text as input and returns tps:

    # join all the utterances together, separated by ' UB '
    units = [unit for unit in ' UB '.join(
        line.strip() for line in text).split()]

    # compute and count all the unigrams and bigrams (two successive units)
    unigrams = collections.Counter(units)
    bigrams = collections.Counter(zip(units[0:-1], units[1:]))

    # compute the transitional probabilities according to the given
    # dependency measure
    if dependency == 'ftp':
        tps = {bigram: float(freq) / unigrams[bigram[0]]
               for bigram, freq in bigrams.items()}
    elif dependency == 'btp':
        tps = {bigram: float(freq) / unigrams[bigram[1]]
               for bigram, freq in bigrams.items()}
    else:  # dependency == 'mi'
        tps = {bigram: math.log(float(freq) / (
            unigrams[bigram[0]] * unigrams[bigram[1]]), 2)
               for bigram, freq in bigrams.items()}
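
As a rough illustration of that extraction, the lines above could be wrapped roughly as follows. This is only a sketch, not the actual wordseg implementation: the name `extract_tps`, its signature, and the explicit `ValueError` for unknown measures are assumptions (the current code treats any other value as 'mi'), and it assumes Python 3 division.

```python
import collections
import math


def extract_tps(text, dependency='ftp'):
    """Return a dict mapping each bigram of units to its dependency score.

    `text` is an iterable of utterances, each a string of
    space-separated units. Hypothetical helper, for illustration only.
    """
    # Assumption: reject unknown measures instead of falling back to 'mi'
    if dependency not in ('ftp', 'btp', 'mi'):
        raise ValueError('unknown dependency: {}'.format(dependency))

    # join all the utterances together, separated by ' UB '
    units = ' UB '.join(line.strip() for line in text).split()

    # count all the unigrams and bigrams (two successive units)
    unigrams = collections.Counter(units)
    bigrams = collections.Counter(zip(units[:-1], units[1:]))

    if dependency == 'ftp':
        return {bigram: freq / unigrams[bigram[0]]
                for bigram, freq in bigrams.items()}
    if dependency == 'btp':
        return {bigram: freq / unigrams[bigram[1]]
                for bigram, freq in bigrams.items()}
    # dependency == 'mi', same formula as in the snippet above
    return {bigram: math.log(
        freq / (unigrams[bigram[0]] * unigrams[bigram[1]]), 2)
            for bigram, freq in bigrams.items()}


# toy training corpus, two utterances
train = ['a b c', 'a b d']
tps = extract_tps(train, dependency='ftp')
```

Because the function only sees the train text, the returned dictionary can be computed once and then reused unchanged on any other corpus.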

[ ] extract these lines into a function that is the current segment function, except that its input should be the test text AND the tps returned by the extract_TPs function:

    # join all the utterances together, separated by ' UB '
    units = [unit for unit in ' UB '.join(
        line.strip() for line in text).split()]

    # segment the input given the transition probabilities
    cwords = (_threshold_relative(units, tps) if threshold == 'relative'
              else _threshold_absolute(units, tps))

[ ] in main, call extract_TPs first, then segment
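
The file handling and the frozen-TP segmentation step from the checklist might look something like the sketch below. Everything here is hypothetical: `resolve_corpora`, the simplified `segment`, and its `unseen` parameter are illustrative names, and the thresholding is a deliberately simple stand-in (boundary wherever the score falls below a fixed value), not wordseg's `_threshold_relative` or `_threshold_absolute`. It also segments each utterance separately rather than joining them with ' UB '. The hand-written `tps` dictionary stands in for the output of the training step.

```python
import os


def resolve_corpora(train_path, test_path=None):
    """Return (train_path, test_path), reusing train when no test is given."""
    if test_path is None:
        test_path = train_path
    # both files must exist, otherwise report an error
    for path in (train_path, test_path):
        if not os.path.isfile(path):
            raise FileNotFoundError('corpus not found: {}'.format(path))
    return train_path, test_path


def segment(text, tps, threshold, unseen=0.0):
    """Segment `text` with frozen `tps`, yielding one string per utterance.

    Simplified stand-in for the real thresholding helpers: a word
    boundary is placed between two units whose score is below
    `threshold`. Bigrams never seen in training get the score `unseen`
    (an assumption; the policy for unseen bigrams is an open choice).
    """
    for line in text:
        units = line.strip().split()
        if not units:
            yield ''
            continue
        words, word = [], units[0]
        for left, right in zip(units[:-1], units[1:]):
            if tps.get((left, right), unseen) < threshold:
                words.append(word)  # low score: close the current word
                word = right
            else:
                word += right       # high score: glue units together
        words.append(word)
        yield ' '.join(words)


# scores as they might come out of training on a separate corpus
tps = {('a', 'b'): 1.0, ('b', 'c'): 0.5, ('b', 'd'): 0.5}
test = ['a b c', 'a b e']  # ('b', 'e') was never seen in training
segmented = list(segment(test, tps, threshold=0.75))
```

One design point this makes visible: once train and test differ, the test text can contain bigrams that have no entry in tps, so the segmentation step needs an explicit policy for them instead of a plain dictionary lookup.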