The goal is to allow true cross-validation (see also #44, #54), where learning is frozen by the time the algo is tested.
For TP, note that the helpers in the current code already have signatures like def _threshold_relative(units, tps), so nothing forces the tps to be computed on the same text that gets parsed. This is the relevant section of the current code:
# join all the utterances together, separated by ' UB '
units = [unit for unit in ' UB '.join(
    line.strip() for line in text).split()]
# compute and count all the unigrams and bigrams (two successive units)
unigrams = collections.Counter(units)
bigrams = collections.Counter(zip(units[0:-1], units[1:]))
# compute the transitional probabilities according to the given
# dependency measure
if dependency == 'ftp':
    tps = {bigram: float(freq) / unigrams[bigram[0]]
           for bigram, freq in bigrams.items()}
elif dependency == 'btp':
    tps = {bigram: float(freq) / unigrams[bigram[1]]
           for bigram, freq in bigrams.items()}
else:  # dependency == 'mi'
    tps = {bigram: math.log(float(freq) / (
        unigrams[bigram[0]] * unigrams[bigram[1]]), 2)
           for bigram, freq in bigrams.items()}
# segment the input given the transition probabilities
cwords = (_threshold_relative(units, tps) if threshold == 'relative'
          else _threshold_absolute(units, tps))
The cleanest approach is to:
[ ] add an optional parameter for a test file that differs from the train file
[ ] if the user passes only one file, use that file for both stages
[ ] if the user passes both, use train for training and test for testing
[ ] to this end, check that both files exist, and return an error if not
[ ] extract these lines into a function extract_TPs, with the train text as input and tps as output:
# join all the utterances together, separated by ' UB '
units = [unit for unit in ' UB '.join(
    line.strip() for line in text).split()]
# compute and count all the unigrams and bigrams (two successive units)
unigrams = collections.Counter(units)
bigrams = collections.Counter(zip(units[0:-1], units[1:]))
# compute the transitional probabilities according to the given
# dependency measure
if dependency == 'ftp':
    tps = {bigram: float(freq) / unigrams[bigram[0]]
           for bigram, freq in bigrams.items()}
elif dependency == 'btp':
    tps = {bigram: float(freq) / unigrams[bigram[1]]
           for bigram, freq in bigrams.items()}
else:  # dependency == 'mi'
    tps = {bigram: math.log(float(freq) / (
        unigrams[bigram[0]] * unigrams[bigram[1]]), 2)
           for bigram, freq in bigrams.items()}
[ ] extract these lines into a function that is the current segment function, except that its input should be the test text AND the tps output by the extract_TPs function:
# join all the utterances together, separated by ' UB '
units = [unit for unit in ' UB '.join(
    line.strip() for line in text).split()]
# segment the input given the transition probabilities
cwords = (_threshold_relative(units, tps) if threshold == 'relative'
          else _threshold_absolute(units, tps))
[ ] in main, call extract_TPs first, then segment
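A rough sketch of how the checklist above could fit together, offered as a starting point and not as the final design. The names `resolve_files`, `extract_tps` (a lowercase spelling of extract_TPs) and `toy_threshold` are hypothetical; `toy_threshold` is only a stand-in for `_threshold_relative` / `_threshold_absolute`, whose bodies are not shown here, and raising on an unknown dependency or a missing file is an assumption about how errors should be reported.

```python
import collections
import math
import os


def resolve_files(train_file, test_file=None):
    """Hypothetical helper: reuse the train file when no test file is given."""
    if test_file is None:
        test_file = train_file
    for path in (train_file, test_file):
        if not os.path.isfile(path):
            raise FileNotFoundError('no such file: {}'.format(path))
    return train_file, test_file


def extract_tps(train_text, dependency='ftp'):
    """Estimate the TPs from the train utterances only."""
    # join all the utterances together, separated by ' UB '
    units = ' UB '.join(line.strip() for line in train_text).split()

    # count all the unigrams and bigrams (two successive units)
    unigrams = collections.Counter(units)
    bigrams = collections.Counter(zip(units[0:-1], units[1:]))

    if dependency == 'ftp':
        return {bigram: float(freq) / unigrams[bigram[0]]
                for bigram, freq in bigrams.items()}
    if dependency == 'btp':
        return {bigram: float(freq) / unigrams[bigram[1]]
                for bigram, freq in bigrams.items()}
    if dependency == 'mi':
        return {bigram: math.log(float(freq) / (
            unigrams[bigram[0]] * unigrams[bigram[1]]), 2)
                for bigram, freq in bigrams.items()}
    # assumption: reject unknown measures instead of defaulting to 'mi'
    raise ValueError('unknown dependency: {}'.format(dependency))


def segment(test_text, tps, threshold_fn):
    """Segment the test utterances with TPs that were estimated elsewhere."""
    units = ' UB '.join(line.strip() for line in test_text).split()
    return threshold_fn(units, tps)


def toy_threshold(units, tps, cutoff=0.75):
    """Toy stand-in (NOT the project's thresholding): cut where TP < cutoff."""
    words, current = [], []
    for left, right in zip(units, units[1:] + ['UB']):
        if left == 'UB':
            continue
        current.append(left)
        # assumption: bigrams never seen in training get a TP of 0.0
        if right == 'UB' or tps.get((left, right), 0.0) < cutoff:
            words.append(''.join(current))
            current = []
    return words


# "main": learn on the train text first, then segment the test text
train_text = ['a b c', 'a b d']
test_text = ['a b c a b']
frozen_tps = extract_tps(train_text, dependency='ftp')
print(segment(test_text, frozen_tps, toy_threshold))  # ['ab', 'c', 'ab']
```

One thing the sketch brings up: once the tps are frozen, the test text can contain bigrams that never occurred in training (here `('c', 'a')`), so the thresholding functions would need some policy for missing keys. The 0.0 default above is just one option.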