Closed alecristia closed 3 years ago
Hi Alex, actually for DiBS the train text must be in phonologized form (i.e. with the tags), whereas the test file must be in prepared form (without tags). So we cannot do what you specify: "if user passes only one file, then use that file for both stages".
Is that actually sufficient? If not, we can do something like "if user passes only train file, remove tags and use it for testing", but we cannot do the reverse.
Yes, I understand the problem: DiBS has an optimal version for which it needs the train file with tags. Hence their use of train and test, which differ in the presence of tags.
"if user passes only train file, remove tags and use it for testing" is the right thing to do -- what I said is incorrect.
Would this work:
[ ] add an optional parameter that indicates that train and test files are not related
[ ] if user passes only train file, remove tags and use the version without tags for testing
[ ] if user passes two files and the parameter indicating that train and test are not related, then use the first file for train, the second for test
to this end,
[ ] check that both files exist, if not return error
[ ] check that the second file does NOT contain tags, if it does return error
[ ] always check that the first or only file contains tags, if not return error
[ ] and change these lines:
train_text = codecs.open(args.train_file, 'r', encoding='utf8')
train_text = (line for line in train_text if line)
test_text = (line for line in streamin if line)
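For discussion, here is a rough sketch of how the checks in the list above could fit together. It is not the actual DiBS code. The tag markers `;eword` and `;esyll`, the helper names `has_tags`, `strip_tags` and `select_texts`, and the `unrelated` flag are all assumptions made for illustration. The thread also does not say what should happen when two files are passed without the flag, so the sketch simply rejects that case. Checking that the files exist would happen before this, when they are opened.

```python
# Assumed tag markers; the thread does not spell them out.
WORD_TAG = ";eword"
SYLL_TAG = ";esyll"


def has_tags(line):
    """Return True if the line contains a word or syllable boundary tag."""
    return WORD_TAG in line or SYLL_TAG in line


def strip_tags(line):
    """Turn a tagged line into an untagged one with single spaces."""
    for tag in (WORD_TAG, SYLL_TAG):
        line = line.replace(tag, " ")
    return " ".join(line.split())


def select_texts(train_lines, test_lines=None, unrelated=False):
    """Return (train, test) as lists of non-empty lines.

    train_lines: lines of the first (or only) file, which must carry tags.
    test_lines:  lines of the optional second file, which must not.
    unrelated:   hypothetical flag saying the two files are not related.
    """
    train = [line.strip() for line in train_lines if line.strip()]
    if not train or not all(has_tags(line) for line in train):
        raise ValueError("train text must contain tags on every line")

    if test_lines is None:
        # Only one file: derive the test text by removing the tags.
        return train, [strip_tags(line) for line in train]

    if not unrelated:
        # Assumption: two files without the flag is treated as an error.
        raise ValueError("two files given without the 'unrelated' flag")

    test = [line.strip() for line in test_lines if line.strip()]
    if any(has_tags(line) for line in test):
        raise ValueError("test text must not contain tags")
    return train, test
```

With a single tagged file, `select_texts(open_train_file)` would return the tagged lines for training and their tag-free versions for testing. Note that `line.strip()` is used as the filter here because a bare `if line` keeps lines that contain only a newline.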
The goal is to allow true cross validation (see also #44), where learning is frozen by the time the algo is tested.
For DiBS, this was contemplated originally but I suspect the option was deleted when simplifying. Here is the current code:
And the old code is: original_dibs.zip
What we should do then is:
[ ] add an optional parameter for test file different from train file
[ ] if user passes only one file, then use that file for both stages
[ ] if user passes both, then use train for train, test for test
[ ] to this end, check that they both exist, if not return error
[ ] and change these lines:
train_text = codecs.open(args.train_file, 'r', encoding='utf8')
train_text = (line for line in train_text if line)
test_text = (line for line in streamin if line)