Closed mohammad2928 closed 3 years ago
Hi, the point of the crazy notation I suggested is that it is positional. This means that each position in the code should say something with a single character. An 'n' at that position means that it is not applicable.
So use this instead:
n ... not considering, not using
L ... lowercasing
P ... removing punctuation
C ... concatenating all sentences
W ... using mwerSegmenter
M ... using Moses tokenizer
The code should then have more positions:
LPnWM
is probably the standard mwerSegmenter approach.
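To make the positional idea concrete, here is a minimal sketch of how such a code could be decoded. The slot order (L, P, C, W, M) is taken from the legend and the LPnWM example above; the names `SLOTS` and `decode_positional` are made up for illustration and are not part of ASRev.

```python
# One slot per preprocessing step, in the order used by "LPnWM".
# SLOTS and decode_positional are hypothetical names, not ASRev code.
SLOTS = [
    ("L", "lowercasing"),
    ("P", "removing punctuation"),
    ("C", "concatenating all sentences"),
    ("W", "using mwerSegmenter"),
    ("M", "using Moses tokenizer"),
]


def decode_positional(code):
    """Return the descriptions of the steps enabled in a positional code."""
    if len(code) != len(SLOTS):
        raise ValueError(f"expected {len(SLOTS)} positions, got {len(code)}")
    enabled = []
    for position, (char, (flag, description)) in enumerate(zip(code, SLOTS)):
        if char == flag:
            enabled.append(description)
        elif char != "n":
            # Each position accepts only its own letter or 'n' (not applicable).
            raise ValueError(f"position {position}: expected {flag!r} or 'n', got {char!r}")
    return enabled


print(decode_positional("LPnWM"))
```

The strict per-position check is the reason the notation is hard to read without the legend: the same letter is only valid in its own slot.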
But since n is hard to interpret (you have to know the meaning of the position), we could go for listing just the positive ones:
LPWM
LPCM .. all concatenated, lowercased, Moses-tokenized
WM .. picky evaluation, just mwerSegmenter + tokenization
(The following is for discussion only, not to be implemented.)
Since I am horribly struggling with deciphering these flags myself, I also tried to propose a verbose version:
...but I do not really like it, because the descriptions would be too long: lowercased-punctremoved-concatenated-mosestokenized for LPCM.
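For the sake of the discussion, the verbose naming could be generated from the short code rather than typed by hand. This is only a sketch: the words for L, P, C and M come from the LPCM example above, the word for W is my own guess, and `VERBOSE` and `to_verbose` are hypothetical names.

```python
# Short flag -> verbose word. Only L, P, C and M appear in the thread;
# the wording for W is an assumption made for this sketch.
VERBOSE = {
    "L": "lowercased",
    "P": "punctremoved",
    "C": "concatenated",
    "W": "mwersegmented",  # assumed wording, not from the thread
    "M": "mosestokenized",
}


def to_verbose(code):
    """Expand a positive-only code such as 'LPCM' into a verbose label."""
    unknown = [char for char in code if char not in VERBOSE]
    if unknown:
        raise ValueError(f"unknown flag(s): {''.join(unknown)}")
    return "-".join(VERBOSE[char] for char in code)


print(to_verbose("LPCM"))
```

This keeps the short code as the single source of truth, so the long form would only need to appear in the legend or in logs.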
What is the status of this in the code? I see no "L" used in outputs but "L" present in the legend. I also see "n"s still being used.
I have used the following meta-information:
n ... not considering, not using
L ... lowercasing
P ... removing punctuation
C ... concatenating all sentences
W ... using mwerSegmenter
M ... using Moses tokenizer
So far I have applied them only to the WER scores. In the next version I will remove 'n' and apply them to the BLEU, delay, and flicker scores as well.
It's done.
Hi all,
It would be better to change ASRev to be more flexible in the preprocessing phase (P). I think you could pair a number with the "P" character in the ASRev output, as follows:
01 - Lowercasing
02 - Removing punctuation
03 - Both (01 and 02)
It would also make it possible to support more variants later (one number per preprocessing type).
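A rough sketch of how the numbered variants could map to actual preprocessing steps is below. The numbers 01 to 03 are the ones listed above; the function names, the `PREPROCESSING` table, and the choice to strip only ASCII punctuation are my assumptions, not what ASRev does.

```python
import string


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Assumption: only ASCII punctuation is stripped in this sketch.
    return text.translate(str.maketrans("", "", string.punctuation))


# Preprocessing number -> ordered list of steps (hypothetical table).
PREPROCESSING = {
    "01": [lowercase],
    "02": [remove_punctuation],
    "03": [lowercase, remove_punctuation],
}


def preprocess(text, number):
    """Apply the steps registered under the given preprocessing number."""
    if number not in PREPROCESSING:
        raise ValueError(f"unknown preprocessing number: {number}")
    for step in PREPROCESSING[number]:
        text = step(text)
    return text


print(preprocess("Hello, World!", "03"))  # -> hello world
```

Adding a new preprocessing type would then only mean adding one more entry to the table.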
So the output of ASRev,
will be changed to:
Best, Mohammad