Preprocessing in the ASRev

mohammad2928 commented 3 years ago

Hi all,

It would be better to make ASRev more flexible in the preprocessing phase (P). I think you could replace the "P" character in the ASRev output with a number, as follows:

01 - lowercasing
02 - removing punctuation
03 - both (01 and 02)

This would also make it possible to support more options later (one per preprocessing type).

So the output of ASRev,

n ... not considering, not using
P ... preprocessing (lowercasing and punctuation removal)
C ... concatenating all sentences
W ... using mwerSegmenter
M ... using Moses tokenizer

will be changed to:

n  ... not considering, not using
01 ... lowercasing (01), removing punctuation (02), both 01 and 02 (03)
C  ... concatenating all sentences
W  ... using mwerSegmenter
M  ... using Moses tokenizer

Best, Mohammad

obo commented 3 years ago

Hi, the point of the crazy notation I suggested is that it is positional. This means that each position in the code should say something with a single character. An 'n' at that position means that it is not applicable.

So use this instead:

n ... not considering, not using
L ... lowercasing
P ... removing punctuation
C ... concatenating all sentences
W ... using mwerSegmenter
M ... using Moses tokenizer

The code should then have more positions:

LPnWM is probably the standard mwerSegmenter approach.

But since n is hard to interpret (you have to know the meaning of the position), we could go for listing just the positive ones:

LPWM ... probably the standard mwerSegmenter approach
LPCM ... lowercased, punctuation removed, all concatenated, Moses-tokenized
WM ... picky evaluation, just mwerSegmenter + tokenization
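
To make the difference between the two notations concrete, here is a minimal Python sketch. The function names are hypothetical and are not taken from the SLTev code; only the flag letters and their order (L, P, C, W, M) come from the legend above.

```python
# Flag letters in positional order, as in the legend: L, P, C, W, M.
# All names below are illustrative assumptions, not the SLTev API.
FLAG_ORDER = "LPCWM"


def positional_code(enabled):
    """One character per position; 'n' marks a step that is not applied."""
    return "".join(flag if flag in enabled else "n" for flag in FLAG_ORDER)


def compact_code(enabled):
    """List only the steps that are applied, keeping the legend's order."""
    return "".join(flag for flag in FLAG_ORDER if flag in enabled)


def positional_to_compact(code):
    """Drop the 'n' placeholders from a positional code."""
    return code.replace("n", "")


if __name__ == "__main__":
    steps = {"L", "P", "W", "M"}
    print(positional_code(steps))          # LPnWM
    print(compact_code(steps))             # LPWM
    print(positional_to_compact("nnnWM"))  # WM
```

The compact form stays unambiguous as long as every flag letter is unique, which is why "n" can be dropped without losing information.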

(The following is for discussion only, not to be implemented.)

Since I am struggling horribly with deciphering these flags myself, I also tried to propose a verbose version:

L: lowercased
P: punctremoved
C: concatenated (=all treated as one sentence)
W: mwersegmented
M: mosestokenized

...but I do not really like it, because the descriptions would be too long: lowercased-punctremoved-concatenated-mosestokenized for LPCM.
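
For the sake of the discussion only, the verbose expansion could be sketched like this in Python. The dictionary and function names are hypothetical; the long names are the ones listed above.

```python
# Long names for each flag, copied from the verbose list above.
# VERBOSE_NAMES and verbose_label are illustrative names, not part of SLTev.
VERBOSE_NAMES = {
    "L": "lowercased",
    "P": "punctremoved",
    "C": "concatenated",
    "W": "mwersegmented",
    "M": "mosestokenized",
}


def verbose_label(code):
    """Expand a flag code such as 'LPCM' into a hyphen-separated label.

    'n' placeholders from the positional notation are skipped.
    """
    return "-".join(VERBOSE_NAMES[flag] for flag in code if flag != "n")


if __name__ == "__main__":
    print(verbose_label("LPCM"))
    # lowercased-punctremoved-concatenated-mosestokenized
```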

obo commented 3 years ago

What is the status of this in the code? I see no "L" used in outputs but "L" present in the legend. I also see "n"s still being used.

mohammad2928 commented 3 years ago

I have used the following meta-information:

n ... not considering, not using
L ... lowercasing
P ... removing punctuation
C ... concatenating all sentences
W ... using mwerSegmenter
M ... using Moses tokenizer

So far I have applied them only to the WER scores. In the next version I will remove 'n' and apply them to the BLEU, delay, and flicker scores as well.

mohammad2928 commented 3 years ago

It's done.
