ELITR / SLTev

SLTev is a tool for comprehensive evaluation of (simultaneous) spoken language translation.

Running SLTev on ESIC (interpretation corpus) #67

Open bhaddow opened 3 years ago

bhaddow commented 3 years ago

Hi

We are trying to test text-to-text translation on the ESIC corpus. The problem is that ESIC is document aligned, but not sentence aligned. The documents are segmented, but the number of segments does not match between source and target, so SLTev throws an error. Yet it states in the documentation that "segmentation can differ from the reference one"

How should this case be handled in SLTev?

evaluation for  en.OSt.man.orto.txt.slt  failed, the number of Complete lines (C) in  en.OSt.man.orto.txt.OStt  and  en.IStde.man.orto.txt  are not equal
The number of C segment (complete) in en.OSt.man.orto.txt.OStt is 2693 and number of lines in en.IStde.man.orto.txt is 2900

best Barry

mohammad2928 commented 3 years ago

Hi Barry,

When you want to evaluate SLT files (MT with timestamps), you need the OStt file (OSt with timestamps) and the reference. So the number of complete segments in the OStt and the reference must be equal.

Yet it states in the documentation that "segmentation can differ from the reference one"

Yes, it means candidates (MT, SLT, ...) can have a segmentation different from the reference. But in this case the OStt and the reference have different segmentations, and that is not correct.

Please note that the OStt and reference files are gold.
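The constraint amounts to a simple count check: the number of lines starting with C in the OStt must equal the number of reference lines. A minimal sketch of that check (hypothetical helper name, not SLTev's actual code):

```python
def check_segment_counts(ostt_lines, ref_lines):
    """Verify that the number of complete (C) segments in an OStt file
    equals the number of reference lines. Hypothetical sketch of the
    check SLTev performs, not its actual implementation."""
    complete = [line for line in ostt_lines if line.startswith("C ")]
    if len(complete) != len(ref_lines):
        raise ValueError(
            f"number of C segments ({len(complete)}) does not match "
            f"number of reference lines ({len(ref_lines)})"
        )
```

With the files from the error above, `len(complete)` would be 2693 against 2900 reference lines, which triggers the failure.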

If possible, please share with me some of your evaluation files for more help.

Thanks, Mohammad

bhaddow commented 3 years ago

Hi Mohammed

Do the OStt and reference files need to match for the latency and flicker calculations? Because BLEU could use docAsWhole.

I'm attaching the files below (I have to give them txt extensions)

ref en.IStde.man.orto.txt

Ostt en.OSt.man.orto.txt.OStt.txt

slt en.OSt.man.orto.txt.slt.txt

best Barry

mohammad2928 commented 3 years ago

Hi,

There are two ways to solve this problem:
1. Convert the SLT file to MT and then use MTeval.
2. Calculate the BLEU score and flicker for these types of files (where the number of complete segments in the reference and the OStt are not equal). (If needed, I will update SLTev to support these files.)

What do you think about them? Which one is better? If you have an idea, please share it with me.

Also, I am going to add some scripts and entry points for converting the candidate types to each other.

For example, I am going to add the following converters: SLTtoMT (convert SLT to MT), MTtoSLT (convert MT to SLT), ASRTtoASR (convert ASRT to ASR), ASRtoASRT (convert ASR to ASRT).
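As an illustration of what SLTtoMT could do (a sketch based on the SLT line format shown later in this thread, `C/P <timestamps> <text>`; the function name is hypothetical): converting SLT to MT keeps only the complete (C) segments and drops the timestamp fields.

```python
def slt_to_mt(slt_lines):
    """Keep only complete (C) segments from an SLT file and strip the
    three leading timestamp fields, leaving plain MT text lines.
    Sketch of the proposed SLTtoMT converter, not SLTev's code."""
    mt_lines = []
    for line in slt_lines:
        parts = line.split(None, 4)  # label, 3 timestamps, text
        if len(parts) == 5 and parts[0] == "C":
            mt_lines.append(parts[4])
    return mt_lines
```

For example, given a P line and a C line, only the C line's text survives, which is exactly the output an MT evaluation expects.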

What do you think about it? Are they useful?

Best, Mohammad

bhaddow commented 3 years ago

Hi Mohammed

So for solution 1, this would just evaluate the output as MT? If we converted each document to a single segment then this would work, since the corpus is document aligned. But we could then just do this directly with sacrebleu?

I think 2 is more useful, but that means we have to solve how to define flicker and latency when the OStt C-segments and the reference segments do not match - is that correct? I am not sure what the solution is, I would have to think about it.

For the conversion tools, is the MT to SLT tool similar to what I asked about in #33 ? Yes, that would be useful. I think @sukantasen has a script for that. For the other direction, it's less important.

best Barry

mohammad2928 commented 3 years ago

Hi Barry,

So for solution 1, this would just evaluate the output as MT? If we converted each document to a single segment then this would work, since the corpus is document aligned. But we could then just do this directly with sacrebleu?

I think 2 is more useful, but that means we have to solve how to define flicker and latency when the OStt C-segments and the reference segments do not match - is that correct? I am not sure what the solution is, I would have to think about it.

I have updated SLTev to solve this issue; you can upgrade to version v1.2.2 (pip install --upgrade SLTev). When the number of complete segments in the OStt and the reference is not equal, only the BLEU scores (docAsWhole and MWERsegmenter) and flicker will be calculated.

For the conversion tools, is the MT to SLT tool similar to what I asked about in #33 ? Yes, that would be useful. I think @sukantasen has a script for that. For the other direction, it's less important.

I will add them in the next version.

Best, Mohammad

bhaddow commented 3 years ago

Hi Mohammed

That sounds good!

For "docAsWhole", does that mean that the whole test corpus is treated as a document? Is there any way to tell SLTev that the corpus is made up of a number of smaller documents?

best Barry

mohammad2928 commented 3 years ago

Hi Barry,

For "docAsWhole", does that mean that the whole test corpus is treated as a document?

Yes, it concatenates all the test corpus segments into one document.

Is there any way to tell SLTev that the corpus is made up of a number of smaller documents?

I do not understand exactly what you mean! In the second BLEU score calculation, SLTev uses MWERsegmenter to resegment the candidate segments according to the reference segments. But there is no way to tell SLTev that the corpus is made up of multiple documents. If you have an idea for that, please share it with me.

Best, Mohammad

bhaddow commented 3 years ago

But there is no way to tell SLTev that the corpus is made of multiple documents.

This is what I thought. We could allow the user to mark them somehow, for example leaving a blank line between the documents in the reference file? Or we could add an extra letter to the C/P annotation in the source?

mohammad2928 commented 3 years ago

I think using a blank line would make SLTev complex.
Could you please explain the purpose of the multi-document files?

I think you want to get scores for each document separately, yes? I think the best way is to define a special token (e.g. ###ENDDOCUMENT###) and add a parameter (e.g. --multi) to define the type of the documents for evaluation. But the main question is how we should treat the multi-document files.

Proposed idea:
1. Make a parser that parses multi-document files and splits them into multiple documents. Note: the reference, candidate (e.g. SLT), and OStt must all be multi-document files (contain the defined token).
2. Evaluate the split documents separately.

For example, suppose there are 3 files: test.slt, test.ostt, and test.ref, and test.ref contains 3 documents (e.g. [1, 2, 3]). Each file contains two of the defined tokens.

Step 1: parse the files according to the number of documents. This produces the following files: [test.1.slt, test.2.slt, test.3.slt], [test.1.ref, test.2.ref, test.3.ref], [test.1.ostt, test.2.ostt, test.3.ostt]

Step 2: run the following evaluations separately:
SLTeval -i test.1.slt test.1.ref test.1.ostt -f slt ref ostt
SLTeval -i test.2.slt test.2.ref test.2.ostt -f slt ref ostt
SLTeval -i test.3.slt test.3.ref test.3.ostt -f slt ref ostt
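The splitting step of this proposal could be sketched like this (a hypothetical helper, assuming the separator token appears on a line of its own):

```python
def split_documents(lines, token="###ENDDOCUMENT###"):
    """Split a multi-document file, given as a list of lines, into a
    list of documents at each separator-token line. Sketch of the
    proposed parser, not SLTev's implementation."""
    docs = [[]]
    for line in lines:
        if line.strip() == token:
            docs.append([])  # start a new document after the token
        else:
            docs[-1].append(line)
    return docs
```

Each resulting document could then be written to its own test.N.slt / test.N.ref / test.N.ostt file for the separate evaluations above.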

What do you think about this idea?

bhaddow commented 3 years ago

Hi Mohammed

The dev set of the ESIC corpus is aligned at the document level, but not at the sentence level. It contains 28 documents. It's a corpus of interpretation, so should be ideal for evaluating simultaneous SLT, and so we want to use SLTev.

Evaluating all 28 documents separately seems quite awkward for the user. Adding a special token between documents could work.

best Barry

mohammad2928 commented 3 years ago

Hi Barry,

Could you please share with me an example of the dev set?

Evaluating all 28 documents separately seems quite awkward for the user.

What scores would you expect to evaluate? Calculating the BLEU score and flicker is OK, but the delay is complex.

The next idea (for calculating the quality score):
1. Concatenate all of a document's sentences into a single sentence (in the candidate and the reference).
2. Calculate the BLEU score. Note: with this method we can still use MWERsegmenter to resegment the candidate segments.
Do you agree with that?

For example, if there is a file with 28 documents, we will have a file with 28 lines.

Best, Mohammad

bhaddow commented 3 years ago

Hi Mohammed

I linked to a document above, and it's from the ESIC corpus https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3719

For example, if there is a file with 28 documents. we will have a file with 28 lines.

Yes, this seems fine. We could prepare the data this way, or SLTev could convert it internally. But I am not sure why calculating delay is complex? Each of the 28 documents is a speech, and there is an ostt file with timestamps for each speech

best Barry

mohammad2928 commented 3 years ago

Hi Barry,

I linked to a document above, and it's from the ESIC corpus https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3719

Thanks.

Yes, this seems fine. We could prepare the data this way, or SLTev could convert it internally. But I am not sure why calculating delay is complex? Each of the 28 documents is a speech, and there is an ostt file with timestamps for each speech

SLTev will convert them internally. Delay calculation is complex if we want to calculate the delay for each document separately: for example, with a file of 20 documents, all types of delay have to be calculated for each document. But if we calculate the delay once for the whole file (with several documents), it is OK.

Let me summarise the above messages; if you agree with them, please confirm and I will start to implement them.

A. Adding a file converter to the SLTev

  1. SLTtoREF (converts SLT/ASRT files to reference/ASR files). Note: the reverse direction is not useful, because delay and flicker are zero for MT and ASR evaluation.

B. Adding multi-document support module

  1. Add a token to separate documents in the OStt, reference, and candidates (e.g. ###ENDDOCUMENT###).
  2. Remove the separation token and calculate delay and flicker as normal.
  3. For the quality scores, concatenate all of a document's sentences into a single sentence (in the candidate and the reference).
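Point B.3 amounts to the following sketch (hypothetical, not SLTev's code): each document's sentences are joined into one line, so a 28-document corpus becomes a 28-line file.

```python
def docs_to_lines(documents):
    """Join each document's sentences into one line, producing one line
    per document for document-level quality scoring (BLEU). Sketch of
    the proposed behaviour, not SLTev's implementation."""
    return [" ".join(sent.strip() for sent in doc) for doc in documents]
```

The same transformation would be applied to both the candidate and the reference before scoring.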

Thanks, Mohammad

bhaddow commented 3 years ago

Hi Mohammad

This sounds good. Just a couple of questions:

SLTtoREF (for convert SLT/ASRT files to reference/ASR files)

What is this needed for?

Removing the separation token and calculation delay and flicker as normally.

In our case, the timestamps start from 0 for each document, so will this work as normal?

best Barry

mohammad2928 commented 3 years ago

Hi,

What is this needed for?

SLTeval needs the OStt files for evaluation, and in some cases there is no OStt file, so we need to convert the SLT files to MT and evaluate them with MTeval.

In our case, the timestamps start from 0 for each document, so will this work as normal?

Unfortunately, no: because the timestamps restart for each document, we cannot use them as they are, so we need to calculate the delay scores for each document separately. But just a question:

How should we display the delay scores? Is there a need to print the scores for each document? I think we can compute and show their average.

Best, Mohammad

bhaddow commented 3 years ago

Hi Mohammad

I meant that delay will work as normal "in each document". I agree that we need to find some way of combining them, and I think the default should be a mean.
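Combining the per-document delays by a mean, as suggested, is straightforward (hypothetical sketch; a length-weighted mean would be a natural variant):

```python
def combine_delays(per_document_delays):
    """Combine per-document delay scores into a single score by taking
    their arithmetic mean, the default suggested in this thread.
    Hypothetical sketch, not SLTev's implementation."""
    if not per_document_delays:
        raise ValueError("no per-document delay scores given")
    return sum(per_document_delays) / len(per_document_delays)
```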

best Barry

mohammad2928 commented 3 years ago

Hi Barry,

Thanks for the good interaction and consultation. I will start implementing the multi-docs evaluation module in SLTev.

Thanks, Mohammad

mohammad2928 commented 3 years ago

Hi Barry,

I have prepared a multi-docs evaluation version of SLTev; please upgrade to version 1.2.3. Please read the README and update it if you would like.

You can use the following files as samples:

docs.ref

Stejně jako většina komunit máme i svá pravidla a řídící orgán.
Uživatelé zveřejňují všechny nejnovější zprávy a události.
###docSpliter###
Planet Ubuntu je sbírka komunitních blogů.
Ubuntu je v současné době financována společností Canonical Ltd.

docs.ostt

P 215 218 Like
P 215 220 Like most
P 215 223 Like most communities
P 215 228 Like most communities we
P 215 237 Like most communities we have our rules
P 215 240 Like most communities we have our rules and
P 215 244 Like most communities we have our rules and governing
C 215 250 Like most communities we have our rules and governing body.
P 254 256 The
P 254 260 The users
P 254 263 The users post
P 254 270 The users post all the latest
P 254 273 The users post all the latest news
P 254 277 The users post all the latest news and events
C 254 283 The users post all the latest news and events.
###docSpliter###
P 290 293 Planet
P 290 298 Planet Ubuntu
P 290 301 Planet Ubuntu is a
P 290 307 Planet Ubuntu is a collection
P 290 309 Planet Ubuntu is a collection of
P 290 312 Planet Ubuntu is a collection of community
P 290 314 Planet Ubuntu is a collection of community
C 290 319 Planet Ubuntu is a collection of community blogs.
P 325 328 Ubuntu
P 325 330 Ubuntu is
P 325 337 Ubuntu is currently
P 325 344 Ubuntu is currently funded by
P 325 348 Ubuntu is currently funded by Canonical
C 325 352 Ubuntu is currently funded by Canonical Ltd.

docs.slt

P 233 218 229 Jako většina komunit
P 236 218 232 Stejně jako většina komunit i my
P 244 218 240 Stejně jako většina komunit máme i naše závod
P 251 218 249 Stejně jako většina komunit máme i závod a
C 263 218 260 Jako většina komunit máme i náš závod a řídící orgán.
P 272 262 269 Uživatelé příspěvky
P 286 262 281 Uživatelé zveřejňují všechny nejnovější
P 295 262 288 Uživatelé zveřejňují všechny poslední zprávy
C 303 262 296 Uživatelé zveřejňují všechny poslední zprávy a díla.
###docSpliter###
P 310 300 307 Planeta Ubuntu
P 308 300 316 Planet Ubuntu je kolekce
P 310 300 318 Planet Ubuntu je sada
P 329 300 323 Planet Ubuntu je sada blogů
P 339 300 334 Planet Ubuntu je sada blogů je financováno
P 348 300 341 Planet Ubuntu je sada blogů je financováno společností
C 361 300 352 Planet Ubuntu je sada blogů a Ubuntu je financováno společností Canonical Lt.

usage:

SLTeval -i docs.slt docs.ref docs.ostt -f slt ref ostt --docs

Thanks, Mohammad

bhaddow commented 3 years ago

Great thanks - we will try it. @sukantasen