ELITR / SLTev

SLTev is a tool for comprehensive evaluation of (simultaneous) spoken language translation.

easy and flexible entry points #47

Closed. Gldkslfmsd closed this issue 3 years ago.

Gldkslfmsd commented 3 years ago

Hello,

I have two ASR candidates and I want to compare WER. I want an option to run "eval-ASR reference < hypothesis".

Right now I can do it with SLTev, but doing it manually is so complicated that it needs a wrapper script: one that creates a directory, puts the files there under the required names together with dummy values to fit the format, runs SLTev, parses and prints the results, and then cleans up the mess it had to make. Why isn't that wrapper inside SLTev?
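For illustration, a rough sketch of the kind of wrapper meant here; the directory layout, file names, SLTev invocation and the grepped output line are all placeholders, not SLTev's actual interface:

    #!/bin/bash
    # Hypothetical wrapper: score one ASR hypothesis against one reference with SLTev.
    # Everything SLTev-specific below (file naming, flags, output format) is a placeholder.
    set -e
    ref=$1
    hyp=$2
    workdir=$(mktemp -d)

    # Put the inputs under the names the evaluation format expects,
    # plus a dummy companion file so the format check passes.
    cp "$ref" "$workdir/doc.en.OSt"
    cp "$hyp" "$workdir/doc.en.asr"
    touch "$workdir/doc.en.OStt"          # dummy value only to fit the format

    # Run SLTev on the prepared directory and keep only the WER line.
    SLTev --directory "$workdir" > "$workdir/out.txt"
    grep -i "WER" "$workdir/out.txt"

    rm -rf "$workdir"                     # clean up the mess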

And similarly with other options: eval finalized MT quality, flicker, latency etc.

mohammad2928 commented 3 years ago

Hi Dominik,

I am going to add some entry points to SLTev. I have two ideas for it, as follows:

1- A single entry point (e.g. SLTeval) with parameters. Here is a demo of it:

SLT/MT evaluation:

 SLTeval --slt <SLT/MT-path> --source <Source-path> --reference <reference-paths> 

ASR evaluation:

SLTeval --asr <ASR-path> --source <Source-path> 

2- Two independent entry points (e.g. SLTeval and ASReval). I have two variants for this option, as follows:

A- With parameters:

SLT/MT evaluation:

SLTeval --slt <SLT/MT-path> --source <Source-path> --reference <reference-paths> 

ASR evaluation:

ASReval --asr <ASR-path> --source <Source-path> 

B- Without parameters:

SLT/MT evaluation:

SLTeval <Source-path>  <reference-paths>  < <SLT/MT-path>

ASR evaluation:

ASReval <Source-path> < <ASR-path>

Note: SLTeval and ASReval are only suggested names; if you have better names for them, please let me know.

Please let me know what you think and which one is simpler to use. Any comments and suggestions are appreciated.

Thanks,

Gldkslfmsd commented 3 years ago

Good start. I like 2A the most, but it's still not everything I am suggesting. You assume there are two tasks, SLTeval and ASReval, but I can see more:

1) finalized reference + finalized MT hypothesis => MT quality (one BLEU score per corpus)

2) timestamped online MT hypothesis => flicker (=erasure, more accurately)

3) finalized gold transcript + finalized ASR transcript => WER

4) timestamped gold transcript + timestamped online MT => end-to-end latency (utterance to translation)

5) timestamped ASR hypothesis + timestamped online MT => translation latency (ASR to translation)

6) timestamped online MT hypothesis + timestamped reference translation => MT quality in time segments

7) ...etc.

I think that each task should be runnable in isolation from the others and have its own entry point. Then it is useful for users who, e.g., have a timestamped online MT output but not the other files, and want to know the flicker and nothing else, without having to create dummy files to fit the format and then look for the one meaningful number mixed in with the others.

Next, it's important to be able to run the entry points on:

I) single document = single tuple of (reference, gold transcript, mt hypothesis ...) source files

II) list of documents = list of tuples -- and then aggregate the scores

III) the ELITR testset, for users who want to evaluate on it directly.

For I) and II), there could be two ways to pass the files: on standard input, or as command-line arguments, as my eval_erasure.py does.

On input: --input_filenames and --filenames_order hyp ref gold asr mt OSt OStt. The standard input would then be filenames, one tuple per line, in the order specified by the --filenames_order parameter.
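To illustrate, such a call could look like the following; the entry-point name SLTev-wer and the file names are invented for the example, and --input_filenames is read here as a switch saying that standard input carries filenames:

    SLTev-wer --input_filenames --filenames_order gold asr <<EOF
    doc1.gold.txt doc1.asr.txt
    doc2.gold.txt doc2.asr.txt
    EOF

Each line of standard input names the files of one document tuple, in the order given by --filenames_order, and the per-document scores would then be aggregated.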

I don't care much about the technical way to implement it or how to name the parameters, as long as the names are meaningful, free of typos, preferably self-explanatory, and briefly explained in the help message and in the README.

For example, it can be analogous to the online-text-flow entry points: SLTev erasure [parameters], or SLTev-erasure [parameters].

mohammad2928 commented 3 years ago

Thanks for your great idea, I will use it in the next version, but I do not agree with all of it (having multiple entry points), as it will increase complexity.

I am going to modify SLTev as follows:

1- Dropping useless information, such as the delay for MT files.
2- Adding an entry point named "SLTev-erasure":

SLTev-erasure --inputs <input-paths> --format_orders <input-file-formats> 

For example:

SLTev-erasure --inputs ./test.OSt ./test.asr  --format_orders source asr

Notes:
1- There are the following formats:
source or ost (I am not sure.)
ref
ostt
asr
slt
mt
align

2- In general, we have three types of hypotheses:
SLT: timestamped online MT hypothesis
MT: finalized MT hypothesis
ASR: timestamped ASR hypothesis and finalized ASR transcript

3- This method supports a single document or a set of documents.
4- To evaluate on the ELITR testset, users can use the current version of SLTev.

What do you think about this method? In my opinion, it will handle all the different types of inputs.

Gldkslfmsd commented 3 years ago

> Thanks for your great idea, I will use it in the next version, but I do not agree with all of it (having multiple entry points), as it will increase complexity.

It increases complexity for developers, but decreases it for users. The tool is extremely complex now; there is a steep learning curve to using it. With multiple entry points, the learning could go step by step, which is much smoother. And of course you can keep the option to evaluate everything at once, as it is now.

> I am going to modify SLTev as follows:
>
> 1- Dropping useless information, such as the delay for MT files.

I'm not sure whether anybody has ever used it, but I have not, mostly because I don't know how exactly it works, or whether it really works.

> 2- Adding an entry point named "SLTev-erasure":
>
> SLTev-erasure --inputs <input-paths> --format_orders <input-file-formats>
>
> For example:
>
> SLTev-erasure --inputs ./test.OSt ./test.asr --format_orders source asr

For erasure you need only the timestamped MT or ASR file, not both, so the format order is useless here. If a user wants to count erasure both on the OSt and on the asr file, they can use the tool twice.
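Under that argument, the call could shrink to something like this (just a sketch of the suggested simplification, not an agreed interface):

    SLTev-erasure --inputs ./test.asr

with a second, separate run for the other file if its erasure is also wanted.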



> Notes:
> 1- There are the following formats:
> source or ost (I am not sure.)
> ref
> ostt
> asr
> slt
> mt
> align
>
> 2- In general, we have three types of hypotheses:
> SLT: timestamped online MT hypothesis
> MT: finalized MT hypothesis
> ASR: timestamped ASR hypothesis and finalized ASR transcript

Timestamped and finalized ASR are two different types, so you have 4 types in total.

> 3- This method supports a single document or a set of documents.
> 4- To evaluate on the ELITR testset, users can use the current version of SLTev.

> What do you think about this method?

I think it's OK.

> In my opinion, it will handle all the different types of inputs.

Do it and we will see.

mohammad2928 commented 3 years ago

> Thanks for your great idea, I will use it in the next version, but I do not agree with all of it (having multiple entry points), as it will increase complexity.
>
> It increases complexity for developers, but decreases it for users. The tool is extremely complex now; there is a steep learning curve to using it. With multiple entry points, the learning could go step by step, which is much smoother. And of course you can keep the option to evaluate everything at once, as it is now.

It is not complex for developers. If you think having multiple entry points decreases complexity for users, I will add 3 entry points to SLTev, as follows:

Finalized ASR transcript evaluation:

ASReval --inputs ./test.OSt ./test.asr --format_orders source asr

Timestamped ASR hypothesis evaluation:

ASReval --inputs ./test.OSt ./test.asr --format_orders source asrt

Timestamped online MT hypothesis evaluation:

SLTeval --inputs ./test.OSt ./test.slt ./test.OStt --format_orders ref slt ostt

Finalized MT hypothesis evaluation:

MTeval --inputs ./test.OSt ./test.mt --format_orders ref mt

> I'm not sure whether anybody has ever used it, but I have not, mostly because I don't know how exactly it works, or whether it really works.

You can test it with the sample-data; the files are very simple and clean, and you can trace it easily.

> Timestamped and finalized ASR are two different types, so you have 4 types in total.

Yes, but we can add asrt to the format list instead.

Gldkslfmsd commented 3 years ago

OK. We all agree :)

Gldkslfmsd commented 3 years ago

So, Sukanta and Dominik support the idea of having the entry points as above. By default, each entry point should print all the possible scores; with an option, it should print only the listed ones.
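For example, such an option could look like the following; the --scores flag is only a placeholder name, not an agreed interface:

    MTeval --inputs ./test.OSt ./test.mt --format_orders ref mt                # all scores
    MTeval --inputs ./test.OSt ./test.mt --format_orders ref mt --scores BLEU  # only BLEU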

obo commented 3 years ago

@mohammad2928 Please confirm that this has been implemented. I haven't followed the discussion here, but I think it has all been resolved. If yes, please close. Thanks, O.

mohammad2928 commented 3 years ago

Yes, it has been implemented.