obo commented 3 years ago

Ondrej asks Ebrahim to propose several solutions to the problem of indicating which files are the input and which are the reference.

The exact problem is this:

a given index specifies a set of documents
each document comes in multiple versions (language variants, modalities, …)
a single document can come in several modalities that allow for different uses, see e.g. https://github.com/ELITR/elitr-testset/tree/master/documents/wmt18-newstest-sample-read and the related issue 11 there (https://github.com/ELITR/elitr-testset/issues/11).

We need an approach to indicate which is the source and which is the reference document so that it works across all indices and all documents.

Imagine that you want an index for en2cs MT and you want it to include:

wmt18-newstest-sample-read (28 documents of the same kind)
confidential/amalach-sample-interview (1 document as of now)

These two collections of documents use different suffixes, for a good reason.

My proposal: Make index interleaved with ‘suffix pair specifiers’:

# index for MT from EN to CS

# files from one document collection:
# SRC->REF: *.en.OSt -> *.cs.OSt
wmt18-newstest-sample-read

# files from another document collection
# SRC->REF: *.en -> *.cs
confidential/amalach-sample-interview

Whatever follows the line with a ‘suffix pair specifier’ (SRC->REF) will be interpreted according to the suffix pair specifier.
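To make the interpretation rule concrete, here is a minimal sketch of how such an interleaved index could be read. It is only an illustration: the function name `parse_pair_index` and the exact whitespace handling are assumptions, not part of SLTev or elitr-testset.

```python
import re

# Matches a specifier line such as "# SRC->REF: *.en.OSt -> *.cs.OSt".
SPEC_RE = re.compile(r"^#\s*SRC->REF:\s*(\S+)\s*->\s*(\S+)\s*$")


def parse_pair_index(text):
    """Return (collection, src_pattern, ref_pattern) for every index entry.

    Hypothetical helper: each entry is interpreted according to the most
    recent 'suffix pair specifier' that precedes it.
    """
    entries = []
    current = None  # the most recent suffix pair specifier, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEC_RE.match(line)
        if match:
            current = (match.group(1), match.group(2))
        elif line.startswith("#"):
            continue  # ordinary comment, ignored
        elif current is None:
            raise ValueError(f"no SRC->REF specifier before entry: {line}")
        else:
            entries.append((line, current[0], current[1]))
    return entries


INDEX = """\
# index for MT from EN to CS

# files from one document collection:
# SRC->REF: *.en.OSt -> *.cs.OSt
wmt18-newstest-sample-read

# files from another document collection
# SRC->REF: *.en -> *.cs
confidential/amalach-sample-interview
"""

for entry in parse_pair_index(INDEX):
    print(entry)
```

Ordinary comments stay legal because only lines that match the `SRC->REF` shape are treated as specifiers.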

If we want more file types to be considered (quite likely), we could have more suffix specifiers, not just a pair:

# An index for MT evaluation with a metric that further focuses on some 'dictionary' scoring:
# SRC: *.en.OSt
# REF: *.cs.OSt
# REFDICT: *.cs.dictionary
wmt18-newstest-sample-read

mohammad2928 commented 3 years ago

Here is a summary of the email discussion about this issue (with some changes).

Suggestions for the templates of the elitr-testset files:

The source and reference files: < file-name >.< language >.OSt E.g., test-file.cs.OSt, test-file.en.OSt
The OStt files: < file-name >.< language >.OStt E.g., test-file.cs.OStt, test-file.en.OStt
The alignment files (which are generated by our external script which uses Giza++): < file-name >.< source-language >.< target-language >.align E.g., test-file.cs.en.align, test-file.en.cs.align
The output files, i.e., asr, slt and mt files: < file-name >.< source-language >.< target-language >.asr/slt/mt E.g., test-file.en.en.asr, test-file.cs.slt, test-file.cs.en.mt

Notes: 1) If more than one file exists, we can add a number to the end of the suffix. E.g., test-file.cs.OSt1, test-file.cs.OSt2, or test-file.en.cs.slt1, test-file.en.cs.slt2
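The templates above can be checked mechanically. The following is a rough sketch of a regular expression that splits a file name into its parts under this scheme. The two-letter language codes and the function name `parse_name` are assumptions made for the example; they are not taken from SLTev.

```python
import re

# <file-name>.<language>[.<target-language>].<kind>[<number>]
# Assumption: language codes are two lowercase letters, as in the examples.
NAME_RE = re.compile(
    r"^(?P<stem>.+?)"
    r"\.(?P<langs>[a-z]{2}(?:\.[a-z]{2})?)"
    r"\.(?P<kind>OStt|OSt|align|asr|slt|mt)"
    r"(?P<number>\d+)?$"
)


def parse_name(filename):
    """Split a file name into stem, languages, kind and optional number.

    Hypothetical helper; returns None when the name does not follow
    the template.
    """
    match = NAME_RE.match(filename)
    if match is None:
        return None
    number = match.group("number")
    return {
        "stem": match.group("stem"),
        "languages": match.group("langs").split("."),
        "kind": match.group("kind"),
        "number": int(number) if number else None,
    }


for name in ["test-file.cs.OSt", "test-file.en.cs.slt2", "test-file.cs.en.align"]:
    print(name, parse_name(name))
```

Note that with the number at the very end, the kind has to be recognized from a fixed list; a plain split on the last dot would yield `slt2` rather than `slt`.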

eebism commented 3 years ago

As we agreed before, the naming is decided. However, I don't agree with the idea behind the numbering. If we put the number at the end of the last extension, it could confuse users and systems about these extensions. For example, a user could take OSt2 and OSt3 for two different extensions while they are the same kind of file.

My suggestion is to put the number after the language extension(s). This preserves the readability of the file names and doesn't add much complexity to the implementation (as far as I know). In other words, if there is more than one file for a specific name, we put the number one block before the main (the last) extension. For example, test-file.cs1.OSt, test-file.cs2.OSt, or test-file.en.cs1.slt, test-file.en.cs2.slt.

One last point: do we need another dot to make this separation clearer (e.g., test-file.cs.1.OSt)? My answer is no. I guess Mohammad also prefers the first form (i.e., test-file.cs1.OSt), since he thinks more dots could be a bit confusing and not pretty. What do you think, @obo?
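To illustrate the difference between the two numbering schemes, here is a small sketch that moves a trailing number to the position after the language block. The helper name `move_number` and the fixed list of extensions are assumptions for the example only.

```python
import re

# Names numbered at the end of the suffix, e.g. "test-file.cs.OSt2".
# Assumption: only the extensions discussed in this thread are numbered.
TRAILING_RE = re.compile(
    r"^(?P<head>.+)\.(?P<ext>OStt|OSt|align|asr|slt|mt)(?P<number>\d+)$"
)


def move_number(filename):
    """Rewrite 'x.cs.OSt2' as 'x.cs2.OSt'; leave other names unchanged."""
    match = TRAILING_RE.match(filename)
    if match is None:
        return filename
    return f"{match.group('head')}{match.group('number')}.{match.group('ext')}"


for name in ["test-file.cs.OSt2", "test-file.en.cs.slt1", "test-file.cs.OSt"]:
    renamed = move_number(name)
    # With the number before the last extension, the file type is simply
    # whatever follows the last dot.
    print(name, "->", renamed, "| type:", renamed.rsplit(".", 1)[1])
```

Under this layout a tool can read the file type with a plain split on the last dot, without knowing the list of extensions in advance.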

mzilinec commented 3 years ago

Hello, I'm not sure if this is the correct issue and what is the big picture but I would like to propose simply adding a command line parameter to customize these extensions, as we have some indexes with different extensions and it is very impractical to rename them each time. Could this work, at least as a temporary solution? @Gldkslfmsd @obo

SLTev
  --override-source-ext 'en.OSt'
  --override-target-ext '.TTcs[\d]+'

Alternatively, I'm generating symbolic links with a script at the moment.
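For illustration, the proposed options could be wired up roughly as follows. This is a sketch of the proposal only, not of SLTev's actual command line: the defaults, the positional `files` argument, and the choice to treat each value as a regular expression anchored at the end of the file name are all assumptions.

```python
import argparse
import re


def build_parser():
    """Build a parser for the two proposed (hypothetical) override options."""
    parser = argparse.ArgumentParser(prog="SLTev-sketch")
    parser.add_argument("--override-source-ext", default=r"en\.OSt",
                        help="regex that source file names must end with")
    parser.add_argument("--override-target-ext", default=r"cs\.OSt",
                        help="regex that reference file names must end with")
    parser.add_argument("files", nargs="*", help="candidate input files")
    return parser


def select(files, ext_regex):
    """Keep the files whose names end with something matching ext_regex."""
    pattern = re.compile(r"(?:%s)$" % ext_regex)
    return [name for name in files if pattern.search(name)]


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args([
    "--override-source-ext", "en.OSt",
    "--override-target-ext", r".TTcs[\d]+",
    "talk.en.OSt", "talk.TTcs1", "talk.TTcs12", "talk.cs.OSt",
])
print("sources:", select(args.files, args.override_source_ext))
print("targets:", select(args.files, args.override_target_ext))
```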

Gldkslfmsd commented 3 years ago

Could this work, at least as a temporary solution?

Not for me. I vote for as much flexibility as possible. I don't even like the restriction that reference and candidate must have the same prefix. I might have document.ref, document.kit-asr-s2s.txt vs document.kit-asr-hybrid.txt, and don't want to link the reference twice. And I don't need the language tags in the names because I work with only one language pair anyway. See #47.

Alternatively, I'm generating symbolic links with a script at the moment.

My impression is that SLTev should be useful without any wrapper script, unless you have really unique requirements. The work that the wrapper does should be implemented inside SLTev.

obo commented 3 years ago

Let's revisit this discussion. The big picture is this:

SLTev without elitr-testset should be as flexible as possible, as Dominik asks; the user should be free to provide any file name for any suffix
SLTev with elitr-testset should be simple and yet very flexible because the same document (with some strange suffix) can serve multiple purposes in different indices.

@Gldkslfmsd: Please confirm that the current no-elitr-testset usage is OK for you.

For the repurposing of documents in a given elitr-testset index, I think my proposal is so far the only one that addresses the goal. Please re-read my very top comment in this thread. Mohammad's summary of suffixes and Ebrahim's addition on numbering individual reference translations are all OK but do not address repurposing. What Matus proposed seems nice but would require users of elitr-testset to know the repurposing, which is a complication.

So I tend to conclude:

keep the flexibility for Dominik (which we hopefully have)
add the interpretation of the suffixes in elitr-testset indices
definitely make sure SLTev is verbose (it is, AFAIK) about which file is being evaluated against which.

eebism commented 3 years ago

Considering issue #57 (https://github.com/ELITR/SLTev/issues/57#issue-824980509), we need another extension for the reference translation file. OSt stands for "Original Song Transcript," and we can't use it as our reference translation file. My suggestion is to use < file-name >.< language >.ref to avoid ambiguity. What do you think, @mohammad2928 and @obo?

srdecny commented 3 years ago

I'm adding functionality to the Pipeliner to process files with a given pipeline, so that the pipeline and its components can be evaluated automatically. Naturally, SLTev will be used for the actual evaluation. The files to be evaluated will be defined by an index file, but there has to be a way to specify which files from the index are the source files and which are the reference files; relying on the extensions alone is very fragile.

After discussing this issue with Ondrej, we came to this conclusion:

Index files will contain a section with meta-information about the source and reference suffixes, as outlined in Ondrej's comment. An index can contain multiple such sections (!).
I will implement a simple parser for these index files with meta-information. The parser will consume the index file and emit (source, ref) pairs.
This parser will live in SLTev
Other users and tools (such as pipeliner) can call this parser externally, and SLTev should call it internally.

srdecny commented 3 years ago

So just to reiterate, here's how the index file is probably going to look (the syntax may change):

# SRC: *.en.OSt
# REF: *.cs.OSt
wmt18-newstest-sample-read

# SRC: *.en
# REF: *.cs
different-directory-with-different-suffixes
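As a rough illustration of the pairing step, the sketch below matches source and reference files that share everything before the suffix. The function names and the sample file names are hypothetical; the real index_parser may work differently.

```python
def suffix_of(pattern):
    """Turn a specifier such as '*.en.OSt' into the plain suffix '.en.OSt'."""
    if not pattern.startswith("*") or len(pattern) < 2:
        raise ValueError(f"expected a pattern like '*.suffix', got: {pattern}")
    return pattern[1:]


def pair_files(filenames, src_pattern, ref_pattern):
    """Return (source, ref) pairs for files sharing the part before the suffix.

    Sources without a matching reference are skipped in this sketch.
    """
    src_suffix = suffix_of(src_pattern)
    ref_suffix = suffix_of(ref_pattern)
    available = set(filenames)
    pairs = []
    for name in sorted(available):
        if not name.endswith(src_suffix):
            continue
        ref = name[: -len(src_suffix)] + ref_suffix
        if ref in available:
            pairs.append((name, ref))
    return pairs


# Hypothetical directory listing for one collection.
listing = ["doc1.en.OSt", "doc1.cs.OSt", "doc2.en.OSt", "doc2.cs.OSt",
           "doc3.en.OSt", "README"]
for source, ref in pair_files(listing, "*.en.OSt", "*.cs.OSt"):
    print(source, "->", ref)
```

Whether an unpaired source should be skipped silently, as here, or reported as an error is a policy choice the sketch does not settle.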

mohammad2928 commented 3 years ago

Hi @srdecny, Sorry for the delayed response.

This method is nice, but there are some challenges that should be solved:

1- To evaluate SLT files, we need .OStt and .align files in addition to the source and reference files, so they should be specified in the index file and recognized by index_parser.
2- We need to know the source and target languages in order to match the input files (src, ref, OStt, ...) to the candidate file or files (the system output).
3- We need a policy for using user files next to indexed files (all of them in one folder).

srdecny commented 3 years ago

Hi @mohammad2928, thanks for the suggestions.

1 - There are two options we can take.

First one, to make it more general, perhaps allow any meta-annotations, such as # ALIGNMENT *.align, etc. The parser will then have to return a more complicated data structure (instead of the <SOURCE, REF> tuple as it does now), because there can be many more additional meta-files needed for the evaluation. I'm leaning towards JSON, that one should be easy to pass around.

Second option is to assume the extension of these additional files won't change (and only the extension of the source and reference files does), and the parser would simply report a comma-separated line, like so: source, reference, ostt file, alignment file etc, assuming the additional files have the same name (sans the extension) as the source and reference files. This will be slightly easier to implement and parse, but at the loss of the flexibility when we need to introduce another type of a file for evaluation.

I'm leaning slightly towards the first option, but it's certainly possible the latter option will do just fine for us without making things complicated. @obo , what do you think about this?

2 - I believe this issue could be solved by the more verbose annotations above, is that correct?
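To make option 1 more tangible, here is a minimal sketch of a parser that accepts arbitrary `# KEY: pattern` annotations and returns a JSON-friendly structure. It assumes the colon syntax from the index example earlier in the thread; the function name `parse_index`, the output layout, and the `*.en.cs.align` pattern are illustrative assumptions, not the actual index_parser.py.

```python
import json
import re

# An annotation line looks like "# SRC: *.en.OSt" (upper-case key, one pattern).
ANNOTATION_RE = re.compile(r"^#\s*([A-Z_]+):\s*(\S+)\s*$")


def parse_index(text):
    """Return a list of sections, each with its patterns and directories."""
    sections = []
    specs = {}           # annotations collected for the section being read
    open_section = None  # section that directory lines are currently added to
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ANNOTATION_RE.match(line)
        if match:
            if open_section is not None:
                # An annotation after directory lines starts a new section.
                specs, open_section = {}, None
            specs[match.group(1)] = match.group(2)
        elif line.startswith("#"):
            continue  # ordinary comment
        else:
            if open_section is None:
                open_section = {"patterns": dict(specs), "directories": []}
                sections.append(open_section)
            open_section["directories"].append(line)
    return sections


INDEX = """\
# SRC: *.en.OSt
# REF: *.cs.OSt
# ALIGNMENT: *.en.cs.align
wmt18-newstest-sample-read

# SRC: *.en
# REF: *.cs
different-directory-with-different-suffixes
"""

print(json.dumps(parse_index(INDEX), indent=2))
```

Because the keys are not hard-coded, a new file type (for example a terms dictionary) only needs a new annotation line, which is the flexibility that option 2 would give up.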

obo commented 3 years ago

Hi,

I definitely vote for option 1, adding additional flavors of the document:

ALIGNMENT: ...
TERMS_DICT: ...

This terms dictionary could, e.g., list just a few key words on the target side that the system really has to produce.

Thanks, O.


srdecny commented 3 years ago

I've implemented option 1. See the README and index_parser.py for details.

ELITR / SLTev

suffix specifiers in indices #19
