Closed (obo closed this issue 3 years ago)
Here is a summary of the email discussion about this issue (with some changes).
Suggested templates for the elitr-testset file names:
- Source and reference files: < file-name >.< language >.OSt, e.g., test-file.cs.OSt, test-file.en.OSt
- OStt files: < file-name >.< language >.OStt, e.g., test-file.cs.OStt, test-file.en.OStt
- Alignment files (generated by our external script, which uses Giza++): < file-name >.< source-language >.< target-language >.align, e.g., test-file.cs.en.align, test-file.en.cs.align
- Output files, i.e., asr, slt and mt files: < file-name >.< source-language >.< target-language >.asr/slt/mt, e.g., test-file.en.en.asr, test-file.cs.slt, test-file.cs.en.mt
Notes:
- If more than one file exists, we can add a number to the end of the suffix, e.g., test-file.cs.OSt1, test-file.cs.OSt2, or test-file.en.cs.slt1, test-file.en.cs.slt2
As we agreed before, the naming is decided. However, I don't agree with the idea behind the numbering. If we put the number at the end of the last extension, it could confuse users and systems about these extensions. For example, a user could consider OSt2 and OSt3 to be two different extensions although they denote the same kind of file. My suggestion is to put the number after the language extension(s). This preserves the readability of the file names and doesn't add much complexity to the implementation (as far as I know). In other words, if there is more than one file for a specific name, we put the number one block before the main (last) extension. For example, test-file.cs1.OSt, test-file.cs2.OSt, or test-file.en.cs1.slt, test-file.en.cs2.slt.
The last point: do we need another dot to make this separation clearer (e.g., test-file.cs.1.OSt)? My answer is no. I guess Mohammad also prefers the first one (i.e., test-file.cs1.OSt), as he thinks more dots could be a bit confusing and not pretty. What do you think, @obo?
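To make the difference between the two numbering placements concrete, here is a minimal Python sketch that parses file names under both schemes. The regular expressions, the `parse_name` helper, and the assumption that language tags are two lowercase letters are illustrative only and are not taken from SLTev.

```python
import os
import re

# Assumption: language tags are two lowercase letters, one or two per name.
LANGS = r"(?P<langs>[a-z]{2}(?:\.[a-z]{2})?)"
# File kinds listed in the naming summary; OStt must be tried before OSt.
KINDS = r"(?P<kind>OStt|OSt|align|asr|slt|mt)"

# Number at the end of the last extension, e.g. test-file.cs.OSt2
NUMBER_LAST = re.compile(rf"^(?P<stem>.+?)\.{LANGS}\.{KINDS}(?P<num>\d*)$")
# Number after the language block, e.g. test-file.cs2.OSt
NUMBER_AFTER_LANG = re.compile(rf"^(?P<stem>.+?)\.{LANGS}(?P<num>\d*)\.{KINDS}$")


def parse_name(name, pattern):
    """Split a file name into stem, languages, kind and optional number."""
    match = pattern.match(name)
    if match is None:
        return None
    num = match.group("num")
    return {
        "stem": match.group("stem"),
        "langs": match.group("langs").split("."),
        "kind": match.group("kind"),
        "num": int(num) if num else None,
    }


if __name__ == "__main__":
    print(parse_name("test-file.en.cs.slt2", NUMBER_LAST))
    print(parse_name("test-file.en.cs2.slt", NUMBER_AFTER_LANG))
    # A generic tool that only looks at the last extension sees a different
    # extension per numbered file in the first scheme, but not in the second.
    print(os.path.splitext("test-file.cs.OSt2")[1])
    print(os.path.splitext("test-file.cs2.OSt")[1])
```

Both schemes are equally easy to parse with a dedicated pattern; the practical difference is what tools that rely on the plain last extension will see.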
Hello, I'm not sure if this is the correct issue or what the big picture is, but I would like to propose simply adding command line parameters to customize these extensions, as we have some indexes with different extensions and it is very impractical to rename the files each time. Could this work, at least as a temporary solution? @Gldkslfmsd @obo
SLTev
--override-source-ext 'en.OSt'
--override-target-ext '.TTcs[\d]+'
Alternatively, I'm generating symbolic links with a script at the moment.
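As a rough illustration of how such overrides could select files, here is a minimal Python sketch. The two option names come only from the proposal above and are not known SLTev options; `select_by_suffix` is a hypothetical helper, and treating each value as a regular expression anchored at the end of the file name is an assumption.

```python
import re


def select_by_suffix(names, suffix_pattern):
    """Return the names whose end matches the given regular expression."""
    # Assumption: the override value is a regex matched at the end of the name.
    suffix = re.compile(r"(?:%s)$" % suffix_pattern)
    return [name for name in names if suffix.search(name)]


if __name__ == "__main__":
    files = ["talk.en.OSt", "talk.cs.OSt", "talk.TTcs1", "talk.TTcs12"]
    # Values taken from the proposed command line above.
    print(select_by_suffix(files, r"en.OSt"))
    print(select_by_suffix(files, r".TTcs[\d]+"))
```

Note that an unescaped dot in such a pattern matches any character, so a real implementation would need to decide whether the values are regular expressions or literal suffixes.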
Could this work, at least as a temporary solution?
Not for me. I vote for as much flexibility as possible. I don't even like the restriction that reference and candidate must have the same prefix. I might have document.ref, document.kit-asr-s2s.txt vs document.kit-asr-hybrid.txt, and don't want to link the reference twice. And I don't need the language tags in the names because I work with only one language pair anyway. See #47.
Alternatively, I'm generating symbolic links with a script at the moment.
My impression is that SLTev should be useful without any wrapper script, unless you have really unique requirements. The work that the wrapper does should be implemented inside SLTev.
Let's revisit this discussion. The big picture is this:
@Gldkslfmsd: Please confirm that the current no-elitr-testset usage is OK for you.
For the repurposing of documents in a given elitr-testset index, I think my proposal is so far the only one that addresses the goal. Please re-read my very top comment in this thread. Mohammad's summary of the suffixes and what Ebrahim added for individual reference translations are all OK but do not address repurposing. What Matus proposed seems nice but would require users of elitr-testset to know about the repurposing, which is a complication.
So I tend to conclude:
Considering the issue below:
https://github.com/ELITR/SLTev/issues/57#issue-824980509
we need another extension for the reference translation file.
I'm implementing functionality in the Pipeliner to process files with a given pipeline, so that the pipeline and its components can be evaluated automatically. Naturally, SLTev will be used for the actual evaluation. The files to be evaluated will be defined by an index file, but there has to be a way to specify which files from the index are the source files and which are the reference files, and relying on the extensions alone is very fragile.
After discussing this issue with Ondrej, we came to this conclusion:
source and reference suffixes, as outlined in Ondrej's comment. An index can contain multiple such sections (!).
(source, ref) pairs.
So just to reiterate, here's how the index file is probably going to look (the syntax can change):
# SRC: *.en.OSt
# REF: *.cs.OSt
wmt18-newstest-sample-read
# SRC: *.en
# REF: *.cs
different-directory-with-different-suffixes
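A minimal Python sketch of how an index in this shape could be read is shown below. It is not the actual index_parser.py; the function name `parse_index` and the returned tuple layout are assumptions made for illustration.

```python
def parse_index(lines):
    """Pair every index entry with the SRC/REF suffix patterns in effect."""
    current = {}   # most recently seen specifier per key
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].partition(":")
            if sep and key.strip() in ("SRC", "REF"):
                current[key.strip()] = value.strip()
            continue  # other comment lines are ignored
        entries.append((line, current.get("SRC"), current.get("REF")))
    return entries


if __name__ == "__main__":
    index = [
        "# SRC: *.en.OSt",
        "# REF: *.cs.OSt",
        "wmt18-newstest-sample-read",
        "# SRC: *.en",
        "# REF: *.cs",
        "different-directory-with-different-suffixes",
    ]
    for entry in parse_index(index):
        print(entry)
```

In this sketch a specifier stays in effect until a later line with the same key replaces it, so a section that sets only SRC would silently inherit the previous REF; a stricter parser might reset both at the start of each section.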
Hi @srdecny, sorry for the delayed response.
This method is nice, but there are some challenges that should be solved:
1- To evaluate SLT files, we need .OStt and .align files in addition to the source and reference files, so we should specify them in the index file and index_parser should recognize them.
2- We need to know the source and target languages to match candidate and input (src, ref, OStt, ...) files, i.e., to assign the input files to the candidate file(s) (system output).
3- We need a policy for using user files next to indexed files (all of them in one folder).
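The first two points can be illustrated with a small Python sketch that derives the input files an SLT candidate would need from its name. The mapping is an assumption based on the naming templates discussed in this thread (in particular, that the OStt file is in the source language); `expected_inputs` and `missing_inputs` are hypothetical helpers, not SLTev functions.

```python
def expected_inputs(candidate):
    """Derive input file names for a candidate like test-file.en.cs.slt."""
    parts = candidate.split(".")
    if len(parts) < 4:
        raise ValueError("expected <file-name>.<src>.<tgt>.<kind>: %r" % candidate)
    stem = ".".join(parts[:-3])
    src, tgt = parts[-3], parts[-2]
    return {
        "source": f"{stem}.{src}.OSt",
        "reference": f"{stem}.{tgt}.OSt",
        "ostt": f"{stem}.{src}.OStt",          # assumption: source language
        "align": f"{stem}.{src}.{tgt}.align",
    }


def missing_inputs(candidate, available):
    """List the expected input files that are not among the available ones."""
    present = set(available)
    return sorted(n for n in expected_inputs(candidate).values() if n not in present)


if __name__ == "__main__":
    have = ["test-file.en.OSt", "test-file.cs.OSt"]
    print(expected_inputs("test-file.en.cs.slt"))
    print(missing_inputs("test-file.en.cs.slt", have))
```

A check like this only works when the languages are part of the file names, which is why point 2 matters for indices that drop the language tags.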
Hi @mohammad2928, thanks for the suggestions.
1 - There are two options we can take.
First one, to make it more general, perhaps allow any meta-annotations, such as # ALIGNMENT *.align, etc. The parser will then have to return a more complicated data structure (instead of the <SOURCE, REF> tuple as it does now), because there can be many more additional meta-files needed for the evaluation. I'm leaning towards JSON, that one should be easy to pass around.
Second option is to assume the extension of these additional files won't change (and only the extension of the source and reference files does), and the parser would simply report a comma-separated line, like so: source, reference, ostt file, alignment file, etc., assuming the additional files have the same name (sans the extension) as the source and reference files. This will be slightly easier to implement and parse, but at the loss of flexibility when we need to introduce another type of file for evaluation.
I'm leaning slightly towards the first option, but it's certainly possible the latter option will do just fine for us without making things complicated. @obo , what do you think about this?
2 - I believe this issue could be solved by the more verbose annotations above, is that correct?
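To illustrate the first option, here is a minimal Python sketch of a parser that accepts arbitrary annotation keys and reports JSON. It is not the implementation in index_parser.py; the function name, the accepted annotation syntax (an upper-case key with an optional colon) and the JSON layout are all assumptions.

```python
import json
import re

# Assumption: an annotation is "# KEY: pattern" or "# KEY pattern".
ANNOTATION = re.compile(r"^#\s*([A-Z_]+):?\s+(\S+)\s*$")


def parse_index_to_json(lines):
    """Attach all annotations currently in effect to each index entry."""
    spec = {}
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ANNOTATION.match(line)
        if match:
            spec[match.group(1)] = match.group(2)
            continue
        if line.startswith("#"):
            continue  # ordinary comment, not an annotation
        entries.append({"entry": line, "suffixes": dict(spec)})
    return json.dumps(entries, indent=2)


if __name__ == "__main__":
    index = [
        "# SRC: *.en.OSt",
        "# REF: *.cs.OSt",
        "# ALIGNMENT *.align",
        "wmt18-newstest-sample-read",
    ]
    print(parse_index_to_json(index))
```

Because the keys are not hard-coded, a new file type only needs a new annotation line in the index, which is the flexibility the first option is meant to provide.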
Hi,
I definitely vote for option 1, adding additional flavors of the document:
ALIGNMENT: ...
TERMS_DICT: ...
Thanks, O.
----- Original Message -----
From: "Vojtěch Srdečný"
To: "ELITR"
Cc: "Ondrej Bojar", "Mention"
Sent: Tuesday, 23 March, 2021 11:12:42
Subject: Re: [ELITR/SLTev] suffix specifiers in indices (#19)
-- Ondrej Bojar http://www.cuni.cz/~obo
I've implemented option 1. See the README and index_parser.py for details.
Ondrej asks Ebrahim to propose several solutions to the problem of indicating which files are the input and which are the reference.
The exact problem is this:
We need an approach to indicate which is the source and which is the reference document so that it works across all indices and all documents.
Imagine that you want an index for en2cs MT and you want it to include:
My proposal: Make the index interleaved with ‘suffix pair specifiers’:
Whatever follows the line with a ‘suffix pair specifier’ (SRC->REF) will be interpreted according to the suffix pair specifier.
If we want more file types to be considered (quite likely), we could have more suffix specifiers, not just a pair: