Open johann-petrak opened 7 months ago
The dump is supposed to contain only the files related to the current state of the competition.
Can you please share the URL to the competition so we can investigate this?
This is the one where I tried this: https://www.codabench.org/competitions/2618/?secret_key=74ac0264-1d23-4148-8f9a-dcdb66e08ab3
I am looking at the bundle, and these are my thoughts.
I don't see duplicate files. The file names are actually different (see screenshot below). I think the file system does not even allow several files with the same name.
Also, you have several scoring programs, several reference data sets, etc. because each is associated with a different phase/task of the competition. This is clearly indicated in the competition.yaml file:
tasks:
- index: 0
  name: ST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6223.zip
  reference_data: reference_data-6223.zip
  scoring_program: scoring_program-6223.zip
- index: 1
  name: ST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6224.zip
  reference_data: reference_data-6224.zip
  scoring_program: scoring_program-6224.zip
- index: 2
  name: ST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6224.zip
  reference_data: reference_data-6224.zip
  scoring_program: scoring_program-6224.zip
Why do you think the scoring programs come from a different project? By investigating it, everything looks fine to me.
I am not sure which file you downloaded, or what you see as the user you were logged in as.
I see the following list (the last one I created via the "Dump with files" option):
Sadly the name in the list does not indicate which option was used to create it, and the name in the list is not used as the name when downloading the file. When downloading the file, it gets saved as "GermEval_GERMS-DETECT1-Closed-2024-.zip"
Here is the content of this zip file:
$ unzip -l GermEval_GERMS-DETECT1-Closed-2024-.zip
Archive:  GermEval_GERMS-DETECT1-Closed-2024-.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     3458  2024-04-11 09:48   logo.png
      265  2024-04-11 09:48   terms.md
      542  2024-04-11 09:48   participation-12250.md
       22  2024-04-11 09:48   input_data-6223.zip
    40827  2024-04-11 09:48   reference_data-6223.zip
     6463  2024-04-11 09:48   scoring_program-6223.zip
       22  2024-04-11 09:48   input_data-6224.zip
      192  2024-04-11 09:48   reference_data-6224.zip
     6463  2024-04-11 09:48   scoring_program-6224.zip
       22  2024-04-11 09:48   input_data-6224.zip
      192  2024-04-11 09:48   reference_data-6224.zip
     6463  2024-04-11 09:48   scoring_program-6224.zip
     2619  2024-04-11 09:48   competition.yaml
---------                     -------
    67550                     13 files
As you can see there are two files called "reference_data-6224.zip" and other files which are present in the zip archive multiple times and lead to prompts when unzipping:
The part of the competition.yaml file you quoted actually shows that the file name reference_data-6224.zip is used for two different tasks and these are included as two different files with the same name in the zip file. Note that originally, the competition.yaml file does NOT specify a zip file but a path, and the paths for task with index 1 and task with index 2 are different:
- index: 1
  name: ST${subtask}-${track}
  is_public: true
  description: Subtask ${subtask}, ${track} Track
  reference_data: dev_phase/reference_data/
  scoring_program: scoring_program/
  input_data: input_data/
  # ingestion_program: ingestion_program/
- index: 2
  name: ST${subtask}-${track}
  is_public: true
  description: Subtask ${subtask}, ${track} Track
  reference_data: comp_phase/reference_data/
  scoring_program: scoring_program/
  input_data: input_data/
  # ingestion_program: ingestion_program/
My belief that the file came from a different project was my mistake: in that version of the bundle I had copy/pasted the wrong file into the reference data directory.
So there is definitely a problem: when "Dump with files" is used, files with identical names are used to archive different data from different subdirectories.
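To check a downloaded dump for this without relying on unzip prompts, a minimal Python sketch along the following lines could be used. It only uses the standard `zipfile` module; the helper names `duplicate_names` and `build_demo_archive` and the demo contents are made up for illustration and are not part of Codabench.

```python
import io
import warnings
import zipfile
from collections import Counter


def duplicate_names(archive):
    """Return the entry names that occur more than once in a zip archive.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        counts = Counter(zf.namelist())
    return sorted(name for name, n in counts.items() if n > 1)


def build_demo_archive():
    """Build a small in-memory archive with one repeated name (illustrative data only)."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about duplicate names but still writes both entries.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("reference_data-6223.zip", b"trial")
            zf.writestr("reference_data-6224.zip", b"dev")
            zf.writestr("reference_data-6224.zip", b"comp")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(duplicate_names(build_demo_archive()))
```

For a real dump, `duplicate_names("GermEval_GERMS-DETECT1-Closed-2024-.zip")` would list every name that appears more than once in the archive.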
With the unzip command you are using, I see reference_data-6224 two times, but when you check the folder you can see just one file.
If a file with the same name already exists, normally it is overwritten.
@ihsaan-ullah I do not understand what you mean: clearly the zip archive I downloaded contains duplicate files. Most zip tools will prompt before overwriting a file with the same name when unzipping such an archive; if there were no prompt, this would be even worse! But the main point here is that the files with identical names refer to different files in the original bundle and should definitely not have the same name. Why would anyone deliberately place two files with identical file names into a zip archive? It does not make any sense to me.
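As a workaround on the user side, both copies can be recovered by extracting entry by entry and renaming repeated names instead of overwriting them. The following is only a sketch under assumptions: it assumes a flat archive (as in the listing above), the `.dupN` suffix scheme and the function names are invented for this example, and it does not guard against an archive that already contains a file named like the generated suffix.

```python
import io
import tempfile
import warnings
import zipfile
from collections import Counter
from pathlib import Path


def extract_without_overwrite(archive, dest):
    """Extract every entry, renaming repeated names instead of overwriting.

    The second entry named "a.zip" is written as "a.dup1.zip", the third as
    "a.dup2.zip", and so on. Returns the list of paths written.
    """
    dest = Path(dest)
    seen = Counter()
    written = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Keep only the base name: this sketch assumes a flat archive.
            name = Path(info.filename).name
            n = seen[name]
            seen[name] += 1
            stem, suffix = Path(name).stem, Path(name).suffix
            target = dest / (name if n == 0 else f"{stem}.dup{n}{suffix}")
            # Passing the ZipInfo object (not the name) reads this exact entry.
            target.write_bytes(zf.read(info))
            written.append(target)
    return written


def demo():
    """Build an archive with a repeated name and extract both copies."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("reference_data-6224.zip", b"dev phase data")
            zf.writestr("reference_data-6224.zip", b"comp phase data")
    buf.seek(0)
    with tempfile.TemporaryDirectory() as tmp:
        return [(p.name, p.read_bytes()) for p in extract_without_overwrite(buf, tmp)]
```

This does not fix the naming in the dump itself; it only makes it possible to inspect both copies side by side.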
I think now I see a problem.
To make things more clear. What do you expect?
From the competition, I see you have 3 phases, but they share the same task, so there should be one set of {ingestion, scoring, input, reference}. But in your yaml file from the dump, I see 3 tasks (wrong) and 3 phases (right, but one phase uses task 0 and another uses task 2; task 1 is not used).
When I unzip both reference data archives (6223 and 6224), they have different content but the same file name. This is also weird because the task is the same for all the phases.
"When I unzip both reference data archives (6223 and 6224), they have different content but the same file name. This is also weird because the task is the same for all the phases."
This is not weird since in the bundle I originally uploaded, the data comes from different subdirectories in the different phases. The original competition.yaml file uploaded to create this project contained something like this:
phases:
- name: Setup Phase
  description: 'Internal phase for developing and debugging the competition code'
  start: 2024-03-01
  end: 2024-04-11
  max_submissions_per_day: 1000
  max_submissions: 1000
  execution_time_limit: 60
  solutions: []
  tasks:
  - 0
- name: Trial Phase
  description: 'Trial phase: Try everything out with a preliminary small training and test set'
  start: 2024-04-12
  end: 2024-04-30
  max_submissions_per_day: 20
  max_submissions: 20
  execution_time_limit: 60
  solutions: []
  # public_data: public_data
  # starting_kit: starting_kit
  tasks:
  - 0
- name: Development Phase
  description: 'Development phase: Try everything out, train on training data, evaluate on dev data.'
  start: 2024-05-01
  end: 2024-06-06
  max_submissions_per_day: 1000
  max_submissions: 1000
  execution_time_limit: 60
  solutions: []
  # public_data: public_data
  # starting_kit: starting_kit
  tasks:
  - 1
So it may be important to state that the zip files included here are not MY zip files but something the system created after I uploaded the bundle with a completely different directory structure.
"To make things more clear. What do you expect?"
First off, I would like to understand what the purpose of the different dumps is: is this documented anywhere? What I find strange is that what I get in these files is or should be basically the same as what I get from the competition bundle file, but in a different format?
I would have expected to get all the submission data, score files, etc. in these files as well, but apparently this is not the case.
By the way, where do I get, as an admin, all the submissions to my competition and the corresponding scores and metadata?
The dump is created from the current state of the competition (without submissions and their scores).
To get the submissions, there is another tab next to the dumps tab where you can see all the submissions and their outputs.
OK, but in that tab I still cannot see how I can download ALL submission data ...
I seem to be able to download either just a csv with the LIST of the submissions or each submission individually. What I would need is to load ALL submission data for a phase at the same time because how else can I find out the proper ranking of submissions? Note that the leaderboard is not suitable for doing this as it may contain invalid submissions and it has no way to properly rank according to statistical significance or to give the same rank to equivalent results. So I absolutely NEED the complete set of submissions that were created during the evaluation phase for download and on-site processing.
We don't have an option to download all phase submissions at the moment.
Oh, this is really terrible. How are competition organizers supposed to do this when there are hundreds of submissions? Is this possible in CodaLab? Is it possible to use the bundle I have created so far on CodaLab? We are really panicking right now about the limitations we have found, which make it hard to imagine how to run the competition using Codabench.
But to get back to the original topic: the issue here seems to be that the task index numbers in the dump with files are also different from the index numbers in the original competition.yaml file.
I created a new competition https://www.codabench.org/competitions/2655/?secret_key=f0d02577-4921-46f7-9313-c7706d15ff87 which has the following original competition.yaml tasks definition:
tasks:
- index: 0
  name: NEW02PTST1-Closed
  is_public: true
  description: Subtask 1, Closed Track
  reference_data: trial_phase/reference_data/
  scoring_program: scoring_program/
  input_data: input_data/
  # ingestion_program: ingestion_program/
- index: 1
  name: NEW02PDST1-Closed
  is_public: true
  description: Subtask 1, Closed Track
  reference_data: dev_phase/reference_data/
  scoring_program: scoring_program/
  input_data: input_data/
  # ingestion_program: ingestion_program/
- index: 2
  name: NEW02PCST1-Closed
  is_public: true
  description: Subtask 1, Closed Track
  reference_data: comp_phase/reference_data/
  scoring_program: scoring_program/
  input_data: input_data/
  # ingestion_program: ingestion_program/
As you can see, there are three tasks which share the scoring program and input data path but use different reference data directories (at the moment the files in those directories are identical, but in the real competition they will be different). Then each of the 5 phases uses one of the 3 tasks:
phases:
- name: Setup Phase
  description: 'Internal phase for developing and debugging the competition code'
  start: 2024-03-01
  end: 2024-04-12
  max_submissions_per_day: 1000
  max_submissions: 1000
  execution_time_limit: 60
  solutions: []
  tasks:
  - 0
- name: Trial Phase
  description: 'Trial phase: Try everything out with a preliminary small training and test set'
  start: 2024-04-12
  end: 2024-04-30
  max_submissions_per_day: 20
  max_submissions: 20
  execution_time_limit: 60
  solutions: []
  # public_data: public_data
  # starting_kit: starting_kit
  tasks:
  - 0
- name: Development Phase
  description: 'Development phase: Try everything out, train on training data, evaluate on dev data.'
  start: 2024-05-01
  end: 2024-06-06
  max_submissions_per_day: 1000
  max_submissions: 1000
  execution_time_limit: 60
  solutions: []
  # public_data: public_data
  # starting_kit: starting_kit
  tasks:
  - 1
- name: Competition Phase
  description: 'Competition phase: Train on training+dev data, submit your best predictions on the test set.'
  start: 2024-06-07
  end: 2024-06-25
  max_submissions_per_day: 3
  max_submissions: 100
  execution_time_limit: 60
  solutions: []
  # public_data: public_data
  # starting_kit: starting_kit
  tasks:
  - 2
- name: Post Competition Phase
  description: 'Post Competition phase: evaluate continually improving models.'
  start: 2024-06-26
  max_submissions_per_day: 3
  max_submissions: 200
  execution_time_limit: 60
  solutions: []
  # public_data: public_data
  # starting_kit: starting_kit
  tasks:
  - 2
I do not think there is anything wrong with this so far.
From this, the dump with files program generates the following tasks and phases section:
tasks:
- index: 0
  name: NEW02PTST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6318.zip
  reference_data: reference_data-6318.zip
  scoring_program: scoring_program-6318.zip
- index: 1
  name: NEW02PTST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6318.zip
  reference_data: reference_data-6318.zip
  scoring_program: scoring_program-6318.zip
- index: 2
  name: NEW02PDST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6319.zip
  reference_data: reference_data-6319.zip
  scoring_program: scoring_program-6319.zip
- index: 3
  name: NEW02PCST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6320.zip
  reference_data: reference_data-6320.zip
  scoring_program: scoring_program-6320.zip
- index: 4
  name: NEW02PCST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6320.zip
  reference_data: reference_data-6320.zip
  scoring_program: scoring_program-6320.zip
solutions: []
phases:
- index: 0
  name: Setup Phase
  description: Internal phase for developing and debugging the competition code
  start: '2024-03-01'
  end: '2024-04-11'
  max_submissions_per_day: 1000
  max_submissions: 1000
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 1
  solutions: []
- index: 1
  name: Trial Phase
  description: 'Trial phase: Try everything out with a preliminary small training
    and test set'
  start: '2024-04-12'
  end: '2024-04-30'
  max_submissions_per_day: 20
  max_submissions: 20
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 1
  solutions: []
- index: 2
  name: Development Phase
  description: 'Development phase: Try everything out, train on training data, evaluate
    on dev data.'
  start: '2024-05-01'
  end: '2024-06-06'
  max_submissions_per_day: 1000
  max_submissions: 1000
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 2
  solutions: []
- index: 3
  name: Competition Phase
  description: 'Competition phase: Train on training+dev data, submit your best predictions
    on the test set.'
  start: '2024-06-07'
  end: '2024-06-25'
  max_submissions_per_day: 3
  max_submissions: 100
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 4
  solutions: []
- index: 4
  name: Post Competition Phase
  description: 'Post Competition phase: evaluate continually improving models.'
  start: '2024-06-26'
  max_submissions_per_day: 3
  max_submissions: 200
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 4
  solutions: []
Suddenly there are 5 tasks, but the phases use only 3 of them: the tasks with index 0 and 3 are not used at all! And the zip files for some of the tasks have identical names: this seems to happen for pairs of tasks where the index of one is not used but the index of the other is. I do not understand why the task indices change, nor why files with identical names are placed in the dump file, but I am sure that this is not how it should work.
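The two observations can be checked mechanically. The sketch below is only an illustration: the task and phase data are transcribed by hand from the dumped competition.yaml quoted above (reduced to the fields needed), and the helper names `unused_task_indices` and `tasks_sharing_file` are made up for this example.

```python
from collections import defaultdict

# Transcribed from the dumped competition.yaml (only the relevant fields).
TASKS = [
    {"index": 0, "reference_data": "reference_data-6318.zip"},
    {"index": 1, "reference_data": "reference_data-6318.zip"},
    {"index": 2, "reference_data": "reference_data-6319.zip"},
    {"index": 3, "reference_data": "reference_data-6320.zip"},
    {"index": 4, "reference_data": "reference_data-6320.zip"},
]
PHASES = [
    {"name": "Setup Phase", "tasks": [1]},
    {"name": "Trial Phase", "tasks": [1]},
    {"name": "Development Phase", "tasks": [2]},
    {"name": "Competition Phase", "tasks": [4]},
    {"name": "Post Competition Phase", "tasks": [4]},
]


def unused_task_indices(tasks, phases):
    """Return the indices of tasks that no phase refers to."""
    used = {i for phase in phases for i in phase["tasks"]}
    return sorted(t["index"] for t in tasks if t["index"] not in used)


def tasks_sharing_file(tasks, key):
    """Map each file name used by more than one task to those task indices."""
    by_file = defaultdict(list)
    for task in tasks:
        by_file[task[key]].append(task["index"])
    return {name: idx for name, idx in by_file.items() if len(idx) > 1}


if __name__ == "__main__":
    print(unused_task_indices(TASKS, PHASES))
    print(tasks_sharing_file(TASKS, "reference_data"))
```

On this data the unused indices are 0 and 3, and each shared file name pairs one unused task with one used task (0 with 1, 3 with 4), which matches the pattern described above.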
You should be able to verify all this yourself by downloading either the competition bundle or the "Dump with files" dump.
I have created a competition by uploading a bundle file with input data and scoring programs, and there are already test submissions as well.
When I then try to download using the "Dump with files" option I see the following oddities/problems:
What is going on here, should the dump not be restricted to those data/program files which are actually part of the project?
And if I upload a bundle that contains e.g. a target file or scorer and then remove the project, should those files and scorers not get removed with the project they belong to?
UPDATE: to avoid confusion: after further investigation, the only problem here is that the chosen file names are duplicates, not the content; that was my mistake.