codalab / codabench

Codabench is a flexible, easy-to-use and reproducible benchmarking platform. Check our paper at Patterns Cell Press https://hubs.li/Q01fwRWB0
Apache License 2.0

The "Dump with files" option produces duplicate files #1393

Open johann-petrak opened 7 months ago

johann-petrak commented 7 months ago

I have created a competition, with input data and scoring programs, by uploading a bundle file, and there are already some test submissions.

When I then try to download using the "Dump with files" option I see the following oddities/problems:

What is going on here, should the dump not be restricted to those data/program files which are actually part of the project?

And if I upload a bundle that contains e.g. a target file or scorer and then remove the project, should those files and scorers not get removed with the project they belong to?

UPDATE: to avoid confusion, after further investigation, the problem here is only that the chosen file names are duplicates, not the content; that part was my mistake.

Didayolo commented 6 months ago

The dump is supposed to contain only the files related to the current state of the competition.

Can you please share the URL to the competition so we can investigate this?

johann-petrak commented 6 months ago

This is the one where I tried this: https://www.codabench.org/competitions/2618/?secret_key=74ac0264-1d23-4148-8f9a-dcdb66e08ab3

Didayolo commented 6 months ago

I am looking at the bundle and these are my thoughts.

Duplicate files

I don't see duplicate files. The file names are actually different (see screenshot below). I think the file system does not even allow several files with the same name.

Screenshot 2024-04-10 at 14 35 49

Also, you have several scoring programs, several reference data files, etc. because each is associated with a different phase/task of the competition. This is clearly indicated in the competition.yaml file:

tasks:
- index: 0
  name: ST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6223.zip
  reference_data: reference_data-6223.zip
  scoring_program: scoring_program-6223.zip
- index: 1
  name: ST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6224.zip
  reference_data: reference_data-6224.zip
  scoring_program: scoring_program-6224.zip
- index: 2
  name: ST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6224.zip
  reference_data: reference_data-6224.zip
  scoring_program: scoring_program-6224.zip

Files from different project

Why do you think the scoring programs come from a different project? By investigating it, everything looks fine to me.

johann-petrak commented 6 months ago

I am not sure which file you have downloaded or which you are seeing as the user you used.

I see the following list (the last one I created via the "Dump with files" option): dump01

Sadly the name in the list does not indicate which option was used to create it, and the name in the list is not used as the name when downloading the file. When downloading the file, it gets saved as "GermEval_GERMS-DETECT1-Closed-2024-.zip"

Here is the content of this zip file:

$ unzip -l GermEval_GERMS-DETECT1-Closed-2024-.zip 
Archive:  GermEval_GERMS-DETECT1-Closed-2024-.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     3458  2024-04-11 09:48   logo.png
      265  2024-04-11 09:48   terms.md
      542  2024-04-11 09:48   participation-12250.md
       22  2024-04-11 09:48   input_data-6223.zip
    40827  2024-04-11 09:48   reference_data-6223.zip
     6463  2024-04-11 09:48   scoring_program-6223.zip
       22  2024-04-11 09:48   input_data-6224.zip
      192  2024-04-11 09:48   reference_data-6224.zip
     6463  2024-04-11 09:48   scoring_program-6224.zip
       22  2024-04-11 09:48   input_data-6224.zip
      192  2024-04-11 09:48   reference_data-6224.zip
     6463  2024-04-11 09:48   scoring_program-6224.zip
     2619  2024-04-11 09:48   competition.yaml
---------                     -------
    67550                     13 files

As you can see there are two files called "reference_data-6224.zip" and other files which are present in the zip archive multiple times and lead to prompts when unzipping:

dump02
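
The duplicate member names can also be confirmed without relying on the `unzip` prompts. The following is a minimal Python sketch using only the standard library `zipfile` module. The helper name `duplicate_members` is made up for illustration, and the archive is built in memory with placeholder contents to mimic the listing above; it is not the real dump.

```python
import io
import warnings
import zipfile
from collections import Counter


def duplicate_members(archive) -> dict[str, int]:
    """Return {member name: count} for names stored more than once."""
    with zipfile.ZipFile(archive) as zf:
        # infolist() has one entry per stored member, duplicates included.
        counts = Counter(info.filename for info in zf.infolist())
    return {name: n for name, n in counts.items() if n > 1}


# Build a small in-memory archive that mimics the listing above.
# The contents are placeholders, not the real competition data.
buf = io.BytesIO()
with warnings.catch_warnings():
    # zipfile emits a UserWarning when a name is written twice.
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("reference_data-6223.zip", b"a")
        zf.writestr("reference_data-6224.zip", b"b")
        zf.writestr("reference_data-6224.zip", b"c")

buf.seek(0)
print(duplicate_members(buf))
```

To check a downloaded dump, a file path can be passed to `duplicate_members` instead of the in-memory buffer.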

The part of the competition.yaml file you quoted actually shows that the file name reference_data-6224.zip is used for two different tasks and these are included as two different files with the same name in the zip file. Note that originally, the competition.yaml file does NOT specify a zip file but a path, and the paths for task with index 1 and task with index 2 are different:

  - index: 1
    name: ST${subtask}-${track}
    is_public: true
    description: Subtask ${subtask}, ${track} Track
    reference_data: dev_phase/reference_data/
    scoring_program: scoring_program/
    input_data: input_data/
    # ingestion_program: ingestion_program/
  - index: 2
    name: ST${subtask}-${track}
    is_public: true
    description: Subtask ${subtask}, ${track} Track
    reference_data: comp_phase/reference_data/
    scoring_program: scoring_program/
    input_data: input_data/
    # ingestion_program: ingestion_program/

The confusion where I believed that the file comes from a different project was my mistake because in that version of the bundle I copy/pasted the wrong file into the reference data directory.

So there is definitely the problem that when dump with files is used, files with identical names are used to archive different data from different subdirectories.

ihsaan-ullah commented 6 months ago

With the unzip command you are using, I see reference_data-6224 two times

Screenshot 2024-04-11 at 4 21 11 PM

but when you check the folder you can see just one file

Screenshot 2024-04-11 at 4 21 55 PM

If a file with the same name already exists, normally it is overwritten.
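
Both observations are compatible: the zip format can store several members under one name, while a directory cannot, so extraction leaves only one file. A small sketch with Python's standard `zipfile` module illustrates this, using placeholder contents and a temporary directory (both are assumptions for the demonstration). Other extraction tools may prompt instead of overwriting silently.

```python
import io
import os
import tempfile
import warnings
import zipfile

buf = io.BytesIO()
with warnings.catch_warnings():
    # zipfile emits a UserWarning when a name is written twice.
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("reference_data-6224.zip", b"first")
        zf.writestr("reference_data-6224.zip", b"second")

buf.seek(0)
with zipfile.ZipFile(buf) as zf, tempfile.TemporaryDirectory() as tmp:
    stored = len(zf.infolist())      # members stored in the archive
    zf.extractall(tmp)
    extracted = os.listdir(tmp)      # files that exist after extraction
    # Every stored member can still be read individually via its ZipInfo.
    payloads = [zf.read(info) for info in zf.infolist()]

print(stored, extracted)
```

The archive holds two members, but only one file survives on disk, so one of the two payloads is lost unless the members are read individually.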

johann-petrak commented 6 months ago

@ihsaan-ullah I do not understand what you mean: clearly the zip archive I downloaded contains duplicate files. Most versions of unzip will prompt before overwriting a file with the same name when extracting such an archive; if there is no prompt, this would be even worse! But the main thing here is that the files with identical names refer to different files in the original bundle and should definitely not have the same name! Why would anyone deliberately place two files with identical filenames into a zip archive? It does not make any sense to me.

ihsaan-ullah commented 6 months ago

I think now I see a problem.

To make things more clear. What do you expect?

From the competition, I see you have 3 phases but they share the same task, so there should be one set of {ingestion, scoring, input, reference}. But in your yaml file from the dump, I see 3 tasks (wrong) and 3 phases (right, but one phase uses task 0 and another uses task 2; task 1 is not used).

When I unzip both reference data (6223 and 6224) they both have different content but the file name is same, this is also weird because the task is same for all the phases

johann-petrak commented 6 months ago

When I unzip both reference data (6223 and 6224) they both have different content but the file name is same, this is also weird because the task is same for all the phases

This is not weird since in the bundle I originally uploaded, the data comes from different subdirectories in the different phases. The original competition.yaml file uploaded to create this project contained something like this:

phases:
  - name: Setup Phase
    description: 'Internal phase for developing and debugging the competition code'
    start: 2024-03-01
    end: 2024-04-11
    max_submissions_per_day: 1000
    max_submissions: 1000
    execution_time_limit: 60
    solutions: []
    tasks:
      - 0
  - name: Trial Phase
    description: 'Trial phase: Try everything out with a preliminary small training and test set'
    start: 2024-04-12
    end: 2024-04-30
    max_submissions_per_day: 20
    max_submissions: 20
    execution_time_limit: 60
    solutions: []
    # public_data: public_data
    # starting_kit: starting_kit
    tasks:
      - 0
  - name: Development Phase
    description: 'Development phase: Try everything out, train on training data, evaluate on dev data.'
    start: 2024-05-01
    end: 2024-06-06
    max_submissions_per_day: 1000
    max_submissions: 1000
    execution_time_limit: 60
    solutions: []
    # public_data: public_data
    # starting_kit: starting_kit
    tasks:
      - 1

So it may be important to state that the zip files included here are not MY zip files but something the system has created after I uploaded the bundle with a completely different directory structure.

johann-petrak commented 6 months ago

To make things more clear. What do you expect?

First off, I would like to understand what the purpose of the different dumps is: is this documented anywhere? What I find strange is that what I get in these files is or should be basically the same as what I get from the competition bundle file, but in a different format?

I would have expected that I will get all the submission data, scores files etc as well in these files but apparently this is not the case.

Where do I get, btw, as an admin, all the submissions to my competition and the corresponding scores and metadata?

ihsaan-ullah commented 6 months ago

The dump is created from the current state of the competition (without submissions and their scores).

To get the submissions, there is another tab next to the dumps tab where you can see all the submissions and their outputs.

johann-petrak commented 6 months ago

OK, but in that tab I still cannot see how I can download ALL submission data ...

I seem to be able to download either just a csv with the LIST of the submissions or each submission individually. What I would need is to load ALL submission data for a phase at the same time because how else can I find out the proper ranking of submissions? Note that the leaderboard is not suitable for doing this as it may contain invalid submissions and it has no way to properly rank according to statistical significance or to give the same rank to equivalent results. So I absolutely NEED the complete set of submissions that were created during the evaluation phase for download and on-site processing.
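
As an illustration of the kind of off-site processing meant here, the following Python sketch assigns a shared rank to submissions whose scores are closer than a tolerance. The function name, the tolerance, and the sample scores are all hypothetical, and the fixed tolerance is only a stand-in for a proper significance test. It assumes the per-submission scores have already been exported somehow.

```python
def rank_with_ties(scores: dict[str, float], tol: float = 0.0) -> dict[str, int]:
    """Rank submissions by descending score; scores within `tol` of the
    best score in the current group share that group's rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks: dict[str, int] = {}
    group_rank = 0
    group_top = None
    for position, (name, score) in enumerate(ordered, start=1):
        if group_top is None or group_top - score > tol:
            # Start a new group; its rank is the 1-based position
            # (so ranks look like 1, 1, 3, 4 rather than 1, 1, 2, 3).
            group_rank = position
            group_top = score
        ranks[name] = group_rank
    return ranks


# Hypothetical scores, for illustration only.
example = {"sub_a": 0.91, "sub_b": 0.90, "sub_c": 0.85, "sub_d": 0.70}
print(rank_with_ties(example, tol=0.02))
```

Invalid submissions would have to be filtered out of the input before ranking, which is exactly why the complete set of submissions is needed.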

ihsaan-ullah commented 6 months ago

We don't have an option to download all phase submissions at the moment

johann-petrak commented 6 months ago

Oh, this is really terrible. How are competition organizers supposed to do this when there are hundreds of submissions? Is this possible in CodaLab? Is it possible to use the bundle I have created so far on CodaLab? We are really panicking right now: the limitations we have found make it hard to imagine how to run the competition using Codabench.

johann-petrak commented 6 months ago

But to get back to the original topic: the issue here seems to be that the task index numbers in the dump with files are also different from the index numbers in the original competition.yaml file.

I created a new competition https://www.codabench.org/competitions/2655/?secret_key=f0d02577-4921-46f7-9313-c7706d15ff87 which has the following original competition.yaml tasks definition:

tasks:
  - index: 0
    name: NEW02PTST1-Closed
    is_public: true
    description: Subtask 1, Closed Track
    reference_data: trial_phase/reference_data/
    scoring_program: scoring_program/
    input_data: input_data/
    # ingestion_program: ingestion_program/
  - index: 1
    name: NEW02PDST1-Closed
    is_public: true
    description: Subtask 1, Closed Track
    reference_data: dev_phase/reference_data/
    scoring_program: scoring_program/
    input_data: input_data/
    # ingestion_program: ingestion_program/
  - index: 2
    name: NEW02PCST1-Closed
    is_public: true
    description: Subtask 1, Closed Track
    reference_data: comp_phase/reference_data/
    scoring_program: scoring_program/
    input_data: input_data/
    # ingestion_program: ingestion_program/

As you can see there are three tasks which share the scoring program and input data path, but use different reference data directories (at the moment the files in those directories are identical but with the real competition they will be different). Then each of the 5 phases uses one of the 3 tasks:

phases:
  - name: Setup Phase
    description: 'Internal phase for developing and debugging the competition code'
    start: 2024-03-01
    end: 2024-04-12
    max_submissions_per_day: 1000
    max_submissions: 1000
    execution_time_limit: 60
    solutions: []
    tasks:
      - 0
  - name: Trial Phase
    description: 'Trial phase: Try everything out with a preliminary small training and test set'
    start: 2024-04-12
    end: 2024-04-30
    max_submissions_per_day: 20
    max_submissions: 20
    execution_time_limit: 60
    solutions: []
    # public_data: public_data
    # starting_kit: starting_kit
    tasks:
      - 0
  - name: Development Phase
    description: 'Development phase: Try everything out, train on training data, evaluate on dev data.'
    start: 2024-05-01
    end: 2024-06-06
    max_submissions_per_day: 1000
    max_submissions: 1000
    execution_time_limit: 60
    solutions: []
    # public_data: public_data
    # starting_kit: starting_kit
    tasks:
      - 1
  - name: Competition Phase
    description: 'Competition phase: Train on training+dev data, submit your best predictions on the test set.'
    start: 2024-06-07
    end: 2024-06-25
    max_submissions_per_day: 3
    max_submissions: 100
    execution_time_limit: 60
    solutions: []
    # public_data: public_data
    # starting_kit: starting_kit
    tasks:
      - 2
  - name: Post Competition Phase
    description: 'Post Competition phase: evaluate continually improving models.'
    start: 2024-06-26
    max_submissions_per_day: 3
    max_submissions: 200
    execution_time_limit: 60
    solutions: []
    # public_data: public_data
    # starting_kit: starting_kit
    tasks:
      - 2

I do not think there is anything wrong with this so far.

From this, the dump with files program generates the following tasks and phases section:

tasks:
- index: 0
  name: NEW02PTST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6318.zip
  reference_data: reference_data-6318.zip
  scoring_program: scoring_program-6318.zip
- index: 1
  name: NEW02PTST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6318.zip
  reference_data: reference_data-6318.zip
  scoring_program: scoring_program-6318.zip
- index: 2
  name: NEW02PDST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6319.zip
  reference_data: reference_data-6319.zip
  scoring_program: scoring_program-6319.zip
- index: 3
  name: NEW02PCST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6320.zip
  reference_data: reference_data-6320.zip
  scoring_program: scoring_program-6320.zip
- index: 4
  name: NEW02PCST1-Closed
  description: Subtask 1, Closed Track
  is_public: false
  input_data: input_data-6320.zip
  reference_data: reference_data-6320.zip
  scoring_program: scoring_program-6320.zip
solutions: []
phases:
- index: 0
  name: Setup Phase
  description: Internal phase for developing and debugging the competition code
  start: '2024-03-01'
  end: '2024-04-11'
  max_submissions_per_day: 1000
  max_submissions: 1000
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 1
  solutions: []
- index: 1
  name: Trial Phase
  description: 'Trial phase: Try everything out with a preliminary small training
    and test set'
  start: '2024-04-12'
  end: '2024-04-30'
  max_submissions_per_day: 20
  max_submissions: 20
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 1
  solutions: []
- index: 2
  name: Development Phase
  description: 'Development phase: Try everything out, train on training data, evaluate
    on dev data.'
  start: '2024-05-01'
  end: '2024-06-06'
  max_submissions_per_day: 1000
  max_submissions: 1000
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 2
  solutions: []
- index: 3
  name: Competition Phase
  description: 'Competition phase: Train on training+dev data, submit your best predictions
    on the test set.'
  start: '2024-06-07'
  end: '2024-06-25'
  max_submissions_per_day: 3
  max_submissions: 100
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 4
  solutions: []
- index: 4
  name: Post Competition Phase
  description: 'Post Competition phase: evaluate continually improving models.'
  start: '2024-06-26'
  max_submissions_per_day: 3
  max_submissions: 200
  execution_time_limit: 60
  auto_migrate_to_this_phase: false
  hide_output: false
  tasks:
  - 4
  solutions: []

Suddenly there are 5 tasks, but the phases use only 3 of those (the tasks with index 0 and 3 are never referenced)! And the zip files for some of the tasks have identical names: it seems this happens with pairs of tasks where the index of one is not used but the index of the other one is used. I do not understand why this task index change is happening, nor why those files with identical names are placed in the dump file, but I am sure that is not how it should work.
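
The mismatch can be checked mechanically. The following Python sketch works on a hand-copied, reduced version of the dumped tasks and phases sections shown above; the function name `check_dump` is hypothetical, and in practice the dictionary would come from parsing the dumped competition.yaml with a YAML parser.

```python
from collections import defaultdict


def check_dump(config: dict) -> tuple[list[int], dict[str, list[int]]]:
    """Return (unused task indices, {reference_data file: task indices sharing it})."""
    used = {t for phase in config["phases"] for t in phase["tasks"]}
    unused = sorted(
        task["index"] for task in config["tasks"] if task["index"] not in used
    )

    by_file = defaultdict(list)
    for task in config["tasks"]:
        by_file[task["reference_data"]].append(task["index"])
    shared = {name: idx for name, idx in by_file.items() if len(idx) > 1}
    return unused, shared


# Reduced copy of the dumped tasks/phases sections shown above.
dump = {
    "tasks": [
        {"index": 0, "reference_data": "reference_data-6318.zip"},
        {"index": 1, "reference_data": "reference_data-6318.zip"},
        {"index": 2, "reference_data": "reference_data-6319.zip"},
        {"index": 3, "reference_data": "reference_data-6320.zip"},
        {"index": 4, "reference_data": "reference_data-6320.zip"},
    ],
    "phases": [
        {"name": "Setup Phase", "tasks": [1]},
        {"name": "Trial Phase", "tasks": [1]},
        {"name": "Development Phase", "tasks": [2]},
        {"name": "Competition Phase", "tasks": [4]},
        {"name": "Post Competition Phase", "tasks": [4]},
    ],
}

unused, shared = check_dump(dump)
print(unused)
print(shared)
```

On this data the unused task indices and the groups of tasks sharing a reference data file line up: each shared file name belongs to one used and one unused task.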

You should be able to verify all this yourself by downloading either the competition bundle or the "Dump with files" archive.