codalab / codabench

Codabench is a flexible, easy-to-use, and reproducible benchmarking platform. Check out our paper in Patterns (Cell Press): https://hubs.li/Q01fwRWB0

[Feature] Meta scoring program: organizer provides the code for computation between leaderboard columns #406

Open Didayolo opened 4 years ago

Didayolo commented 4 years ago

Competition organizers can define their own scoring and ingestion programs in a flexible way with Python code.

But in CodaLab v1.5, the computation of the final ranking when there are several sub-scores is constrained to a few predefined options (average rank, maximum, ...).

I think it is crucial to be able to use any function to compute this final ranking, so organizers should be able to provide a kind of "meta scoring program".

The same idea was mentioned here:

A much more flexible way to handle computed columns. This would be a fine solution for a much more advanced "computed column" type system. Organizers could do any kind of computation they'd like! This should be executed even if children have failed.

Also, a similar feature we'd like:

Another related issue:

Currently, there is one column (the primary column) that is used to decide the ranking on the leaderboard. We should add ways to compute the leaderboard ranking across all columns, and have a way to exclude certain columns from the ranking calculation. Make computed columns:

  • Could automatically be generated from columns with the same key (see the sketch after this list)
    • Pro: simple and programmatic
    • Con: less control
  • Could require each column to have a unique key and then group the columns later to be calculated, OR could instead select columns by key + task
    • Pro: more control in selection
    • Con: takes more work from admins
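For illustration, here is a minimal sketch of the first option, grouping columns that share a key (entirely hypothetical; the dict shapes are assumptions made for illustration, not Codabench's actual data model):

    from collections import defaultdict

    # Hypothetical leaderboard columns: one entry per (key, task) pair.
    columns = [
        {"key": "accuracy", "task": "task_1", "value": 0.91},
        {"key": "accuracy", "task": "task_2", "value": 0.84},
        {"key": "f1", "task": "task_1", "value": 0.77},
    ]

    # Group the columns sharing the same key, then average each group
    # to produce the computed column.
    groups = defaultdict(list)
    for col in columns:
        groups[col["key"]].append(col["value"])

    computed = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    print(computed)  # {'accuracy': 0.875, 'f1': 0.77}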

And the following post-it / discussion issue:

ckcollab commented 4 years ago

Something similar to this may have been implemented to support AutoDL (multiple tasks) in v2? There is a way to do weighted-sum scoring across children tasks without the need for a master scoring program; the master phase defines this behavior. Zhen and Jimmy implemented this for AutoDL in Jan/Feb, I believe.

ihsaan-ullah commented 1 year ago

Something similar to this may have been implemented to support AutoDL (multiple tasks) in v2? There is a way to do weighted-sum scoring across children tasks without the need for a master scoring program; the master phase defines this behavior. Zhen and Jimmy implemented this for AutoDL in Jan/Feb, I believe.

@Didayolo do you know something about this?

Didayolo commented 1 year ago

The AutoDL version of CodaLab is version 1.8 (I think the GitHub branch is named google-cloud or something like that). It corresponds to this instance of CodaLab: https://autodl.lri.fr/

In this version, we have a master scoring program, and the phases are used like tasks in Codabench. Indeed, we had average rank there.

But I don't think it is useful to look there: average rank is already implemented in vanilla CodaLab (v1.6). It was invoked using the Avg operation (unclear naming, since Avg suggests a simple score average), as in this example of a competition.yaml file for a CodaLab competition with average rank:

leaderboard:
  columns:
    auc_binary:
      label: Classification (AUC ROC)
      leaderboard: &id001
        label: RESULTS
        rank: 1
      rank: 2
    auc_binary_2:
      label: Feature Selection (AUC ROC)
      leaderboard: *id001
      rank: 3
    ave_score:
      label: < Rank >
      leaderboard: *id001
      rank: 1
      sort: asc
      computed:
        operation: Avg
        fields: auc_binary, auc_binary_2
  leaderboards:
    RESULTS: *id001

The computation is there in the code:

https://github.com/codalab/codalab-competitions/blob/68036d9b06272ce23572d121c85b3c00ee49a21a/codalab/apps/web/models.py#L1251

In Codabench, here is the yaml syntax:

[...]
  columns:
      - title: Average Accuracy
        key: avg_accuracy
        index: 0
        sorting: desc
        computation: avg
        computation_indexes:
          - 1
          - 2
          - 3
          - 4
[...]

But this time, Avg really is the average; average rank is not implemented. We can also edit this using the editor.

Last point: implementations of ranking functions can be found in the Ranky library:

rk.score is the simple average, which calls np.mean. rk.borda is the average rank. The naming follows the names of these functions in the field of Social Choice Theory.
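To see why the distinction matters, here is a minimal sketch using NumPy/SciPy directly rather than Ranky itself (made-up data): with the plain average the raw score magnitudes dominate, while with the average rank only the per-column orderings matter, so the two methods can crown different winners.

    import numpy as np
    from scipy.stats import rankdata

    # One row per submission, one column per leaderboard sub-score.
    scores = np.array([
        [0.99, 0.10],  # A: outstanding on score 1, poor on score 2
        [0.50, 0.51],  # B: decent on both
        [0.49, 0.50],  # C: just behind B on both
    ])

    # rk.score-style aggregation: plain mean of the raw scores.
    avg_score = scores.mean(axis=1)    # A=0.545, B=0.505, C=0.495 -> A wins

    # rk.borda-style aggregation: rank each column (1 = best), then
    # average the per-submission ranks; lower is better.
    ranks = rankdata(-scores, axis=0)  # negate so the best score gets rank 1
    avg_rank = ranks.mean(axis=1)      # A=2.0, B=1.5, C=2.5 -> B wins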

Didayolo commented 1 year ago

Also, just for the record, here are diagrams illustrating the difference between "average" and "average rank", and why it changes how the computed column should be calculated in Codabench:

https://docs.google.com/presentation/d/1_qg2AcBUeBtIlYGq754wMHMkAn3mzIT_JjlLAepjWZI/edit?usp=sharing

ihsaan-ullah commented 1 year ago

Plan for this feature @Didayolo

  1. Add an option in the YAML file to specify meta-scoring columns. Example with explicitly listed columns:

     leaderboards:
       - index: 0
         meta-scoring:
           computation: average rank
           columns:
             - score_1
             - score_2

Example selecting columns by a string pattern (useful when new columns may be added later):

leaderboards:
  - index: 0
    meta-scoring:
      computation: average rank
      column_ends_with: _score

We can have two keys: column_ends_with and column_starts_with.

  2. Store the meta-scoring fields.
  3. Apply meta-scoring when fetching a leaderboard.

Later on, we can add more meta-scoring programs.
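A rough sketch of what steps 2 and 3 could look like (function names, data shapes, and the suffix matching are hypothetical, not Codabench internals):

    import numpy as np
    from scipy.stats import rankdata

    def select_columns(column_keys, ends_with=None, starts_with=None):
        # Resolve column_ends_with / column_starts_with into concrete keys.
        keys = list(column_keys)
        if ends_with is not None:
            keys = [k for k in keys if k.endswith(ends_with)]
        if starts_with is not None:
            keys = [k for k in keys if k.startswith(starts_with)]
        return keys

    def apply_meta_scoring(rows, keys, computation="average rank"):
        # rows: one {column_key: score} dict per submission.
        scores = np.array([[row[k] for k in keys] for row in rows])
        if computation == "average rank":
            # Rank each column (1 = best), then average ranks per row.
            return rankdata(-scores, axis=0).mean(axis=1)
        return scores.mean(axis=1)  # plain average fallback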

Didayolo commented 1 year ago

I think we should keep the interface we already have, which is mentioned above:

  columns:
      - title: Average Accuracy
        key: avg_accuracy
        index: 0
        sorting: desc
        computation: avg
        computation_indexes:
          - 1
          - 2
          - 3
          - 4

and the editor:

[Screenshot: Codabench leaderboard column editor]

Here you specify precisely which columns are used in the computation. Instead of only Avg, we'd just like to have more options.

Maybe having a full meta-scoring program, as originally proposed in this issue, is too much, and we just want to allow the ranking functions of the Ranky package.
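Concretely, the existing computation key could simply dispatch to Ranky functions; a sketch (the avg_rank key name and Ranky's rows-as-candidates orientation are my assumptions):

    import ranky as rk

    # Hypothetical mapping from the YAML `computation` value to a
    # Ranky function; "avg_rank" is an assumed new key, not current
    # Codabench syntax.
    COMPUTATIONS = {
        "avg": rk.score,       # simple average of the raw scores
        "avg_rank": rk.borda,  # average rank (Borda count)
    }

    def compute_column(matrix, computation):
        # matrix: submissions as rows, sub-score columns as judges.
        return COMPUTATIONS[computation](matrix)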

Didayolo commented 1 year ago

There is also the question of the v1 unpacker (#368).

juliojj commented 8 months ago

It is a pity that average ranking is not available in Codabench; it is a very useful feature in CodaLab. It would be great if you could consider including it in the future. Thanks.