allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Get average statistics between group of experiments #499

Open · levan92 opened this issue 2 years ago

levan92 commented 2 years ago

Hi, often in ML experiments we will run several (3-5) runs with the same hyperparameters to cover the randomness in training, and subsequently report findings based on averaged outcomes (whether it is a mean or a median).

May I know if there are any current ClearML features that allow grouping such similar runs and showing just one set of averaged statistics instead? This would be useful for comparing one group of experiment runs against another (e.g., a group of 3 runs vs. another group of 3 runs).

bmartinn commented 2 years ago

Hi @levan92, is this feature request similar to #473? Or is it an extension, i.e. a new way to retrieve data from the system?

BTW: When you get a Task entity (see Task.query_tasks), you can retrieve the Task's metrics with get_reported_scalars and get_last_scalar_metrics.
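
For reference, a minimal sketch of what that retrieval flow might look like; the project name, tag, and metric/series names below are placeholders, not something prescribed by the thread:

```python
from clearml import Task

# Find candidate tasks (returns a list of task IDs); filter values are placeholders
task_ids = Task.query_tasks(project_name="my_project", tags=["seed-sweep"])

for task_id in task_ids:
    task = Task.get_task(task_id=task_id)

    # Full scalar history: {metric_title: {series: {"x": [...], "y": [...]}}}
    scalars = task.get_reported_scalars()

    # Summary values only: {metric_title: {series: {"last": ..., "min": ..., "max": ...}}}
    last_metrics = task.get_last_scalar_metrics()
    print(task_id, last_metrics.get("validation", {}).get("accuracy", {}))
```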

levan92 commented 2 years ago

Thanks for your reply @bmartinn!

473 & query_tasks still requires additional code to process the results, I'm actually looking at something more intuitive/user-friendly.

Some ways I can imagine would really help:

  1. When submitting remote execution jobs, we can indicate a parameter to repeat the same task N times, such that we submit once but N repeated experiments appear on the web UI (at different seeds).
  2. "Grouping" on the web UI so that the repeated experiments do not clutter the view, and also allowing us to see averaged result metrics (validation accuracy, for example); the average statistic can be mean/median/mode.

bmartinn commented 2 years ago

  1. When submitting remote execution jobs, we can indicate a parameter to repeat the same task N times, such that we submit once but N repeated experiments appear on the web UI (at different seeds).
  2. "Grouping" on the web UI so that the repeated experiments do not clutter the view, and also allowing us to see averaged result metrics (validation accuracy, for example); the average statistic can be mean/median/mode.

Let me see if I understand: so this is basically grouping? e.g. collapse all experiments into a single line (criteria unknown), expand them when we need the details, with the actual values presented in the table (i.e. scalars) being the average over all the lines in the collapsed experiments?

Assuming I understand you correctly, the first challenge is defining which experiments are grouped together. My feeling is that any automagic rule will end up breaking, and users will want full control over what goes where. This leads to the idea of grouping based on shared "tag/s". Now, if we are already using tags, why don't we just use the existing "sub-folder" feature and create a folder for each group? wdyt?

Regarding the scalars summary (i.e. averaging the metric values), this is a great idea to add to the project overview, no?

(Basically, what I'm trying to say is that nesting (a.k.a. collapse/expand) inside tables is always very tricky to get right in terms of UI/UX, whereas sub-folders are a more straightforward solution.)

idantene commented 2 years ago

Not sure if this is the OP intent, but grouping experiments into collapsible rows (without combining metrics or any of their data, just a UI tweak!) is quite common. I think this can probably be achieved in ClearML too - just group by the parent_task?
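
As a rough illustration of the parent_task idea, here is a sketch of grouping runs by their parent with the existing Python API; it assumes Task.query_tasks can return the parent field via additional_return_fields, and the project name is a placeholder:

```python
from collections import defaultdict
from clearml import Task

# Ask for the parent field in addition to the task ID
# (assumed: additional_return_fields makes this return a list of dicts)
rows = Task.query_tasks(
    project_name="my_project",
    additional_return_fields=["parent"],
)

# Bucket task IDs by their parent task ID
groups = defaultdict(list)
for row in rows:
    groups[row.get("parent")].append(row["id"])

for parent_id, child_ids in groups.items():
    print(f"parent={parent_id}: {len(child_ids)} run(s)")
```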

bmartinn commented 2 years ago

but grouping experiments into collapsible rows (without combining metrics or any of their data, just a UI tweak!) is quite common. I think this can probably be achieved in ClearML too - just group by the parent_task?

Hmm, for it to work properly, we need a good strategy on "parent task":

  1. Cloning a Task without a parent -> Newly created Task parent is the "original" Task (i.e. parent -> child)?
  2. Cloning a Task with a parent -> Newly created Task parent is the "original" Task's parent (i.e. sibling task)?
  3. If you have a Task in draft mode, and you edit it (e.g. completely change everything), do we change (null) the parent Task?

wdyt?

idantene commented 2 years ago

That's a good question. I think those make for sensible defaults (maybe let the user change the "default parent task" in WebUI?). Then what about cloning a parent task? Is it allowed? Does it clone all child tasks?

bmartinn commented 2 years ago

Hey @idantene

maybe let the user change the "default parent task" in WebUI?

What do you mean by "default parent task" ?

Then what about cloning a parent task? Is it allowed? Does it clone all child tasks?

Allowed, and by design it will not clone the child Tasks. Is there a reason to do that?

idantene commented 2 years ago

What do you mean by "default parent task" ?

In reference to (2), "Cloning a Task with a parent -> Newly created Task parent is the 'original' Task's parent (i.e. sibling task)": a user may want to change the "original" parent to something else.

Allowed, and by design it will not clone the child Tasks. Is there a reason to do that?

It could -- it really depends on how the notion of "parent" task is used. As it is right now, it can either be defined as:

  1. A parent task is task A, such that task B is a clone of Task A with some changes (allows you to go back through changes)
  2. A parent is any task that is used to group other tasks in it, and is not necessarily an original or cloned task.

bmartinn commented 2 years ago

Thanks @idantene, I think I now better understand the use case.

showing just 1 set of averaged statistics instead?

This is reflective of @levan92's original request.

Basically I can think of two paths we can take:

  1. Make the data available via the Python interface, then one can build their own "dashboards" in Jupyter (or any other solution); a sketch of this follows below.
  2. The data itself (in its final form) sits in MongoDB. We could have an SQL adapter to Mongo and then connect it with Apache Superset. This would allow users to build any dashboard they like. For example, one can "select" Tasks based on a specific "tag", "parent", or "project", then display one of the metrics distributed over the different Tasks, or the average, etc.

wdyt?
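
As a sketch of what path 1 could look like in practice (group selection by a shared tag, then a simple aggregate); the tag and metric/series names here are placeholders:

```python
import statistics

from clearml import Task

# Select the group of repeated runs by a shared tag (placeholder values)
task_ids = Task.query_tasks(project_name="my_project", tags=["group-a"])

values = []
for task_id in task_ids:
    metrics = Task.get_task(task_id=task_id).get_last_scalar_metrics()
    last = metrics.get("validation", {}).get("accuracy", {}).get("last")
    if last is not None:
        values.append(last)

if values:
    print(f"group-a: n={len(values)}, "
          f"mean={statistics.mean(values):.4f}, median={statistics.median(values):.4f}")
```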

timokau commented 2 years ago

I'd also love this feature. It would be great if aggregation was available in the web UI. Ideally it'd be possible to show the mean value while also indicating some other statistic (standard deviation, confidence interval) as an area.

AIM does a good job at this, see this clip as an example. You can first select which criterion to group by (experiment name, some hyperparameter) and then aggregate the scalars based on that criterion.
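
Until something like this exists in the web UI, here is a matplotlib sketch of the same idea (mean curve with a standard-deviation band over repeated runs); the tag and metric/series names are placeholders, and the runs are assumed to report the same iterations:

```python
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task

# Collect the same scalar curve from every run in the group (placeholder names)
task_ids = Task.query_tasks(project_name="my_project", tags=["group-a"])

x, curves = None, []
for task_id in task_ids:
    series = Task.get_task(task_id=task_id).get_reported_scalars()["validation"]["accuracy"]
    x = series["x"]            # iteration axis, assumed identical across runs
    curves.append(series["y"])

ys = np.array(curves)          # shape: (runs, iterations)
mean, std = ys.mean(axis=0), ys.std(axis=0)

plt.plot(x, mean, label="mean over runs")
plt.fill_between(x, mean - std, mean + std, alpha=0.3, label="±1 std")
plt.xlabel("iteration")
plt.ylabel("validation accuracy")
plt.legend()
plt.show()
```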

erezalg commented 2 years ago

Hi @timokau,

Thanks for pointing this out and sharing the clip! This is indeed on our radar and we're evaluating different approaches on how to implement this. I'll come back here once I have a more concrete solution, or if we need more feedback on our thoughts!

pshvechikov commented 1 year ago

I would also like to have this feature implemented. Averaging over seeds is a very basic feature for ML research.

mrodiduger commented 9 months ago

I also want to express my wish to see this feature implemented. It is crucial for the workflow of many ML researchers.

ainoam commented 9 months ago

@mrodiduger Seeing as the discussion so far has diverged in multiple directions :), which of those are you considering when endorsing "this" feature?

Capsar commented 9 months ago

This feature would be highly beneficial for projects with 100 or even 1000+ experiments. Grouping by hyperparameters allows one to quickly see the statistical effects of those parameters on the loss or evaluation scalars.

jledragon commented 7 months ago

I would also express a wish for this feature to be implemented.

ainoam commented 7 months ago

@Capsar @jledragon Thanks for bumping this feature request.

Do note that the UI lets you download the experiment table along with any desired custom column in CSV format. Can this help in the meantime?
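
For anyone taking the CSV route in the meantime, a small pandas sketch of grouping by a hyperparameter column and averaging a metric column; the file name and column names depend on which custom columns were added before exporting, so they are placeholders here:

```python
import pandas as pd

# The exported experiment table, with hyperparameter and metric custom columns added
df = pd.read_csv("experiments.csv")

# Mean/std/count of the metric per hyperparameter value (placeholder column names)
summary = df.groupby("General/learning_rate")["accuracy"].agg(["mean", "std", "count"])
print(summary)
```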

IcarusWizard commented 7 months ago

I would also like to have this feature implemented.

@ainoam As far as I understand, we want the feature to be natively supported in the web UI (WandB has this feature, by the way). Manually downloading the results and writing another script to group and plot is always an option, but it is not convenient.

ainoam commented 7 months ago

Thanks for joining the conversation @IcarusWizard. The manual processing option is merely there for visibility, to allow progress until additional capabilities are introduced 🙂

johnHostetter commented 6 months ago

It has been a few years since this issue was opened, but I would also like to request that this feature be added. This is standard practice in ML research, and it is disappointing that such a critical feature is not natively supported by ClearML; I spent most of my day trying to make this happen in the web UI, only to arrive here.

Most ML research reports plots with confidence intervals over several runs.

I appreciate all the work everyone has put into ClearML, but from this perspective, ClearML may not be in line with ML researchers' expectations.

gntoni commented 4 months ago

I am also very interested in this. Here are some examples in WandB. It would be awesome to have something like that.