Striveworks / valor

Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
https://striveworks.github.io/valor/

Proposed Bulk GET Response Structure #232

Closed jyono closed 1 year ago

jyono commented 1 year ago

Currently, the response looks like this.

{
  "statuses": {
    "DONE": [
      {
        "job_id": 87,
        "dataset": "test_dataset",
        "model": "test_model"
      }
    ]
  },
  "evaluations": [
    {
      "dataset": "test_dataset",
      "model": "test_model",
      "metrics": [
        {
          "filter": "some_filter",
          "metrics": [],
          "confusion_matrices": [
          ]
        }
      ]
    }
  ]
}

What constitutes uniqueness for an evaluation? dataset+model+filter. The way this is currently structured, we are nesting the filters under a dataset+model pair. This is tricky because now we also need the separate statuses array to know what is done. I propose we flatten the structure so that a single evaluation entry is information-complete. e.g.

{
  "evaluations": [
    {
      "dataset": "test_dataset",
      "model": "test_model",
      "filter": "some_filter",
      "job_id": 87,
      "status": "DONE",
      "metrics": {},
      "confusion_matrices": {}
    }
  ]
}
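
For illustration, a minimal sketch of what consuming the flattened response could look like on the client side (not actual Valor client code; the toy `response` below just mirrors the example JSON):

# Minimal sketch, assuming the flattened response shape proposed above.
# `response` is a toy parsed JSON body; in practice it would come from the API.
response = {
    "evaluations": [
        {"dataset": "test_dataset", "model": "test_model", "filter": "some_filter",
         "job_id": 87, "status": "DONE", "metrics": {}, "confusion_matrices": {}},
    ]
}

# Each entry is information-complete: status, job_id, filter, and metrics
# live together, so there is no separate statuses map to join against.
done = [e for e in response["evaluations"] if e["status"] == "DONE"]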

@czaloom @ntlind thoughts?

ntlind commented 1 year ago

we purposefully grouped other information under dataset and model name since those are the input parameters supplied by the client. we could unnest everything as you suggested, but that would mean you'd have to iterate over every single item in the list just to a) figure out which jobs are done and b) get metrics for a specific dataset / model / filter combination that you're interested in. the current configuration groups the data in such a way that you don't have to iterate over a large list on the client-side.

This is tricky because now we also need the separate statuses array to know what is done. I propose we flatten the structure so that a single evaluation entry is information-complete. e.g.

not sure I understand why storing the statuses separately is a problem. it's meant to be convenient for the client: a user can quickly tell which jobs were DONE, etc. using the response object without having to iterate over all returned evaluations
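
As a rough sketch of that convenience (illustrative only; the toy `response` mirrors the first JSON example in this issue):

# Sketch only: with the current grouped response, finished jobs can be read
# straight off the statuses map without scanning the evaluations list.
response = {
    "statuses": {"DONE": [{"job_id": 87, "dataset": "test_dataset", "model": "test_model"}]},
    "evaluations": [],
}
done_ids = {job["job_id"] for job in response["statuses"].get("DONE", [])}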

jyono commented 1 year ago

we purposefully grouped other information under dataset and model name since those are the input parameters supplied by the client

sure. we could add filter and possibly job_id as query params as well.

but that would mean you'd have to iterate over every single item in the list just to a) figure out which jobs are done

right now, we have to iterate through the entire "DONE" status list and then iterate through all the evaluations to find the matches we are looking for, if we don't know the job ID. This is already O(nk), where n is the number of evaluations and k is the number of DONE status items. Also, how do we differentiate jobs from each other? there is no connection between the evaluation entries (dataset, model, filter) and the status entries (job_id).
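
A sketch of the join being described (illustrative; the toy `response` follows the current structure, with shapes taken from the first example):

# Sketch only: relating status entries back to evaluations under the current structure.
response = {
    "statuses": {"DONE": [{"job_id": 87, "dataset": "test_dataset", "model": "test_model"}]},
    "evaluations": [{"dataset": "test_dataset", "model": "test_model",
                     "metrics": [{"filter": "some_filter", "metrics": [], "confusion_matrices": []}]}],
}

# For each of the k DONE status entries, scan all n evaluations: O(nk).
# Even after a dataset + model match, nothing ties job_id 87 to a particular filter entry.
for status in response["statuses"]["DONE"]:
    matches = [e for e in response["evaluations"]
               if e["dataset"] == status["dataset"] and e["model"] == status["model"]]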

b) get metrics for a specific dataset / model / filter combination that you're interested in. the current configuration groups the data in such a way that you don't have to iterate over a large list on the client-side.

if you only want specific dataset/model combinations, the client would only request the ones they are interested in.

we could also make the evaluations a JSON object keyed by something like dataset/model/filter for faster lookups.

e.g.

{
  "evaluations": {
    "test_dataset/test_model/some_filter": {
      "dataset": "test_dataset",
      "model": "test_model",
      "filter": "some_filter",
      "job_id": 87,
      "status": "DONE",
      "metrics": {},
      "confusion_matrices": {}
    }
  }
}
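
A hypothetical sketch of the lookup such a keyed structure would allow (the composite-key format dataset/model/filter is only the suggestion above, not an existing Valor convention):

# Sketch only: with a composite key, a specific evaluation is a single dict lookup.
response = {
    "evaluations": {
        "test_dataset/test_model/some_filter": {
            "dataset": "test_dataset", "model": "test_model", "filter": "some_filter",
            "job_id": 87, "status": "DONE", "metrics": {}, "confusion_matrices": {},
        }
    }
}
evaluation = response["evaluations"].get("test_dataset/test_model/some_filter")
is_done = evaluation is not None and evaluation["status"] == "DONE"
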
ntlind commented 1 year ago

reiterating the most important question here: is there a specific use case where nested evaluations would be more difficult to work with than unnested evaluations? what pain point are we solving by unnesting statuses?

sure. we could add filter and possibly job_id as query params as well.

is there a use case for saying "give me all metrics across all datasets and models for a specific filter criteria?" that doesn't seem intuitive to me, but would be interesting if true! as for job_id, we're hoping downstream clients don't have to interact much with Velour job IDs (i.e., it should probably be abstracted away as much as possible). if the user does have to get evaluations for a specific ID, then an endpoint already exists for that.

right now, we have to iterate through the entire "DONE" status list and then iterate through all the evaluations to find the matches we are looking for, if we don't know the job ID. This is already O(nk), where n is the number of evaluations and k is the number of DONE status items.

can you please describe your use case here? the most important use case that I'm aware of that requires job_ids is "help the client understand which jobs weren't finished when the bulk evaluations were fetched".

using the current structure, you'd just look at bulk_evaluations.statuses['NOT_DONE'] to get that list. under your proposed structure, you'd have to iterate through all evaluations to build a list of job_ids where status == NOT_DONE. the first approach seems more convenient for that use case. what others am I missing?
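
For concreteness, a sketch of the two lookups being compared (illustrative only; both toy objects below stand in for parsed responses, and the status names follow this comment):

# Sketch only: finding unfinished jobs under each structure.
grouped = {"statuses": {"NOT_DONE": [{"job_id": 88, "dataset": "d", "model": "m"}]},
           "evaluations": []}
flattened = {"evaluations": [{"job_id": 88, "status": "NOT_DONE"}]}

# Current grouped structure: one dictionary access.
not_done_grouped = grouped["statuses"].get("NOT_DONE", [])

# Proposed flattened structure: scan every evaluation entry.
not_done_flat = [e["job_id"] for e in flattened["evaluations"] if e["status"] == "NOT_DONE"]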

Also, how do we differentiate jobs from each other? there is no connection between the evaluation entries (dataset, model, filter) and the status entries (job_id).

what's an example where you'd have to differentiate jobs from one another? in general, we want clients to worry about velour job_ids as little as possible. the use case above is the only one I'm aware of where the client should care about job_ids (e.g., if the client needs to re-query a specific job_id because it wasn't finished when the bulk evaluations were fetched)

the current structure already establishes that relationship between dataset, model, and job_id in bulk_evaluations.statuses. we could add filter there as well, but I'm not sure why we need to. dataset and model were added because those are the two input parameters that are used in get_bulk_evaluations.

if you only want specific dataset/model combinations, the client would only request the ones they are interested in.

not sure I understand what you mean here. get_bulk_evaluations() gets all evaluations for any named datasets and models; it doesn't filter for the intersection between the user-supplied dataset and model

we could also make the evaluations a JSON object keyed by something like dataset/model/filter for faster lookups.

two issues:

ntlind commented 1 year ago

fixed in #236