OPIK-76 Add stream experiment items endpoint

andrescrz commented 1 week ago

Details

This is meant to be used by SDK and to return at least these fields per experiment item: id, trace_id and dataset_item_id and that's exactly what it does, plus any other field in experiments table.

It has the same semantics as the other stream operations in the service: limit (default 500) and last retrieved id cursor.

The search by experiment name follows the same pattern as in the find experiment endpoints: contains regex and case insensitive.

Therefore, it can match experiment items per multiple experiments. For that reason, they're sorted by experiment id first.

Issues

OPIK-76

Testing

Added integration test covering all logical cases.

Documentation

N/A

andrescrz commented 1 week ago

My concern only concerns are:

What if we have too many experiments?

Shouldn't the response format be responsibility of the API layer ?

Those are nice questions. Let me answer then:

Regarding 1: there are multiple alternatives to handle this:

By using the current limit param: this wouldn't work in some cases, as we want to limit by the items.
Adding a limit for experiments: this works but makes the endpoint more cumbersome and results will be less predictable.
Optimising the experiments query to just return the id: this can be done on top any other improvement. I'd rather not to start with, in order to leave this DAO method more reusable for the future. Also, because it might by a not very impactful optimisation.

With all this in mind, I honestly think that the best option is to leave the endpoint like it is. Experiment names are randomly generated and in practice the number of matches within a workspace is going to be very low. We might never encounter an issue with this and if we do, we can always tackle 2, 3 or both.

Regarding 2, I've refactored the services and resources a bit and encapsulated the streaming logic in a Streamer class. I think the separation of responsibilities and reusability of code has improved a lot.

With this, adding new streaming endpoints is much more straight forward. Let me know what you think.

thiagohora commented 1 week ago

By using the current limit param: this wouldn't work in some cases, as we want to limit by the items.

Perfect. My concern was more with having a very large list of experimentIds passed to the second. But I agree this may not be an issue any time soon

andrescrz commented 1 week ago

By using the current limit param: this wouldn't work in some cases, as we want to limit by the items.

Perfect. My concern was more with having a very large list of experimentIds passed to the second. But I agree this may not be an issue any time soon

Yeah, I didn't explain myself well in my previous message, but that's exactly what I meant. We would also need to limit the number of retrieved experiments (and their IDs), in addition to the current limit for items.

But it's unlikely to be an issue for a very long period of time, so I propose to not to do it for now.

thiagohora commented 1 week ago

Agree! An option would be to do the join or a subquery in the IN clause. But can discuss about it later, when if needed

comet-ml / opik