Immutable search spaces

I currently see a larger set of problems that can all be traced back to the fact that our current SubspaceDiscrete class is mutable and we should consider making it immutable in the future. Here, I would only like to highlight a few of the problems so that they are not forgotten. We can then deal with them in autumn when I'm back (I already have a potential solution in mind but more on that later).

One of the biggest problems – since it can easily lead to unintended behavior / silent bugs – is that people might not properly take into account the statefulness of the problem, perhaps because they are simply unaware. Even if they are aware, they can easily forget. I've seen it already many times, in fact.

Simple example

Consider someone wants to run several studies on the same basic setup, e.g. testing two recommenders. A common thing that people might do:

from baybe import Campaign
from baybe.parameters import NumericalDiscreteParameter
from baybe.recommenders import FPSRecommender, RandomRecommender
from baybe.targets import NumericalTarget

searchspace = NumericalDiscreteParameter("p", [0, 1, 2, 3]).to_searchspace()
objective = NumericalTarget("t", "MAX").to_objective()
recommenders = [RandomRecommender(), FPSRecommender()]

results = []
for recommender in recommenders:
    campaign = Campaign(searchspace, recommender=recommender)
    recommendations = campaign.recommend(3)
    results.append(recommendations)

At first glance, seems legit, but is wrong. Gives:

baybe.exceptions.NotEnoughPointsLeftError: Using the current settings, there are fewer than 5 possible data points left to recommend ......

because the second campaign uses the metadata that was modified during the first run. The problem can be easily avoided by making the space immutable, just like all other ingredients we offer for specifying campaigns. Of course, this requires a different approach to metadata handling.

After this illustrating case, let me just summarize all issues I currently see. There might be more, so I'll update the list whenever I spot another one:

Chance of unknowingly reusing modified metadata (see example above)
Semantically, metadata should belong to a campaign, not to the search space! It's a record of the experimentation process. We only store it in the search space for convenience reasons. --> Wrong responsibilities
Related to the above point, you might want to be able to manage different sets of metadata simultaneously for different campaigns, without having to also manage corresponding copies of their searchspaces.
Currently, the "marking" process alone is solved very inefficiently (see #344). Really, the process of "adding data to a campaign" should have zero cost.
The current way metadata is accessed / modified is both very inelegant (in terms of user interface), hence error-prone, and gets abused all the time. People mess with it for finer control over the candidate space. But this should be done via a completely separate mechanism, not by touching private attributes.
Because the metadata leaves the campaign after adding data, you cannot easily "undo" operations. For example, it's currently impossible to "remove the latests batch" because we can't access the state of the metadata before it was added.

Please mention your suggested solution briefly

In particular I'm curious whether it would solve the computational issue mentioned in the second last point. In my view this is a real problem, the test depends on interpretation and intent.

After having given several people the metadata approach for workarounds, no one has described the approach as inelegant or error-prone, you are the first one

Alright, can do. Before, quick comment on what you said:

After having given several people the metadata approach for workarounds, no one has described the approach as inelegant or error-prone, you are the first one

It is inelegant and error-prone for several reasons:

very basically, because it forces to use pandas' crappy way of indexing data, which itself is completely inelegant/error-prone. I don't think we need to argue about this, everyone who's gone through the phase of testing whether to use .loc vs .iloc vs regular indexing cross-joined with the choice of using a list of indices, a list of booleans, a series of indices, a series of booleans, a series whose indices are the target indices .... knows the hassle and how bad this is, even the pandas author himself.
it forces you to interact with two objects, one to identify positions, the other one to write to
we tell people to meddle with a private attribute. That alone is a no-go. But it already indicates there is no well-defined interface and we can effectively not change the semantics of that object because people are already locked in
you need to remember the relationship between column index and functionality, which can be easily messed up (e.g. simply by forgetting that python is zero-index based)

Where ever there is room for error, people WILL tap into the problems. We have seen this countless times will less severe things like outdated docs etc. I think enough said 🙃

Bad Solution

Hypothetically speaking, there is a quick fix to the timing issues associated with adding campaign data. But I absolutely discourage going that path: We could simply temporarily cache the added data in the campaign and then only flush the cache (i.e. do the actual search space marking) as soon as it is needed, i.e. when you trigger the next recommendation. A lazy data adding if you will.

However, apart from having to maintain another cache and the additional logic involved, this makes the other problems I've described even worse, because you effectively shatter the metadata information into pieces an distribute them across two objects: part of it sits in the campaign until the cache is flushed, the rest sits in search space. So what used to be stored in a single location, could then even spread across several campaigns, making the coupling problem much worse (i.e. what I showed in the example). So let's better quickly forget about this 😬

Ideal Solution

Here the potential solution I could imagine. For me it solves all the problems and has only one potential downside, namely computational burden. But the latter is only a potential danger and we'd need to test. In fact, I think that it might not be a problem at all (and overall run even faster than what we have now). So let's forget about computational cost for the moment and only focus on layout, we'll talk about the other part later.

The idea would be to just kick the SubspaceDiscrete.metadata attribute altogether and handle stuff from the campaign side:

was_measured: Very easy. We already have that information in campaign anyway. It's literally just the Campaign.measurements attribute
was_recommended: Well, we'd just keep track of that as well. Would have the additional benefit that we can do so with richer information, e.g. by even counting how often something has been recommended. A simple dataframe with experimental columns would do it, containing just the configuations that were recommended. With optional additional columns like recommendations counters.
don't_recommend: This is the most complex part but offers to possibility do completely redesign fine-tuned control over the candidate space. I could imagine just to maintain an exclude attribute that collects user-provided exclusion. The form of that attribute could be designed in very different ways, depending on what we want to support. Three options as examples from the extreme ends of the spectrum (we could even combine them):
- Keep a list of arbitrary polars expressions. Full flexibility to the user, nice interface, compact, ...
- Just a hard-coded list of blacklisted configurations. Very explicit ...
- Keep a dataframe that acts as match-maker for defining blacklist relationships. For example, if you want to exclude a certain condition, we add it there. If we want to exclude all cases where only Param1==val, we fill the remaining columns in that row with a sentinal, etc. Sort of a mix of the above two options

The consequence is that the campaign fully controls its metadata, and this additional enables more flexibility like "undoing" operations like campaignn.remove_last_batch" (have just added this point to list of problems in the main description). With the new layout, this would simple be acampaign.measurements.pop()` operation, assuming we store things internally as list of batches.

The "drawback" is now that basically we need to merge this metadata with the search space before every recommendation. But I don't expect this to become a real bottleneck since:

Even in the most naive version with pandas: merge is implemented super efficiently. So this should work even for very large search spaces without much overhead
We could (and should anyway) change the backend to polars, speeding up things even further
If still too slow, we can add additional caching steps around, to reuse merges from the last recommendation.

Your rant has valid and invalid points (eg the attribute is defacto not private nor should it be according to the current logic that explicitly allows modification, hence one shouldnt argue based on it being private). But I'm not interested in solving theoretical problems of pandas dfs. After all it does not address my initial argument: users seem to be happy with this and know how to use it, do you have evidence to the contrary? We shouldn't assume users are idiots who cannot set a column in a dataframe based on an index...

It also misses the important point: its super minimalistic and lightweight, yet powerful enough to achieve most of whats needed. This is something that feels rather lacking from your proposed solution, in particular whats written under dont_recommend, thats quite some additional logic... for what?

Also remember, that once active_values is generalized, there will be much less need to recommend even touching the metadata.

Furthermore, we already have evidence that the current method might be computationally demanding (https://github.com/emdgroup/baybe/issues/344#issuecomment-2301921910). Based on your proposal your solution seems equally or more demanding. Solve one problem, worsen another, not the way to go. Consider my proposal from below, maybe additonally we can move the involved dfs optionally to polars and also improve merge speed.

Beyond that I see the problem of the current searchspace mutability and it should be addressed. Did you consider following much simpler solution that should also solve the searchspace mutability: have a discrete_metadata frame in the Campaign and not SearchSpace ?

Let me add my opinion as well.

I think we all agree on the main point of search space mutability being bad and that we need to do something about that.
I am with Martin on the point regarding whether or not there are actual complaints regarding the current implementation. As we agreed to several times already, we should try to focus more on the things that are actual concerns/questions of the users (e.g. the adding of data being computationally demanding) instead of fixing things no one complained about.
Just having a discrete_metadata field however feels weird to me, since even a Campaign that does not even have a discrete but e.g. continuous subspace would have that.
For me, the first two points that Adrian makes are basically exactly "Let's move the existing fields to Campaign, right? So I guess we are all fine with that?
Imo, it feels like the dont_recommend is actually the one field that makes problems and which also feels special. Which raises my question: Is this even "metadata"? For me, this feels like something different, maybe it might make sense to rethink how we handle that in general?

emdgroup / baybe