AdrianSosic opened this issue 2 months ago
Please briefly mention your suggested solution. In particular, I'm curious whether it would solve the computational issue mentioned in the second-to-last point. In my view, that one is a real problem; the rest depends on interpretation and intent.

After having given several people the metadata approach for workarounds, no one has described the approach as *inelegant* or *error-prone*, you are the first one.
Alright, can do. But first, a quick comment on what you said:
> After having given several people the metadata approach for workarounds, no one has described the approach as *inelegant* or *error-prone*, you are the first one
It is inelegant and error-prone for several reasons: `.loc` vs `.iloc` vs regular indexing, cross-joined with the choice of using a list of indices, a list of booleans, a series of indices, a series of booleans, a series whose indices are the target indices, ... everyone who works with pandas knows the hassle and how bad this is, even the pandas author himself. Wherever there is room for error, people WILL run into problems. We have seen this countless times with less severe things like outdated docs etc. I think enough said 🙃
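To make this concrete, here is a toy illustration in plain pandas (nothing BayBE-specific; all four assignments below run without raising, yet they mean different things):

```python
import pandas as pd

# Toy stand-in for the metadata frame. Note the non-contiguous index,
# as it naturally occurs after filtering operations.
metadata = pd.DataFrame(
    {"dont_recommend": [False, False, False]}, index=[0, 2, 5]
)

mask = [True, False, True]

metadata.loc[[0, 2], "dont_recommend"] = True  # label-based: rows 0 and 2
metadata.iloc[[0, 2], 0] = True                # position-based: rows 0 and 5 (!)
metadata["dont_recommend"] = mask              # positional overwrite, index ignored
metadata["dont_recommend"] = pd.Series(mask)   # index-aligned: label 5 becomes NaN (!)
```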
Hypothetically speaking, there is a quick fix to the timing issues associated with adding campaign data, but I absolutely discourage going down that path: we could simply cache the added data in the campaign temporarily and only flush the cache (i.e., do the actual search space marking) as soon as it is needed, i.e., when the next recommendation is triggered. Lazy data adding, if you will.
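Just for the record, such a lazy-flush variant would look roughly like this (a sketch only, all names hypothetical):

```python
# Sketch of the discouraged lazy-flush variant (names invented for illustration):
class Campaign:
    def __init__(self, searchspace):
        self.searchspace = searchspace
        self._pending = []  # cache of added but not yet processed data

    def add_measurements(self, data):
        # Cheap: only cache, defer the actual searchspace marking.
        self._pending.append(data)

    def recommend(self, batch_size):
        # Flush lazily, right before the metadata is actually needed.
        for data in self._pending:
            self.searchspace.discrete.mark_as_measured(data)  # hypothetical call
        self._pending.clear()
        ...  # proceed with the actual recommendation
```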
However, apart from having to maintain another cache and the additional logic involved, this makes the other problems I've described even worse, because you effectively shatter the metadata information into pieces and distribute them across two objects: part of it sits in the campaign until the cache is flushed, the rest sits in the search space. So what used to be stored in a single location could then even be spread across several campaigns, making the coupling problem much worse (i.e., what I showed in the example). So let's quickly forget about this 😬
Here is the potential solution I could imagine. For me, it solves all the problems and has only one potential downside, namely computational burden. But the latter is only a potential danger and we'd need to test it. In fact, I think it might not be a problem at all (and overall might even run faster than what we have now). So let's forget about computational cost for the moment and focus only on the layout; we'll talk about the other part later.
The idea would be to just kick the `SubspaceDiscrete.metadata` attribute altogether and handle things from the campaign side:

- `was_measured`: Very easy. We already have that information in the campaign anyway; it's literally just the `Campaign.measurements` attribute.
- `was_recommended`: Well, we'd just keep track of that as well. This would have the additional benefit that we can do so with richer information, e.g. by even counting how often something has been recommended. A simple dataframe with experimental columns would do it, containing just the configurations that were recommended, with optional additional columns like recommendation counters.
- `dont_recommend`: This is the most complex part but offers the possibility of a complete redesign with fine-tuned control over the candidate space. I could imagine maintaining an `exclude` attribute that collects user-provided exclusions. The form of that attribute could be designed in very different ways, depending on what we want to support. Three options as examples from the extreme ends of the spectrum (we could even combine them): excluding exact configurations, excluding via general conditions like `Param1==val`, or a mix of the above two options where the user specifies only some values and we fill the remaining columns in that row with a sentinel, etc.

The consequence is that the campaign fully controls its metadata, which additionally enables more flexibility, like "undoing" operations such as `campaign.remove_last_batch` (I have just added this point to the list of problems in the main description). With the new layout, this would simply be a `campaign.measurements.pop()` operation, assuming we store things internally as a list of batches (see the sketch below).
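A rough sketch of that storage layout and the resulting undo operation (all names hypothetical):

```python
import pandas as pd


class Campaign:
    """Sketch: measurements stored as a list of batch dataframes."""

    def __init__(self):
        self._batches: list[pd.DataFrame] = []

    def add_measurements(self, batch: pd.DataFrame) -> None:
        self._batches.append(batch)

    def remove_last_batch(self) -> pd.DataFrame:
        # Undo is trivial: no searchspace state needs to be rolled back.
        return self._batches.pop()

    @property
    def measurements(self) -> pd.DataFrame:
        # The flat view used elsewhere, assembled on demand.
        if not self._batches:
            return pd.DataFrame()
        return pd.concat(self._batches, ignore_index=True)
```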
The "drawback" is now that basically we need to merge this metadata with the search space before every recommendation. But I don't expect this to become a real bottleneck since:
Your rant has valid and invalid points (e.g., the attribute is de facto not private, nor should it be according to the current logic that explicitly allows modification, hence one shouldn't argue based on it being private). But I'm not interested in solving theoretical problems of pandas dataframes. After all, it does not address my initial argument: users seem to be happy with this and know how to use it, or do you have evidence to the contrary? We shouldn't assume users are idiots who cannot set a column in a dataframe based on an index...
It also misses the important point: it's super minimalistic and lightweight, yet powerful enough to achieve most of what's needed. This is something that feels rather lacking from your proposed solution, in particular what's written under `dont_recommend`; that's quite some additional logic... for what?
Also remember that once `active_values` is generalized, there will be much less need to even touch the metadata.
Furthermore, we already have evidence that the current method might be computationally demanding (https://github.com/emdgroup/baybe/issues/344#issuecomment-2301921910). Based on your proposal, your solution seems equally or more demanding. Solving one problem while worsening another is not the way to go. Consider my proposal below; maybe we can additionally move the involved dataframes optionally to polars and also improve merge speed.
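For what it's worth, the anti-join needed for such candidate filtering maps directly onto a native polars operation (toy frames; in practice these would be converted views of the pandas dataframes, e.g. via `pl.from_pandas`):

```python
import polars as pl

candidates = pl.DataFrame({"Param1": [1, 2, 3], "Param2": ["a", "b", "c"]})
measured = pl.DataFrame({"Param1": [2], "Param2": ["b"]})

# Native anti-join: keep candidates without an exact match in `measured`.
remaining = candidates.join(measured, on=["Param1", "Param2"], how="anti")
```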
Beyond that, I do see the problem of the current searchspace mutability, and it should be addressed. Did you consider the following much simpler solution, which should also solve the searchspace mutability: have a `discrete_metadata` frame in the `Campaign` and not the `SearchSpace`?
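In code, that would amount to little more than moving the existing frame (attrs-style sketch, names as proposed; nothing here is final):

```python
import pandas as pd
from attrs import define, field


@define
class Campaign:
    """Sketch only: the metadata frame moves from the searchspace to here."""

    searchspace: object  # could then become immutable
    # Same columns as today (was_measured, was_recommended, dont_recommend),
    # indexed like the discrete subspace's experimental representation:
    discrete_metadata: pd.DataFrame = field(factory=pd.DataFrame)
```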
Let me add my opinion as well.
- The `discrete_metadata` field however feels weird to me, since even a `Campaign` that does not even have a discrete but e.g. a continuous subspace would have that.
- Either way, the metadata would end up being handled by the `Campaign`, right? So I guess we are all fine with that?
- `dont_recommend` is actually the one field that causes problems and which also feels special. Which raises my question: is this even "metadata"? For me, this feels like something different; maybe it might make sense to rethink how we handle that in general?
I currently see a larger set of problems that can all be traced back to the fact that our current `SubspaceDiscrete` class is mutable, and we should consider making it immutable in the future. Here, I would only like to highlight a few of the problems so that they are not forgotten. We can then deal with them in autumn when I'm back (I already have a potential solution in mind, but more on that later).

One of the biggest problems – since it can easily lead to unintended behavior / silent bugs – is that people might not properly take into account the statefulness of the problem, perhaps because they are simply unaware. Even if they are aware, they can easily forget. I've seen it happen many times already, in fact.
**Simple example**
Consider someone who wants to run several studies on the same basic setup, e.g. testing two recommenders. A common thing that people might do:
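(Sketched here with placeholder ingredients: the two `RandomRecommender`s merely stand in for the two recommenders being compared, and constructor details may differ between BayBE versions.)

```python
from baybe import Campaign
from baybe.objectives import SingleTargetObjective
from baybe.parameters import NumericalDiscreteParameter
from baybe.recommenders import RandomRecommender
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget

# Created ONCE and then shared between both studies -- that's the pitfall:
searchspace = SearchSpace.from_product(
    [NumericalDiscreteParameter(name="x", values=(1.0, 2.0, 3.0, 4.0))]
)
objective = SingleTargetObjective(target=NumericalTarget(name="y", mode="MAX"))

for recommender in (RandomRecommender(), RandomRecommender()):
    campaign = Campaign(
        searchspace=searchspace, objective=objective, recommender=recommender
    )
    print(campaign.recommend(batch_size=2))
```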
At first glance, this seems legit, but it is wrong: the second campaign uses the metadata that was modified during the first run. The problem can be easily avoided by making the space immutable, just like all the other ingredients we offer for specifying campaigns. Of course, this requires a different approach to metadata handling.
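Since the codebase is attrs-based anyway, the direction of the fix is mechanical; a minimal sketch:

```python
import pandas as pd
from attrs import define, field


@define(frozen=True)
class SubspaceDiscrete:
    """Sketch: a frozen class, so rebinding attributes raises immediately."""

    exp_rep: pd.DataFrame = field()
    # Note: `frozen=True` only prevents attribute rebinding; the contained
    # dataframe would still need to be treated as read-only by convention.
```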
After this illustrative case, let me just summarize all the issues I currently see. There might be more, so I'll update the list whenever I spot another one: