SeitaBV / timely-beliefs

Model data as beliefs (at a certain time) about events (at a certain time).
MIT License

Retain source lineage when multiplying deterministic uni-source BeliefsDataFrames #33

Open Flix6x opened 4 years ago

Flix6x commented 4 years ago

I'm proposing a new feature here (multiplying data from different sensors), and would like to discuss how to handle source lineage, that is, the ability to track the source of a data value.

Consider the case of multiplying power data (MW units) with price data ($/MWh units) to obtain costs data ($/h units). The relevant operations for each of the columns / index levels in the BeliefsDataFrame should be:

How should we handle the source? Three options I can think of: ***

  1. Set the source to None, accepting the loss of source lineage (note we already lose one of the two belief_times)
  2. Set the source to a list of BeliefSources, introducing the concept of multi-sourced values.
  3. Create a new BeliefSource (if it doesn't exist already), which leaves open the possibility of source lineage if the new source holds information about its component sources. That is, it could later be modelled as an AggregatedBeliefSource that subclasses BeliefSource, which would likewise introduce the concept of multi-sourced values.

* In case one of the frames uses event_start and the other uses event_end, respect the index perspective of the first frame.

** In case one of the frames uses belief_time and the other uses belief_horizon, respect the index perspective of the first frame.

*** There is an analogy here to the issue of how we would handle the sensor attribute of the resulting BeliefsDataFrame: 1. None, 2. a list of Sensors or 3. a new AggregatedSensor.
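
For illustration, option 3 could be sketched roughly as follows. This is only a sketch under assumptions: the `BeliefSource` class below is a stand-in that assumes nothing more than a `name`, not the library's actual class, and the `components` attribute and the `multiply_sources` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BeliefSource:
    """Stand-in for the library's BeliefSource; only a name is assumed here."""

    name: str


@dataclass(frozen=True)
class AggregatedBeliefSource(BeliefSource):
    """Hypothetical source that remembers the sources it was derived from."""

    components: Tuple[BeliefSource, ...] = ()


def multiply_sources(a: BeliefSource, b: BeliefSource) -> AggregatedBeliefSource:
    """Build the source for the product of two uni-source frames (option 3)."""
    return AggregatedBeliefSource(name=f"({a.name} * {b.name})", components=(a, b))


meter = BeliefSource("power meter")
market = BeliefSource("day-ahead market")
cost_source = multiply_sources(meter, market)

print(cost_source.name)                          # name of the derived source
print([s.name for s in cost_source.components])  # lineage is still available
```

Because the dataclasses are frozen, two aggregated sources built from the same components compare equal and are hashable, which would make the "if it doesn't exist already" lookup straightforward.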

nhoening commented 4 years ago

I'd say we should consider how this multiplication feature will be used most of the time. Is the result going to be persisted, or is it only used temporarily? In the former case a source might be interesting to have, but not in the latter. And even in the former case, it's easy to add a source after multiplication.

I believe in our current usage, we are doing the latter. We might use the former for performance reasons (caching multiplications), but we're not exactly sure yet.

I find the idea of multi-source values interesting, but I would separate them from this feature for now. I would make a comment about the information being lost (also the belief_horizon bit).
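
To illustrate the point about adding a source after multiplication, here is a rough sketch in plain pandas, not the BeliefsDataFrame API. The index level names `event_start` and `source`, the column name `event_value`, the `frame` helper and the string source labels are all assumptions made for this example, and belief times are left out for brevity.

```python
import pandas as pd

# Three hourly events shared by both sensors
event_starts = pd.date_range(
    "2020-01-01 00:00", periods=3, freq=pd.Timedelta(hours=1), tz="UTC", name="event_start"
)


def frame(values, source):
    """Build a small deterministic, uni-source frame (hypothetical helper)."""
    index = pd.MultiIndex.from_product(
        [event_starts, [source]], names=["event_start", "source"]
    )
    return pd.DataFrame({"event_value": values}, index=index)


power = frame([2.0, 3.0, 1.5], "meter")      # MW
price = frame([40.0, 50.0, 20.0], "market")  # $/MWh

# Each frame has exactly one source, so dropping the source level is unambiguous.
# The product aligns on event_start and has units of $/h.
costs = power.droplevel("source") * price.droplevel("source")

# Attach a source afterwards (here just a label describing the combination)
costs["source"] = "meter * market"
costs = costs.set_index("source", append=True)

print(costs)
```

The intermediate product carries no source at all, which is exactly the information loss discussed above; the last two statements show that re-attaching one is cheap when the result needs to be persisted.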

Flix6x commented 4 years ago

Separating this issue from the multiplication feature doesn't necessarily make things easier.

If we set the resulting source to None, then the resulting frame will not be a valid BeliefsDataFrame anymore (several properties of our subclass would fail), so the result would have to be a pandas DataFrame. That also means we lose the slicing and plotting methods of our subclass.