nikhilwoodruff commented 2 years ago

Although this might be relevant to Core, I suspect it'd need a much longer discussion to avoid breaking changes, so filing here with a view to implementing as a patch. There have been a few attempts already in #64 , but with some bugs so I though I'd sketch out the cleanest implementation here.

The problem

Some variables are only relevant to a small subset of the population. For example, Massachusetts income tax only needs to be calculated for people and groups in Massachusetts, and not the rest of the population. Right now, we implement the tax as simply zero for those other people, but this causes wasted computation time and space for 98% of entities, because NumPy vectorised operations happen regardless of the retrospective filter at the end.

The solution

We could have the following variable definition:

class ma_tax(Variable):
  value_type = float
  label = "MA income tax"
  definition_period = YEAR
  unit = TaxUnit

  def eligible(tax_unit, period, parameters):
    return tax_unit.household("state_code", period) == "MA"

  def formula(tax_unit, period, parameters):
    ...

eligible is run first to determine the relevant subset of the population, and then the main formula next. This will be much more efficient iff the formula is much more complex than eligible.

64 has a prototype of the implementation, but it's buggy and needs more thought. I think there's a clean way to do this, intercepting the population passed to the formula to only return the subset values.

cc @MattiSG, @MaxGhenis, @rickecon

MattiSG commented 2 years ago

Thanks @nikhilwoodruff for having taken the time to formalise this!

Since this is performance optimisation, may I suggest to first run a performance analysis? It is often hard to measure precisely the performance impact of a specific implementation and optimisations, especially in the context of vectorial computing 🙂 Lacking specific rules for such cases, I would personally ask for a clear measurement of expected gains before supporting addition of such a feature in Core. You can see an example of such a demonstration in https://github.com/openfisca/openfisca-core/issues/1027.

Currently, I do not know of major needs for performance optimisation internationally on the computing side —more on the memory side. If this need is specific to the case of the US and its 50 states (logically leading to distribute 2% of cases in the disjunction), maybe the creation of a helper wrapping systematic calls to where(), or leveraging extensions could help? I know New South Wales went for an architecture with strong splitting through extensions, this might of inspiration 🙂

cc @benjello @maukoquiroga @liamdmccann

benjello commented 2 years ago

@nikhilwoodruff :

Disclaimer: I did not read your implementation in the PR
I don't think that casting the formula to a smaller population size will increase your speed performance
I recognize that there is a real memory problem. You can somehow make it less important by populating the cache_blacklist of the simulation by intermediate variables that are computed only once.
The real solution to this problem is IMHO (and you may already have heard that from me) to have relational / correspondence links between entities. In this case, you would have MA resident as a separate entity with its own variables (ma_tax) with a one2one relation to the larger US resident entity with US wide generic variables (wage, residence). I actually see this need as a recurrent one (linking firms to employees, employee to retirement schemes and so on).

cc @eraviart

nikhilwoodruff commented 2 years ago

Thanks both for these contributions!

@MattiSG - yes, I'll add a performance analysis. In the US case I'm guessing this roughly equates to a 50x speedup for each affected variable. As for extensions, I think they might not work for our use case, because we might want to see for example how a change to federal tax law affects total state tax liability, in which case we'd need all the individual state variables to run together.

@benjello -

I don't think that casting the formula to a smaller population size will increase your speed performance

Could you elaborate? My thinking is that since the array sizes are smaller, and NumPy does interate over them at some low level, it'll be quicker.

You can somehow make it less important by populating the cache_blacklist of the simulation by intermediate variables that are computed only once.

Nice, thanks for this tip. Will keep it in mind here.

The real solution to this problem is IMHO (and you may already have heard that from me) to have relational / correspondence links between entities.

Yes, really like this idea and this would be a great feature to add to Core, IMO.

benjello commented 2 years ago

@nikhilwoodruff :

I don't think that casting the formula to a smaller population size will increase your speed performance

Could you elaborate? My thinking is that since the array sizes are smaller, and NumPy does interate over them at some low level, it'll be quicker.

NumPy uses the vector capabilities of modern mathematical processors. Modulo that data can be handled by RAM the speed of the computation does not depend much on vector size.

PolicyEngine / openfisca-tools

Partial formula execution #65

The problem

The solution

64 has a prototype of the implementation, but it's buggy and needs more thought. I think there's a clean way to do this, intercepting the population passed to the formula to only return the subset values.