Transport-for-the-North / caf.core

Core classes and definitions for CAF family of tools
GNU General Public License v3.0
0 stars 1 forks source link

IPF calculation can introduce unusual values into data #47

Open asongtoruin opened 1 month ago

asongtoruin commented 1 month ago

When the IPF factors are calculated, they're just done through a straightforward division:

https://github.com/Transport-for-the-North/caf.core/blob/90c9f53fd59cbd2f0a8a876fbcb56198cbcf02fb/src/caf/core/data_structures.py#L1405

However, if we have a 0 in:

I think this latter case is happening in our population calcs, as we're ending up with NaN values somewhere in the data post-IPF. Feels like something that should be fairly easy to handle, maybe even just with a .fillna(0) - @isaac-tfn, any thoughts?

isaac-tfn commented 1 month ago

Thanks for raising, yes this is something I expected to cause issues, I think the first and last scenarios are easy to handle, i.e. anywhere the target is zero, the result should be zero so yes, for those nans just infill with zero. Where agg is zero but target isn't is the issue, as that will prevent convergence. It depends how much we trust the seed DVector; if we trust it then this scenario is fine as those cells will always remain zero, and as long as there aren't too many zeros the convergence shouldn't take too big if a hit. If we don't trust those zeros in the seed we may want to infill zeros before the process even starts so those cells can be adjusted. Probably best just to have an 'infill zeros' parameter on the method I think?

asongtoruin commented 1 month ago

I think for the time being it may be sufficient to just report if infinite values are cropping up, maybe including a report of where they're happening, and leave it to the used to handle seeding? I feel like something that blanket replaces zeroes could be too coarse.

isaac-tfn commented 1 month ago

OK for now I'm just adding a warning that there are inf factors, and the user can double check where the zeros are in the seed. I forgot to tag the issue but should be working now

asongtoruin commented 1 month ago

We're having an issue with the .inf check, a ValueError: The truth value of a Series is ambiguous. Pretty sure it's here:

https://github.com/Transport-for-the-North/caf.core/blob/9a8e9248bf4e1c1c9e740ca5bd223402f31b9020/src/caf/core/data_structures.py#L1414

factor.data is a DataFrame, so .any returns a Series. This line should probably be:

if factor.data.eq(np.inf).any(axis=None):

(or .any().any() if axis is not available)

isaac-tfn commented 1 month ago

Sorry I had tested it on a series so it was returning one value, will fix now.