Transport-for-the-North / caf.core

Core classes and definitions for CAF family of tools
GNU General Public License v3.0
0 stars 1 forks source link

Potential for nan indices when combining DVectors #38

Open asongtoruin opened 1 month ago

asongtoruin commented 1 month ago

Background

As part of our Land-Use tests of the updated caf.core code, I keep seeing nan values arise in the indices of our DVectors, particularly when we are multiplying them together.

Sometimes this just results in a UserWarning: This operation has dropped some rows due to exclusions in the resulting segmentation., i.e. we are triggering this part of _generic_dunder:

https://github.com/Transport-for-the-North/caf.core/blob/1f4a9344a8b4ef09a4d3f734ef514a22f25423ac/src/caf/core/data_structures.py#L725-L731

Which is maybe fine - the nan values are being dropped.

The other issue we're having is a larger (and breaking) one - sometimes we get an IndexError: The segmentation of the read in dvector data does not match the expected segmentation. This is likely due to unconsidered exclusions., i.e. validate_segmentation is going awry:

https://github.com/Transport-for-the-North/caf.core/blob/1f4a9344a8b4ef09a4d3f734ef514a22f25423ac/src/caf/core/segmentation.py#L307-L315

Crucially, this is happening during an .expand_to_other operation, as a result of a .add_segment call and the associated validation of the new segmentation - i.e. it's occurring earlier in the process than the warning

Understanding the problem

I think the issue is being caused by an incompatibility between the Exclusions. We have pop_econ 3 listed as incompatible with pop_emp 1, 2, 3, & 4 - i.e. it's mapped solely to pop_emp 5:

https://github.com/Transport-for-the-North/caf.core/blob/1f4a9344a8b4ef09a4d3f734ef514a22f25423ac/src/caf/core/segments/pop_econ.yml#L9-L29

There no restrictions on age_9 for pop_econ 3, however pop_emp 5 is only compatible with age_9 1, 2, 3, or 9:

https://github.com/Transport-for-the-North/caf.core/blob/1f4a9344a8b4ef09a4d3f734ef514a22f25423ac/src/caf/core/segments/pop_emp.yml#L57-L84

So we can have e.g. data in one dataset with pop_econ 3 and age_9 4 which is a combination not present in another dataset (containing pop_emp).

The thing I don't understand is why sometimes it only gives us a warning and sometimes fully errors out.

Possible solutions

  1. The exclusions should probably be updated to make sure they're compatible. I'm not totally sure on the reasoning behind this change in exclusions, so I'm not sure how they need to be updated.
  2. Is there any scope to add some kind of utility function to confirm whether our exclusions are compatible with eachother? Should this be a part of the test suite?

@isaac-tfn any thoughts? I can pass across example files with each corresponding response if required.

isaac-tfn commented 1 month ago
  1. On the exclusions - I think this is an issue with the way these two segments are defined. Economically inactive could be any age, but the only economically inactive segments defined in pop_emp are non-working_age, and students. As students exists in both they map onto each other so I excluded pop_emp: students from pop_econ: economically inactive. A solution would either be to alter the definition of pop_emp: 3 from 'unemployed' (which by definition means no job but economically active) to 'not working' for example, which would include the econmically inactive for reasons other than age or being a student.

  2. There should be more validation of segment compatibility etc. yes. I don't have time to work on that right now as that's more than a hot fix, but I may be able to find some next week.

For now could you update segment definitions/exclusions as appropriate on your end and we'll review on our end in the future? If with sensible exclusions etc. this issue persists I'll have a closer look at what's going on.

If you have the code causing this issue in any kind of snippet it would be helpful if you could share it so I can have a look if I get a chance today.

asongtoruin commented 1 month ago

@isaac-tfn I've added the relevant files into F:\Working\Land-Use\OUTPUTS\TEMP_NAN CHECKS, along with an error_demo.py that reproduces the two issues.

isaac-tfn commented 1 month ago

Thanks, any joy with fixing the segments/exclusions?

MatteoGravelluTfN commented 1 month ago

@asongtoruin, could you please confirm if this is providing issues? I understand the code was running with some size errors.