alan-turing-institute / uatk-spc

Synthetic Population Catalyst
https://alan-turing-institute.github.io/uatk-spc/
MIT License

Possible issue: missing first ID in protobuf output? #31

Closed: nickmalleson closed this issue 2 years ago

nickmalleson commented 2 years ago

In the protobuf output I think the first entry in the households and people sections might be missing an ID.

To replicate, using a file I created earlier (apologies, I've not tested with other files):

import json
import os

from google.protobuf.json_format import MessageToJson
import synthpop_pb2  # generated protobuf bindings for the SPC schema

pop = synthpop_pb2.Population()
with open(os.path.join("output", "Avon_and_Somerset_Constabulary.pb"), "rb") as f:
    pop.ParseFromString(f.read())
json_str = MessageToJson(pop)  # Convert the protobuf to a JSON string
json_obj = json.loads(json_str)

Then examining the JSON object:

print(json_obj['households'][0])
print(json_obj['households'][1])

prints:

{'msoa': 'E02002985', 'origHid': '200298561', 'members': ['0', '1', '2']}
{'id': '1', 'msoa': 'E02002985', 'origHid': '200298581', 'members': ['3', '4']}

See how the first household is missing the 'id'? This is the same with 'people' as well.

dabreegster commented 2 years ago

Urgh, this is an artifact of how protobuf encodes optional data -- in proto3, a field that was never set can't be distinguished from one set to the default "zero" value for its type, so an ID of 0 is simply left out of the output. Described more in https://alan-turing-institute.github.io/uatk-spc/code_walkthrough.html#protocol-buffers.

The ID itself is there, but protobuf chose not to encode it for efficiency. There's probably an option when converting to JSON to explicitly "fill out" these cases. I'll look around shortly.
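In the meantime, a script that already has the parsed JSON could restore the omitted ID itself. This is only a minimal sketch: the helper name `fill_default_ids` is hypothetical, the sample data is the two households quoted above, and it assumes that a missing 'id' always means the default value 0 (written as the string '0', to match how the other IDs appear in the JSON). Any other field whose value is zero, false, or empty would be omitted in the same way and is not handled here.

```python
import copy


def fill_default_ids(population: dict) -> dict:
    """Return a copy in which a missing 'id' is restored to '0'.

    Hypothetical helper: assumes a missing 'id' means the proto3 default (0).
    """
    filled = copy.deepcopy(population)  # leave the caller's object untouched
    for section in ("households", "people"):
        for entry in filled.get(section, []):
            # Only add the key when the JSON conversion left it out.
            entry.setdefault("id", "0")
    return filled


# The two households quoted in the issue, as parsed from the JSON string.
sample = {
    "households": [
        {"msoa": "E02002985", "origHid": "200298561", "members": ["0", "1", "2"]},
        {"id": "1", "msoa": "E02002985", "origHid": "200298581", "members": ["3", "4"]},
    ]
}

fixed = fill_default_ids(sample)
print(fixed["households"][0]["id"])
print(fixed["households"][1]["id"])
```

Patching the dictionary after the fact is a workaround; asking the JSON converter to emit default values is the cleaner route when it is available.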

dabreegster commented 2 years ago

Yep, the option is including_default_value_fields, a parameter of MessageToJson; see https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html. Updating the example script...
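For anyone reading along, here is a rough sketch of what that option changes. It is not the updated example script. Since `synthpop_pb2` is not available outside the repo, it uses `type_pb2.Field`, a proto3 message that ships with the protobuf package, as a stand-in whose `number` field is left at 0. The helper name `to_dict_with_defaults` is hypothetical, and the fallback keyword `always_print_fields_with_no_presence` is an assumption about newer protobuf releases, where the original keyword appears to have been renamed.

```python
import inspect

from google.protobuf import json_format, type_pb2


def to_dict_with_defaults(message):
    """Convert a message to a dict, keeping fields that hold their default value.

    Hypothetical helper: picks whichever keyword the installed protobuf accepts.
    """
    params = inspect.signature(json_format.MessageToDict).parameters
    if "including_default_value_fields" in params:
        return json_format.MessageToDict(
            message, including_default_value_fields=True
        )
    # Assumption: newer releases renamed the keyword.
    return json_format.MessageToDict(
        message, always_print_fields_with_no_presence=True
    )


# Stand-in proto3 message; `number` is deliberately left at its default of 0.
field = type_pb2.Field(name="id")

plain = json_format.MessageToDict(field)  # default-valued fields are omitted
full = to_dict_with_defaults(field)       # default-valued fields are kept

print("number" in plain, "number" in full)
```

The same keyword can be passed to MessageToJson, so the script from the original report should only need that one extra argument.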

dabreegster commented 2 years ago

If you rerun the sample script, the ID fields will show up now. It's still kind of confusing; proto3 is not the best choice, but I ran out of time to try out flatbuffers and some alternatives. If we do later switch to one of those or to proto2, we'd publish a new version of the SPC schema and output data, and help you fix up any of your scripts. (The changes to your code would be pretty minimal.)