CityofSantaMonica / mds-provider

Python tools for working with MDS Provider data
https://github.com/openmobilityfoundation/mobility-data-specification
MIT License
18 stars 20 forks source link

fix read_data_file to handle lists of payloads #60

Closed yoimnik closed 5 years ago

yoimnik commented 5 years ago

if we specify a file path containing a list of payloads, which is how the ingest script outputs, then the library doesn't handle the case that the file may be a list of payloads

yoimnik commented 5 years ago

@thekaveman requesting review

thekaveman commented 5 years ago

Thanks for this.

I feel like there might be some cross-over concern with some of what ProviderDataLoader is doing in load_from_source here https://github.com/CityofSantaMonica/mds-provider/blob/dev/mds/db/load.py#L138.

I've been wanting to cleanly refactor some of this functionality but haven't had a chance yet to dig into the specifics or think about what it would look like.

thekaveman commented 5 years ago

Your fix looks good.

One additional concern is that this change essentially trusts that the rest of the files also have the same version as the first. Do you have any thoughts on this? Potential workaround is only extend data when the version matches the first.

yoimnik commented 5 years ago

hmm, i didn't put too much thought into this earlier

i don't think we should only extend data when the version matches the first. this is a bit of implied behavior and could end up confusing a developer down the road.

i think it's reasonable to expect the developer to know what version of the payload they're working with. if we're using semantic versioning (ie. major.minor.patch):

thekaveman commented 5 years ago
  • if they're working with a list of payloads that have a major version difference, then this should throw with an error message
  • how do we handle minor version differences? are schema changes allowed in minor version bumps in MDS?

I like the idea of throwing an error on a mismatch instead of silently reading only matching data.

In general, a MINOR version change in MDS should only include non-breaking changes (adding a non-required field for example). The only complication is with MDS being early in development still (pre-1.0.0), the MINOR version changes are potentially breaking.

yoimnik commented 5 years ago

I like the idea of throwing an error on a mismatch instead of silently reading only matching data.

@thekaveman added this :)

thekaveman commented 5 years ago

A suggestion on the implementation of this:

Capture the version from the first page, and compare each page["version"] with that initial one, throwing the first time you find a mismatch. Something like:

data = []
version = payload[0]["version"]

for page in payload:
    if page["version"] == version:
        data.extend(page["data"][record_type])
    else:
        raise ValueError("Version mismatch detected.")

As of your most recent commit, I think the version check is not at the correct indentation level.