climate-resource / bookshelf

A bookshelf of useful preprocessed data
https://climate-resource.github.io/bookshelf/
MIT License

Type handling #87

Open lewisjared opened 1 year ago

lewisjared commented 1 year ago

In GitLab by @zebedee.nicholls on Apr 6, 2023, 15:53

The problem

I was using the PRIMAP bookshelf the other day, and something not ideal happened. In short, the data appears to be saved or read by pandas (I'm not sure exactly where this happens) with mixed types, which means the categories aren't handled correctly. See the code excerpt below

>>> shelf = bookshelf.BookShelf()
>>> primap_book = shelf.load("primap-hist", version="v2.4.2")
>>> emissions = primap_book.timeseries("by_country")
>>> emissions.get_unique_meta("category")
# 2 and 1 appear as both ints and strings, comes with a DtypeWarning
# from pandas
[0,
 1,
 '1',
 '1.A',
 '1.B',
 '1.B.1',
 '1.B.2',
 '1.B.3',
 '2',
 2,
 '2.A',
 '2.B',
 '2.C',
 '2.D',
 '2.G',
 '2.H',
 '3',
 '3.A',
 '4',
 '5',
 'M.0.EL',
 'M.AG',
 'M.AG.ELV',
 'M.LULUCF']

Definition of "done"

The mixed types are removed. For this particular example, I think everything should be a string. Having seen this use case, though, there may be more general principles that could be applied.

Additional context

The workaround is relatively easy, emissions["category"] = emissions["category"].astype(str).
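
Applied in context, the workaround looks like the short sketch below (the same calls as in the excerpt above); after the cast, get_unique_meta returns strings only:

import bookshelf

shelf = bookshelf.BookShelf()
primap_book = shelf.load("primap-hist", version="v2.4.2")
emissions = primap_book.timeseries("by_country")

# Force the mixed int/str column back to a single (string) type
emissions["category"] = emissions["category"].astype(str)
emissions.get_unique_meta("category")  # all entries now strings, e.g. "0", "1", "2", ...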

This test fails (but feels like the wrong way to test for this bug as it is quite slow and a very high-level integration test for what seems like a low-level issue)

from bookshelf import BookShelf


def test_primap_book(local_bookshelf):
    shelf = BookShelf()
    primap_book = shelf.load("primap-hist", version="v2.4.2")

    emissions = primap_book.timeseries("by_country")
    assert emissions.get_unique_meta("category") == [
        "0",
        "1",
        "1.A",
        "1.B",
        "1.B.1",
        "1.B.2",
        "1.B.3",
        "2",
        "2.A",
        "2.B",
        "2.C",
        "2.D",
        "2.G",
        "2.H",
        "3",
        "3.A",
        "4",
        "5",
        "M.0.EL",
        "M.AG",
        "M.AG.ELV",
        "M.LULUCF",
    ]
lewisjared commented 1 year ago

Nice find.

This may suggest the need for a more formalised "data dictionary" describing the permissible dimensions of metadata, the data type of the dimension and (for some dimensions) their permissible values. There could then be some checks to validate this.

I wonder whether the "data dictionary" ends up becoming a concept in scmdata one day.
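
As a rough illustration of the idea (none of these names exist in bookshelf or scmdata; this is purely a hypothetical sketch), a "data dictionary" could be little more than a mapping from meta dimension to its expected type and, where relevant, its permissible values, plus a check that runs before data is handed back to the user:

import pandas as pd

# Hypothetical data dictionary: meta dimension -> expected element type and,
# optionally, the permissible values. Column names are illustrative only.
DATA_DICTIONARY = {
    "category": {"type": str, "allowed": None},  # e.g. "1", "1.A", "M.AG", ...
    "region": {"type": str, "allowed": None},
    "unit": {"type": str, "allowed": None},
}


def validate_meta(meta: pd.DataFrame, data_dictionary=DATA_DICTIONARY) -> None:
    """Raise if a declared meta column has the wrong element type or unexpected values."""
    for column, spec in data_dictionary.items():
        if column not in meta.columns:
            continue
        # An element-wise type check catches mixed int/str inside an 'object'
        # column, which is exactly the failure mode reported in this issue
        if not meta[column].map(lambda v: isinstance(v, spec["type"])).all():
            raise TypeError(f"{column} contains values that are not {spec['type'].__name__}")
        if spec["allowed"] is not None:
            unexpected = set(meta[column]) - set(spec["allowed"])
            if unexpected:
                raise ValueError(f"{column} contains unexpected values: {sorted(unexpected)}")

Something like this could run when bookshelf loads a resource, before returning the timeseries to the user.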

lewisjared commented 1 year ago

@zebedee.nicholls This is actually a bug in scmdata.

The values are all strings when written to file, but when reading, pandas casts some of the values to ints, seemingly at random (3, 4 and 5 aren't converted, and not all of the 1s and 2s are either). Maybe this behaviour is because the column starts with a 0?

This suggests that some additional metadata is needed, a la #21, but with a dtype specified as well. While bookshelf can ensure the dtypes of the meta before returning the result to the user, we should discuss how much responsibility falls on scmdata for some of this functionality.

The current implementation assumes all columns are type 'object' and doesn't do any dtype inference. I don't even know how differing dtypes for meta columns would affect other operations.

Perhaps we need to add a dtype parameter in ScmData when reading in files? That is a pretty low impact change, but I haven't thought through it in detail.

We might also get some of the performance improvements from pyarrow-backed columns if they have a non-object dtype
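
For what it's worth, the DtypeWarning in the original excerpt is the one pandas emits when it infers dtypes chunk by chunk (the low_memory=True default of read_csv), which would also explain why the int/str split looks random. Below is a minimal sketch of the "pass dtypes through to pandas" idea, assuming plain pandas.read_csv underneath; the meta_dtypes dict and the file name are made up for illustration:

import pandas as pd

# Hypothetical mapping of meta columns to the dtype we want, instead of
# letting pandas guess. Names here are illustrative only.
meta_dtypes = {"category": str, "region": str, "unit": str, "scenario": str}

# Option 1: declare the dtypes up front, so no inference happens for these columns
df = pd.read_csv("by_country.csv", dtype=meta_dtypes)

# Option 2: let pandas infer, but read the file in a single pass so the
# inference is at least consistent across the whole column (this also
# silences the DtypeWarning)
df = pd.read_csv("by_country.csv", low_memory=False)

If scmdata grew a dtype-style argument on its read path, it could more or less forward something like this straight to read_csv.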

lewisjared commented 1 year ago

In GitLab by @zebedee.nicholls on May 11, 2023, 11:40

Ok nice find. A possible way through could be:

Responsibility for this probably does fall mainly onto scmdata. I guess I just want to avoid us hacking around pandas the entire time (so I think we want two paths: one which just lets pandas do its thing, and one which 'takes over' from pandas and makes everything explicit, i.e. doesn't let pandas do any guessing).

lewisjared commented 1 year ago

In GitLab by @zebedee.nicholls on May 11, 2023, 11:40

Having written the above, though, I also haven't fully thought it through, so I'm happy to try a different solution if that is more attractive.

lewisjared commented 1 year ago

In GitLab by @mikapfl on May 11, 2023, 17:59

I guess the Table Schema spec (https://specs.frictionlessdata.io/table-schema) could be one way to specify metadata like types etc. It is not more or less effort than rolling something ourselves like we did in primap2. With primap2 I made the mistake of trying frictionless' code (the frictionless framework); it was broken two years ago and is broken now, so I gave up on frictionless completely. However, frictionless' specifications look stable and maybe useful. Unfortunately, the lack of working software to automatically turn a pandas DataFrame into a fully-specified data package and back means that you also don't gain much.
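
For reference, a Table Schema descriptor for the meta columns would look roughly like the dict below (written as Python rather than JSON; the field names and the enum are illustrative, not the actual bookshelf schema):

# Sketch of a frictionless Table Schema descriptor covering the meta columns
table_schema = {
    "fields": [
        {
            "name": "category",
            "type": "string",
            # constraints/enum is how Table Schema expresses permissible values
            "constraints": {"enum": ["0", "1", "1.A", "1.B", "2", "M.AG", "M.LULUCF"]},
        },
        {"name": "region", "type": "string"},
        {"name": "unit", "type": "string"},
        {"name": "scenario", "type": "string"},
    ]
}

This only pins down column-level typing; the licence/citation-style metadata discussed next would still need a separate format.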

If I had to choose a format to describe higher-level metadata (license, citation info, title, etc.), I'd rather use the .zenodo.json format because it is actually used in the field and, if you have it, publishing to Zenodo is easier. Unfortunately, it doesn't allow specifying, e.g., the type of columns.

lewisjared commented 1 year ago

In GitLab by @mikapfl on May 11, 2023, 18:04

For round-tripping pandas DataFrames via CSV, we use quoting and optionally specify fancier dtypes in a metadata object. But the quoting alone is already enough for simple cases like the one triggering this bug.
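
A minimal sketch of the quoting approach, assuming plain to_csv/read_csv and a made-up file name (note that a comment further down links a pandas issue suggesting quoting alone is not fully honoured on read, so treat this as the simple-case version):

import csv

import pandas as pd

df = pd.DataFrame({"category": ["0", "1", "1.A", "2"], "value": [1.0, 2.0, 3.0, 4.0]})

# Write side: quote every non-numeric field, so "1" is stored quoted in the CSV
# rather than as a bare 1
df.to_csv("emissions.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)

# Read side uses the same convention so quoted fields should stay strings.
# Passing an explicit dtype as well is the defensive option, given the pandas
# issue linked later in this thread.
roundtripped = pd.read_csv(
    "emissions.csv", quoting=csv.QUOTE_NONNUMERIC, dtype={"category": str}
)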

lewisjared commented 1 year ago

Using a frictionless spec could be nice, as other metadata already uses the datapackage specification. Technically each data bundle that is served is a data-package containing multiple data-resources. I haven't validated that the bundle is strictly conformant with the spec, but given how loose it is I'd be surprised if it wasn't.

I haven't looked at table-schema spec in detail, but I'm assuming that we would end up with a bastardised version of it where we only track the metadata columns and ignore the fact that there are other columns.

Even if we follow the primap2 way, we would still need to modify that specification to cover the typing, triggering changes in both ecosystems.

In a perfect world, the scmdata meta file would be a subset of the primap2 interface format given that it is more flexible, but I'm not sure the engineering effort would be worth it, as we will end up writing interfaces to convert to/from everything anyway. I'm not sure the exact format matters too much as long as the concepts map.

Sounds like I need to draft an RFC :grinning:

I'd rather use the .zenodo.json

@mikapfl Do you have a link to the specification for this?

lewisjared commented 1 year ago

That might be the temporary fix in this case

lewisjared commented 1 year ago

In GitLab by @mikapfl on May 11, 2023, 19:16

Regarding the .zenodo.json, there is https://developers.zenodo.org/#representation. It is not super pretty, but it defines all the supported keys and pointers and the controlled vocabulary.

lewisjared commented 1 year ago

Even quoting the string values still causes an issue: https://github.com/pandas-dev/pandas/issues/31821

lewisjared commented 1 year ago

In GitLab by @mikapfl on May 12, 2023, 22:04

Or, another idea: store netCDF files in the bookshelf instead of CSV. netCDF already handles all the typing problems in a machine-readable fashion.
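
To illustrate why netCDF avoids the problem, here is a sketch using xarray directly (not bookshelf's own serialisation code; it assumes a netCDF backend such as netCDF4 is installed): the string category coordinate keeps a string dtype through a write/read round trip rather than being re-inferred.

import numpy as np
import xarray as xr

# String category labels as a coordinate alongside numeric data
ds = xr.Dataset(
    {"emissions": (("category", "year"), np.random.rand(3, 2))},
    coords={"category": ["1", "1.A", "2"], "year": [2019, 2020]},
)

ds.to_netcdf("emissions.nc")
roundtripped = xr.open_dataset("emissions.nc")

# The labels come back as strings, not a mix of ints and strings
print(roundtripped["category"].values)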

lewisjared commented 3 weeks ago

The primap-hist repo (and its test) has been moved to https://github.com/climate-resource/bookshelf-primap-hist