GSS-Cogs / databaker

Command line tool to convert spreadsheets to databases, made for the UK's Office for National Statistics.
Other
1 stars 0 forks source link

Databaker Enhancements #2

Closed mikeAdamss closed 3 years ago

mikeAdamss commented 4 years ago

Databaker Enhancements

Purely intended as ideas and wishlisting. The more thoughts and conversation the merrier.

I've tried to avoid things that are already covered by other tasks that are in the works (documentation, generating codelists)

Lookup to offset range

So still a cardinal direction but within a certain range on the other axis, veeeery roughly like HDim(cells, "Label", CLOSEST, ABOVE, OFFSET=(-1, 3) or posibly HDim(cells, "Label", WITHIN(-1, 3), ABOVE)

Speedup

full implementation of previous spike Code, i.e "lookup engines" rather than 400 trillion cell lookups per sheet. https://github.com/GSS-Cogs/databaker/tree/gss_speedup_plus

Positional Override

so like a value lookup, but overriding that values based on the cell xy position rather than it's text value. something (again roughly) like:

lookup = {
    "x2y3": "Cheese",
    "x23y3": "Chips"
}

or more likely

lookup = {
    "B4": "Cheese",
    "C4": "Chips"
}

implemented (possibly, but I don't really care) as HDim(cells, "Label", CLOSEST, ABOVE, positionaloverride=lookup)

This would let you override specific header cell values even where other cells in the bag contain the same value (the current cellvalueoverride looks up a new value based on the cell contents, this becomes an issue when for example for have two headers called 'Total' and you want to change one of them).

Assertions

We should probably have easy access to wrapped assertions via the databaker api, it seems an easier way to get far more robust code. could be as simple as follows (Im making up the 'is', 'must' and 'can' methods purely to get the gist across, not particularly thought out):

cells = tab.excel_ref("A2").expand(RIGHT).is(dbtypes.TIME)`
cells = tab.excel_ref("A2").expand(RIGHT).is_one of("cheese", "chips")
cells = tab.excel_ref("A2").expand(RIGHT).can_be(int)
cells = tab.excel_ref("A2").expand(RIGHT).can_not_be(float)
cells = tab.excel_ref("A2").expand(RIGHT).must_have("cheese", "chips").must_not_have("gravy")

You can do any of the above with assert statements and regexes, my point is that we shouldn't have to.

Remove is_not_whitespace()

This is important but can just be wrapped into .is_not_blank() If we didn't have so much legacy code was could remove both and just have an .include_blank() for the rare occasions we'd want to instead, but it's probably not worth the headaches.

Format time

We keep having to do this ourselves in literally every transformation, it'd be nice to just stick it in databaker. Again, not a thought out implmentation, but the gist would be: eg HDim(cellBag, "Ref Period", CLOSEST, LEFT, time=True) if the time format provided didn't match (or was incorrect with) our time patterns, then instead pass a dict to map pattern to handling

timeMap = {
  '^[0-9]{4}-Q[1-4]$': dbtime.gregorianQuarters
}
HDim(cellBag, "Ref Period", CLOSEST, LEFT, time=timeMap)
mikeAdamss commented 3 years ago

im going to close this, it never really became the team space for enhancements I'd imagined and the thinking has moved on, we're probably better off raising ideas as separate issues against our databaker repo.