Closed aesharpe closed 1 year ago
In the service of cleaning out constants.py and getting constants stored closer to where they're used, do the original string codes need to be stored anywhere other than the transform module that they are used in? For metadata purposes, it seems like we could include only the full name and the long description, and ideally in our big metadata collection, have the explicit mapping from full name to description enumerated, and use that to construct the column-level ENUM and descriptive metadata programmatically.
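As a rough sketch of what "construct the column-level ENUM and descriptive metadata programmatically" could look like: the dictionary name, the helper `build_column_metadata`, the shape of the returned metadata, the `new_contract` full name, and the placeholder description are all assumptions for illustration, not anything that exists in the codebase.

```python
# Hypothetical full name -> long description mapping (illustrative values only).
CONTRACT_TYPE_DESCRIPTIONS = {
    "contract": "Fuel received under a purchase order....",
    "new_contract": "(placeholder description)",
}


def build_column_metadata(column_name, descriptions):
    """Derive ENUM values and descriptive text for one column.

    The returned dict layout is an assumption, not an established schema.
    """
    return {
        "name": column_name,
        # The allowed values are exactly the full names, in declaration order.
        "enum": list(descriptions),
        # One line of documentation per allowed value.
        "description": "\n".join(
            f"{full_name}: {text}" for full_name, text in descriptions.items()
        ),
    }


contract_type_metadata = build_column_metadata(
    "contract_type", CONTRACT_TYPE_DESCRIPTIONS
)
```

With something like this, the full name to description mapping stays the single source of truth, and both the ENUM constraint and the column documentation are derived from it.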
The use of coding tables to document the short codes and enforce a set of allowable values is taking care of this now.
Many of the datasets we work with contain columns populated with string codes whose meaning can only be deciphered by reading a separate layout file. EX: C = Contract or NC = New Contract. Currently, the ETL outputs static CSV files that contain the string code to full name and definition mapping, but this information is not readily available to those strictly accessing data through the SQL database or pandas output tables. This issue aims to eliminate all string codes in the database and replace them with their full names.
The constants file contains a number of dictionaries that already map the string codes to a few sentences of definition. EX:
{'C': 'Contract -- Fuel received under a purchase order....'}
I propose that:
The dictionaries currently in the constants file be reconfigured such that string code maps to full name and the definition moves to the metadata. EX: {'C': 'contract'}. From here on called "string code to full name dictionary".
Dictionaries are made for all string code columns not currently accounted for above.
All columns with string codes be given a categorical datatype linked to their respective string code to full name dictionary. EX:
'contract_type': pd.CategoricalDtype(contract_dict.values())
All string code to full name dictionaries be located in a common module (constants or otherwise) for easy and uniform access for categorical enumeration (above).
All metadata for these columns include a full list of their respective categorical elements and a broader description if necessary (and a link to the dictionary if that's possible).
The transform step reference all string code to full name dictionaries such that no columns are left with string codes in the final data.
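A minimal sketch of how the steps proposed above might fit together in a transform function, assuming pandas. The names `CONTRACT_TYPE_NAMES` and `decode_column`, the `new_contract` full name, and the sample data are hypothetical and only meant to illustrate the idea.

```python
import pandas as pd

# Hypothetical "string code to full name dictionary" for one column.
CONTRACT_TYPE_NAMES = {"C": "contract", "NC": "new_contract"}


def decode_column(codes, code_to_name):
    """Replace string codes with full names and return a categorical Series."""
    # Fail loudly on codes the dictionary does not know about, so that no
    # raw string codes slip through into the final data.
    unknown = set(codes.dropna().unique()) - set(code_to_name)
    if unknown:
        raise ValueError(f"Unmapped string codes: {sorted(unknown)}")
    # The categories are the full names from the dictionary.
    dtype = pd.CategoricalDtype(categories=list(code_to_name.values()))
    # Missing codes stay missing after the mapping.
    return codes.map(code_to_name).astype(dtype)


# Made-up input data for illustration.
raw = pd.DataFrame({"contract_type": ["C", "NC", "C", None]})
clean = raw.assign(
    contract_type=decode_column(raw["contract_type"], CONTRACT_TYPE_NAMES)
)
```

Raising on unmapped codes is one possible design choice; mapping them to a missing value instead would also work, at the cost of hiding new or misspelled codes in the raw data.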