RaedMan closed 7 years ago
I want to clarify what was being asked here.
Typically we use the term "data dictionary" to refer to a tabular summary of the fields available in a table, or group of tables.
What I needed, and what we developed, is a tabular summary of the data sets and files that are currently in use and currently available on the analytics server. The problem was that we had source data in unusual locations with obscure names, and it was impossible to determine a file's origin (source and date) from the file name alone.
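For illustration only, here is a minimal sketch of how a file inventory like this could be generated. It is not the Data Inventory file from this issue: the function names, the column names, and the choice of a SHA-256 checksum are all assumptions. The source and date-received columns are left blank because they cannot be derived from the file itself and would have to be filled in by hand.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout; the real inventory may use different fields.
FIELDS = ["path", "size_bytes", "modified_utc", "sha256", "source", "date_received"]


def build_inventory(root):
    """Return one dict per file under `root`, sorted by relative path."""
    root = Path(root)
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        rows.append({
            "path": path.relative_to(root).as_posix(),
            "size_bytes": stat.st_size,
            "modified_utc": modified.isoformat(timespec="seconds"),
            # The checksum lets us tell later whether a file was altered.
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            # Not derivable from the file itself; to be filled in manually.
            "source": "",
            "date_received": "",
        })
    return rows


def write_inventory(rows, out_path):
    """Write the inventory rows to a CSV file with a header line."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Writing the CSV outside the scanned directory keeps the inventory from listing itself on the next run.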
Personally, I'd prefer to move to a verbose but consistent naming scheme in the directory structure, so that we can preserve original source files exactly as we received them, including their original file names, and simultaneously be able to identify the files (by the directory structure). So I would like something along these lines:
data/raw/<source>/<date>/<original file name>
data/processed/<source>/<friendly file name>
However, it's not really my place to make modifications like that, since we're working downstream of the DSSG project.
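As a rough sketch of the layout proposed above (not something that exists in the project), the two path patterns could be encoded in small helpers like these. The function names, the `data` root default, and the use of ISO dates (`YYYY-MM-DD`) for the `<date>` component are assumptions made for the example.

```python
from datetime import date
from pathlib import Path


def raw_path(source, received, original_name, root="data"):
    """Build data/raw/<source>/<date>/<original file name>.

    The original file name is kept untouched; the date is written in
    ISO format (an assumption, the proposal does not fix a format).
    """
    return Path(root) / "raw" / source / received.isoformat() / original_name


def processed_path(source, friendly_name, root="data"):
    """Build data/processed/<source>/<friendly file name>."""
    return Path(root) / "processed" / source / friendly_name


def describe_raw(path, root="data"):
    """Recover (source, date received, original name) from a raw path."""
    parts = Path(path).relative_to(Path(root) / "raw").parts
    if len(parts) != 3:
        raise ValueError(f"not a raw data path: {path}")
    source, received, original_name = parts
    return source, date.fromisoformat(received), original_name
```

For example, with a made-up source name, `raw_path("city_portal", date(2016, 3, 1), "export (2).csv")` gives `data/raw/city_portal/2016-03-01/export (2).csv`, and `describe_raw` recovers the source and date from that path without needing to rename the file.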
I pushed up a Data Inventory file.
Building on the meeting with Eric and @geneorama: this is a reminder to publish this when completed.