emilyriederer / website

Blog / website repo
https://emilyriederer.com

Column Names as Contracts | Emily Riederer #9

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Column Names as Contracts | Emily Riederer

Using controlled dictionaries for low-touch documentation, validation, and usability of tabular data

https://emilyriederer.netlify.app/post/column-name-contracts/

morgango commented 3 years ago

Very interesting ideas, kind of like Hungarian notation but for data. The company I work for has something called Elastic Common Schema which is another take on this, but more focused on how the data is ingested versus how the data is modeled.

One of the challenges with all these syntaxes is portability. For example, if you have a system that understands geospatial data and you export data to a system that does not, do you change the name of the column? I would say yes, but that might be unexpected behavior downstream.

Also, as your data architecture evolves, your syntax may need to evolve as well. For example, the world used to be all CSV/RDBMS related with columns and rows being the best way to represent things. Then, JSON and NoSQL came along and turned things on their ear. I would be interested to see how you might approach a more complex data model with the syntax.

Great article!

bvancil commented 3 years ago

Thank you so much for writing this! I especially appreciate the depth of thought you manifested in metadata discoverability and data testing.

I'm curious to know, if you were dealing with many probabilities (perhaps predicted in a modeling step), would you prefer extending the level 1 schema, e.g. P_SUBJECT_DETAIL or PRB_SUBJECT_DETAIL, or using the level 1 VAL_ prefix, e.g. VAL_SUBJECT_DETAIL_PROB?

jeffzi commented 3 years ago

Thanks for the nice (and very complete) write-up.

I follow similar naming conventions but I use level 1 - measure types as a suffix instead of a prefix. In my opinion it makes it easier to identify similar level 2 - measure subjects. It's subjective, but I also find it more natural to read.

For example: N_TRIP_ORIG, N_TRIP_DEST, AMT_TRIP_DIST become TRIP_ORIG_N, TRIP_DEST_N, TRIP_DIST_AMT
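
To make the transformation concrete, here's a minimal R sketch (the trips table and its values are hypothetical) that mechanically moves a level 1 prefix to a suffix:

```r
library(dplyr)

# Hypothetical table using the prefix convention from the post
trips <- tibble::tibble(
  N_TRIP_ORIG   = 3L,
  N_TRIP_DEST   = 5L,
  AMT_TRIP_DIST = 12.4
)

# Move the leading measure type to the end: N_TRIP_ORIG -> TRIP_ORIG_N
trips_suffixed <- rename_with(trips, ~ sub("^([A-Z]+)_(.*)$", "\\2_\\1", .x))

names(trips_suffixed)
#> [1] "TRIP_ORIG_N"   "TRIP_DEST_N"   "TRIP_DIST_AMT"
```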

emilyriederer commented 3 years ago

@morgango - Thanks so much for those references! I've enjoyed reading up on both Hungarian notation and ECS. I love seeing the similarities / differences. Even more, thank you for flagging those challenges.

I didn't emphasize it in the article, but I also hope conventions like these could help preserve data lineage (since the same quantities should naturally end up with the same names). Your point on GIS is a fascinating case where different system requirements force you to either sacrifice the standard or break the contract. Like you, I'm inclined to sacrifice the standard to ensure the contract holds (e.g. change the name if the data can no longer perform as specified), but it is uncomfortable.

I'd have to think more about NoSQL, but for JSON, my first guess is that you might define appropriate names at different levels of nesting. The GitHub API I think does a decent job of this with entities as the top level and some level of standardization in the fields within each. Do you think that could work?
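
For instance (a purely illustrative sketch, not something from the article), a nested record might reuse the same controlled vocabulary inside each entity:

```r
# Hypothetical nested record: the top level names entities (driver, trip),
# and each entity repeats the same standardized field names inside it
trip_event <- list(
  driver = list(ID = "D123", N_TRIP_PRIOR = 42L),
  trip   = list(ID = "T456", AMT_DIST = 12.4, DT_START = "2021-05-01")
)
```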

Thanks so much for writing back! I really love hearing these corner cases.

emilyriederer commented 3 years ago

@bvancil - That's an interesting question! I think it probably depends a bit on how I would intend to use those probabilities, but my initial reaction would be to add to the level 1. My reasoning is that I probably have fundamentally different validation checks to perform on these (e.g. they should be bounded between 0 and 1) and potentially different use cases. Of course, it's just as possible to use select helpers (ends_with()) or regex (.*_PROB$) to still handle variables the same way if the information is in the suffix, but to me this feels like "first-class" information. I tend to think of the suffix as more like adjectives: AMT_SUBJECT_3MO or AMT_SUBJECT_6MO might reflect trailing 3 month and 6 month observations for the same measure of the same subject.
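
As a rough sketch of what I mean (the column names and values here are made up), the dedicated prefix makes both the selection and the shared validation rule one-liners:

```r
library(dplyr)

# Hypothetical scores table with model-predicted probabilities
scores <- tibble::tibble(
  ID_SUBJECT        = c("A", "B"),
  PRB_SUBJECT_CHURN = c(0.12, 0.87),
  PRB_SUBJECT_FRAUD = c(0.03, 0.40)
)

# A dedicated level 1 prefix pulls every probability with one helper...
prbs <- select(scores, starts_with("PRB_"))

# ...and the shared validation rule (bounded between 0 and 1) applies to all of them
stopifnot(all(sapply(prbs, function(x) all(x >= 0 & x <= 1))))

# The suffix variant works too, just via ends_with() or a regex:
# select(scores, ends_with("_PROB"))
# select(scores, matches("_PROB$"))
```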

emilyriederer commented 3 years ago

@jeffzi - Thank you for sharing! You make a good point that with the subject at the beginning and the measure type at the end, both can be highlighted. I like that from the autocomplete perspective as well; perhaps users would rather "search" based on the entity of interest (for which a measure might or might not exist) rather than a measure 🤔 . I'll have to try out that way, too!
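
For example (hypothetical names again), the suffix convention lets select helpers search either way:

```r
library(dplyr)

# Hypothetical table named with the measure type as a suffix
trips <- tibble::tibble(
  TRIP_ORIG_N   = 3L,
  TRIP_DEST_N   = 5L,
  TRIP_DIST_AMT = 12.4
)

# Search by the entity of interest (what autocomplete would surface first)...
select(trips, starts_with("TRIP_"))

# ...or still pull every count by measure type
select(trips, ends_with("_N"))
```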

CorneeldH commented 1 year ago

Thanks for sharing!

I read this before and included a reference in our internal style guide, and now I've read the comments, which also add value.

A few things I wanted to share:

  1. What you call "subject", or the aggregation level.
  2. Topic. For instance, if you had demographic variables of drivers, "DEM" would be a topic. Mind you, this is quite different from the CITY example in the sense that DEM can have a lot of different types etc.
  3. Source. Not sure about this one, but we have a limited number of sources and not all are stable, so being able to see very quickly where a variable comes from helps.
  4. Type. Based on your list, with some "local" adaptations. For instance, we have a lot of character variables. Some have around 100 options, and those often have a categorized version as well with 5 to 10 options. It is useful for us to make a distinction between them.

I was also thinking of giving each of those categories a different number of characters to help with checking. Additionally, the rest of the letters in the variable name could be lowercase; this would help with automated validation.
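
A minimal sketch of how that check might look in R (the category widths and example names are assumptions for illustration, not our actual scheme):

```r
# Assumed scheme: 3-letter type code, 2-letter source code, lowercase remainder
name_pattern <- "^[A-Z]{3}_[A-Z]{2}_[a-z0-9_]+$"

col_names <- c("CAT_HR_contract_type", "AMT_FI_salary", "badname")
setNames(grepl(name_pattern, col_names), col_names)
#> CAT_HR_contract_type        AMT_FI_salary              badname
#>                 TRUE                 TRUE                FALSE
```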

I also made the link with Hungarian notation. The "types" are really Apps Hungarian, and you could even combine it with code. For instance, you could have rules about naming dataframes or named vectors the way you name columns, etc.

At work we already have a few less thought-through elements of all the above in the codebase, and I'm still pondering / experimenting with how to improve / integrate all of that. This post really helped with that process!

emilyriederer commented 1 year ago

@CorneeldH - thank you so much for taking the time to comment! I think one of the weirdest parts of data work for those of us that love to work "in the open" is how most data models, naming conventions, etc. are enterprise IP that can't be easily shared or learned from. I always love to hear what others are actually doing in practice and which of my ideas do or don't transport well to other settings. I'd love to keep in touch and hear how your experimentation evolves / what you land on!

CorneeldH commented 1 year ago

Hi Emily, thanks for the immediate response :)

Actually, I really agree with you. I've been in the field for only a couple of years, but I'm having trouble finding "enterprise best practices". It is mostly bro-tutorials about some new cool model / package, and very heavy cloud-tooling that will solve everything and that people sometimes try to push on others they are working with.

However, as you state, there is a need for low-tech, low-friction things: in my mind, some conceptual best practices plus tools that make them easy to implement.

Your note about IP is spot on, I'm afraid. The reason I'm reading this and was reading your article about packages as jobs is that I'm currently involved in some sector-wide knowledge sharing / cooperation efforts. To some extent we are sharing IP, and we really have to think about what elements of our way of working are transferable to others and how to build something that makes it easy for others to cooperate, while also having some boundaries / validation.

Anyway, first off, I hope to find the time between my regular analysis work to make some progress conceptually and experiment with it; when I have done so, I will write / publish something about it and let you know :-)