MattTriano / analytics_data_where_house

An analytics engineering sandbox focusing on real estate prices in Cook County, IL
https://docs.analytics-data-where-house.dev/
GNU Affero General Public License v3.0
7 stars 0 forks

Should the system leave out the great_expectations anonymous_usage_statistics identifiers? #34

Open MattTriano opened 1 year ago

MattTriano commented 1 year ago

Per this GE docs page, the great_expectations team added code that lets them track usage of their library, and it can be disabled in the great_expectations.yml file. That page says more information is available in a blog post from 2020, but the given link is dead. Still, per the Wayback Machine's copy of that post, the GE team states:
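For reference, here is a minimal sketch of what "disabling it in great_expectations.yml" could look like as a script. It assumes the file has a top-level `anonymous_usage_statistics:` block containing an indented `enabled:` key; that key name and the helper `disable_usage_statistics` are my assumptions for illustration, not something quoted from the GE docs, so check them against the actual file before relying on this.

```python
import re


def disable_usage_statistics(yml_text):
    """Return yml_text with `enabled` set to false inside the
    top-level anonymous_usage_statistics block (assumed layout)."""
    out = []
    in_block = False
    for line in yml_text.splitlines():
        if re.match(r"^anonymous_usage_statistics:\s*$", line):
            in_block = True
        elif in_block and line.strip() and not line.startswith((" ", "\t")):
            # A new non-indented line means the block has ended.
            in_block = False
        elif in_block and re.match(r"^\s+enabled:", line):
            # Keep the original indentation, only change the value.
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}enabled: false"
        out.append(line)
    trailing_newline = "\n" if yml_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline
```

It edits the text line by line rather than parsing the YAML, so comments and key order in the file are left untouched, and `enabled:` keys under other top-level blocks are not changed.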

"We do not track credentials, validation results, or arguments passed to Expectations. We consider these private, and frankly none of our business. User-created names are always hashed, to create a longitudinal record without leaking any private information. We track types of Expectations, to understand which are most useful to the community."

This is very reasonable, and I'm keen to give the GE team information that helps them figure out which features are worth working on. However, my project is intended to be both a specific project and a platform that other people can fork to build their own pipelines (although, from the traffic page, I see people are mainly cloning the repo without forking). Every copy that keeps the committed UUID would report under the same identifier, which would produce a polluted longitudinal record, so I don't know whether I should strip the UUID out.

So I should experiment with stripping out this UUID (from both the /great_expectations/expectations/.ge_store_backend_id and .../great_expectations.yml files; per grep, all other appearances of the UUID are in the /.uncommitted/ dir) and see if anything complains when I run checkpoints.
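The grep check above could be repeated with a small script like the following sketch. The function name `find_uuid_occurrences`, the regex, and the choice to skip directories named `uncommitted` or `.uncommitted` are assumptions for illustration; it simply reports every UUID-shaped string in the files it is allowed to look at.

```python
import re
from pathlib import Path

# Matches the canonical 8-4-4-4-12 hexadecimal UUID text form.
UUID_PATTERN = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def find_uuid_occurrences(root, skip_dirs=("uncommitted", ".uncommitted")):
    """Return sorted (relative_path, line_number, uuid) tuples under root,
    ignoring any file that sits inside a directory named in skip_dirs."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if any(part in skip_dirs for part in rel.parts[:-1]):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for line_number, line in enumerate(text.splitlines(), start=1):
            for match in UUID_PATTERN.finditer(line):
                hits.append((rel.as_posix(), line_number, match.group(0)))
    return sorted(hits)
```

Running it on the great_expectations directory before and after stripping the UUID would confirm whether any committed file still carries the identifier.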

MattTriano commented 1 year ago

After watching this video from one of the leading Rust-lang evangelists about a Go-lang improvement plan to use this kind of anonymous telemetry, I think it's unambiguously good to provide this kind of telemetry to the great_expectations maintainers. But I also don't want to send them polluted signals, which is what would happen if potentially many different systems sent back telemetry under the same not-really-unique UUID (Universally Unique IDentifier).

As those goals are in conflict, I'll have to weigh what I think is better (sending possibly diluted/polluted feedback or sending no feedback).