bbengfort / bbengfort.github.io

My Github Pages Repository
http://bbengfort.github.io/
Creative Commons Zero v1.0 Universal

Anonymizing User Profile Data with Faker #3

Closed bbengfort closed 8 years ago

bbengfort commented 8 years ago

Probably want to do some re-ordering so that the faker examples come first, then the CSV anonymization; things are structured pretty awkwardly right now.

bbengfort commented 8 years ago

The profile generator is not a real provider yet ... I couldn't get that to work.

mitzimichal commented 8 years ago

Hi Ben! Just submitted a first pass edited version... Needs some more TLC and some reorg probably but wanted to get your initial thoughts...

First: wow. So good. It's amazing how humble (and dumb) I feel reading your stuff. Super cool.

And now to some comments:

1) The main change I made was to the "generate fake data" section. It was too long and a bit tangential. I cut the "create a provider" section, which didn't seem necessary; I was able to follow the later code sections without it. I think getting from the Anonymizing example to "managing data quality" more quickly is key (you may also want to think about changing the order to get a more direct transition...?)
2) I think you may want to add a quick line on the edit distance method FuzzyWuzzy uses. I assumed it's Levenshtein based on your pip command, but I think it's worth spelling out how similarity scores are calculated (even if it's only through a link to documentation in a footnote).
3) Does everybody know what "hashable" means? I had to commence Googling to figure it out, but your audience is probably more advanced than I am... If the post is also intended for novice folks, you may want a footnote pointing to some documentation on it...
4) Data set or dataset? You probably want to be consistent...
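
For reference on point 2, here is a minimal sketch of plain Levenshtein edit distance in Python, plus one common way to turn it into a 0 to 100 score. The `similarity` helper and its normalization are assumptions made for illustration; this is not FuzzyWuzzy's actual scoring code, which may compute its ratios differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Hypothetical 0-100 score: 100 * (1 - distance / length of longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - levenshtein(a, b) / longest))
```

With this sketch, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion), so the illustrative score for that pair is 57.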

bbengfort commented 8 years ago

@mitzimichal thank you so much for your help on this! I'll incorporate your changes and give you an acknowledgement at the end of the post!

bbengfort commented 8 years ago

@mitzimichal oh sorry - would you mind doing a pull request so I can see your changes?

Also one note on "hashable" and the level of audience for this post: this post is directed at data engineers who have a lot of experience (since they'll be the ones doing the anonymization). You're right, they'll probably be less familiar with the distance metric, but the provider and code aspects will be important to them.

The next post you're going to read from me is specifically for the research lab, about using the graph blocking approach. That post targets a more intermediate audience, specifically data science folks. I had already made a footnote for hashable in that post (which we may share authorship on, more on that later), so I'll just copy it over to this post!
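
As a quick illustration of what "hashable" means in Python: an object is hashable if `hash()` accepts it, which is what lets it serve as a dictionary key or set member. The sketch below uses made-up values and a hypothetical `is_hashable` helper; it is not taken from the post itself.

```python
def is_hashable(obj) -> bool:
    """Return True if Python can compute a hash for obj (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


email = "jane@example.com"    # str is immutable, so it is hashable
aliases = ["Jane", "J. Doe"]  # list is mutable, so it is not hashable

lookup = {email: "fake-email-001"}         # fine: a str can be a dict key
lookup[tuple(aliases)] = "fake-name-001"   # a tuple of strings is hashable too
```

Trying `lookup[aliases] = ...` with the list itself would raise `TypeError: unhashable type: 'list'`.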

mitzimichal commented 8 years ago

Hi @bbengfort! I did a pull request yesterday right before I commented here... did that not go through? I'll try again...

mitzimichal commented 8 years ago

@bbengfort did you get my file/pull request? I tried to create another pull request from the edited file on my forked repository but it told me our masters are now identical... I'm probably doing something wrong, just not sure what :)

bbengfort commented 8 years ago

Still not seeing the pull request; I've always found the pull request process on GitHub to be fairly black magic. The message might have meant that your fork is as up to date as mine, which it is.

Don't worry about it too much; I can see your changes and I'll be editing the DDL version directly! Rebecca has also contributed some edits; no one reads my GitHub pages anyway!

Ben

mitzimichal commented 8 years ago

haha sounds good :)