FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0

Data Science 2.0 #56

Open FRosner opened 8 years ago

FRosner commented 8 years ago

Overview

We would like to communicate the idea of test-driven data processing and data science. Maybe we can start with one or two meetups to get some feedback and then move on to a conference.

Idea

Typical data science workflows start with a pipeline of data transformations, usually for preprocessing and feature engineering. This is often followed by steps for model training, selection, and application.

In the software development lifecycle, continuous integration and test-driven development already get a lot of attention. They improve code quality and allow new developers to get started with the code quickly and make changes to it without the fear of breaking existing functionality.

When working with real data in real-world data science use cases, you will encounter data quality problems from the very beginning. User-defined transformations may introduce further problems or errors. Why don't we apply the ideas of continuous integration and test-driven development, at least partly, to the data science workflow as well?

When you then want to apply an existing transformation to a new version of your data source, the automated checks will tell you whether the subsequent steps (like feature engineering) are still valid. It is the basic principle of failing fast and avoiding bugs that would otherwise only be discovered at a later stage (e.g. through a model that performs badly).
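To make the fail-fast idea a bit more concrete, here is a minimal sketch in plain Python. It is not the DDQ API and does not use Spark; the names `Constraint`, `is_never_null`, `has_unique_key`, `run_checks`, and `engineer_features`, as well as the `id` and `age` columns, are made up for illustration. The assumption is simply that a dataset is a list of rows and that every check must pass before the next pipeline step runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Constraint:
    """A named data quality check over a whole dataset (hypothetical helper)."""
    name: str
    predicate: Callable[[List[Row]], bool]


def is_never_null(column: str) -> Constraint:
    # Passes only if every row has a non-None value in the given column.
    return Constraint(
        f"{column} is never null",
        lambda rows: all(row.get(column) is not None for row in rows),
    )


def has_unique_key(column: str) -> Constraint:
    # Passes only if no value in the given column occurs more than once.
    def check(rows: List[Row]) -> bool:
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))

    return Constraint(f"{column} is unique", check)


def run_checks(rows: List[Row], constraints: List[Constraint]) -> List[Row]:
    # Fail fast: raise before any downstream step sees bad data.
    failed = [c.name for c in constraints if not c.predicate(rows)]
    if failed:
        raise ValueError("Data quality checks failed: " + ", ".join(failed))
    return rows


def engineer_features(rows: List[Row]) -> List[Row]:
    # Example downstream step that silently relies on the checks above:
    # it would crash on a missing age.
    return [{**row, "age_decade": row["age"] // 10} for row in rows]


CONSTRAINTS = [has_unique_key("id"), is_never_null("age")]
```

A pipeline would then call `engineer_features(run_checks(rows, CONSTRAINTS))`. When a new version of the data source violates one of the constraints, the run stops with a message naming the failed checks instead of surfacing later as a badly performing model.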

Key Points

FRosner commented 8 years ago

@Gerrrr what do you think about one of the http://www.meetup.com/de/Spark-Munich/ meetups? Or shall we target a more general group? Since DDQ is ultimately a Spark library, I would go for Spark meetups.

FRosner commented 8 years ago

@Gerrrr I created a wiki page to draft some thoughts. If you would like to get write access, let me know :)

fabsta commented 8 years ago

Great idea! Let me know if you are planning a meeting and when/where that would be. I am from Munich as well.

FRosner commented 8 years ago

@fabsta thanks for your interest :)

We will have some internal discussions in December, and hopefully I will have some time to draft a few slides over Christmas. I will then try to set up a talk at a meetup group for next year :+1: