Open Triamus opened 6 years ago
docs
https://incisive.com/wp-content/uploads/downloads/whitepapers/Incisive_Automating_Data_Governance_WP.pdf
https://github.com/FRosner/drunken-data-quality
https://github.com/yandexdataschool/cms-dqm
https://github.com/minkymorgan/bytefreq
https://github.com/IQuOD/AutoQC
https://github.com/eBay/griffin/blob/master/griffin-doc/proposal.md
https://github.com/datacleaner/DataCleaner
https://github.com/poldracklab/mriqc
https://github.com/okfn/okfn.github.com/blob/master/blog/_posts/2016-05-17-automated-data-validation.md
https://github.com/KaveIO/Eskapade
https://github.com/OHDSI/Achilles
https://github.com/agile-lab-dev/DataQuality
Development Workflows for Data Scientists - Enabling Fast, Efficient, and Reproducible Results for Data Science Teams https://resources.github.com/downloads/development-workflows-data-scientists.pdf
https://github.com/sabman/data-validation-ideas
https://github.com/kenfar/DataGristle
https://github.com/alecthomas/voluptuous
https://github.com/data-cleaning/validate
https://github.com/data-cleaning
DATA VALIDATION IN SCADA SYSTEM https://www.theseus.fi/bitstream/handle/10024/17033/Opinnaytetyo_Puromaki_Toni.pdf?sequence=1
Literature Review of Data Validation Methods http://www.prepared-fp7.eu/viewer/file.aspx?fileinfoID=215
Methodology for data validation 1.0 Revised edition June 2016 Essnet Validat Foundation https://ec.europa.eu/eurostat/cros/system/files/methodology_for_data_validation_v1.0_rev-2016-06_final.pdf
Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment https://idl.cs.washington.edu/files/2012-Profiler-AVI.pdf
Data Quality Management - The Most Critical Initiative You Can Implement (SAS) http://www2.sas.com/proceedings/sugi29/098-29.pdf
How-to: Do Data Quality Checks using Apache Spark DataFrames http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
http://www.questionflow.org/2017/11/20/store-data-about-rows/
http://www.questionflow.org/2017/11/28/rule-your-data-with-tidy-validation-reports-design/
https://tjmahr.github.io/nonstandard-eval-register-machines/

Idea: look at the data from three perspectives: row-wise, column-wise, and the whole dataset. The quality tool only measures quality, but it exposes an API for dashboards and apps. For example, an app may take a measurement outcome and act on it, such as sending a notification or intervening in the data-generating process.

https://www.ebayinc.com/stories/blogs/tech/griffin-model-driven-data-quality-service-on-cloud-for-both-real-time-and-batch-data/
https://de.slideshare.net/mobile/HadoopSummit/using-hadoop-to-build-a-data-quality-service-for-both-realtime-and-batch-data
http://colinfay.me/tidyeval-1/
https://edwinth.github.io/blog/nse/
https://maraaverick.rbind.io/2017/08/tidyeval-resource-roundup/
https://edwinth.github.io/blog/dplyr-recipes/
https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Promise-objects
http://dplyr.tidyverse.org/articles/programming.html
http://www.brodrigues.co/blog/2017-06-19-dplyr-0-70-tutorial/
https://romain.rbind.io/blog/2017/07/01/excluding-rows
https://tjmahr.github.io/set-na-where-nonstandard-evaluation-use-case/
https://adv-r.hadley.nz/meta
https://cran.r-project.org/web/packages/datacheckr/index.html
http://www.win-vector.com/blog/2017/06/non-standard-evaluation-and-function-composition-in-r/
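
A rough sketch of the idea noted in the docs section above (three perspectives, with the tool only measuring and a separate app acting on the result). This is not taken from any of the linked projects. All names here (`row_wise`, `column_wise`, `dataset_wise`, `measure`, `notify_if_bad`, the `qty` column, the 0.9 threshold) are hypothetical and chosen only for illustration, and plain Python dicts stand in for a real data frame.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def row_wise(rows: List[Row], rule: Callable[[Row], bool]) -> List[bool]:
    """Row perspective: one pass/fail flag per row."""
    return [bool(rule(r)) for r in rows]


def column_wise(rows: List[Row], column: str, rule: Callable[[Any], bool]) -> float:
    """Column perspective: share of values in one column that satisfy the rule."""
    if not rows:
        return 1.0
    values = [r.get(column) for r in rows]
    return sum(1 for v in values if rule(v)) / len(values)


def dataset_wise(rows: List[Row], rule: Callable[[List[Row]], bool]) -> bool:
    """Whole-dataset perspective: a single pass/fail flag for the dataset."""
    return bool(rule(rows))


def measure(rows: List[Row]) -> Dict[str, Any]:
    """Only measures quality; returns a plain dict that a dashboard or app could read."""
    return {
        "row_ok": row_wise(rows, lambda r: r["qty"] is not None and r["qty"] >= 0),
        "col_complete": {"qty": column_wise(rows, "qty", lambda v: v is not None)},
        "dataset_ok": dataset_wise(rows, lambda rs: len(rs) > 0),
    }


def notify_if_bad(report: Dict[str, Any], threshold: float = 0.9) -> List[str]:
    """Hypothetical consumer app: turns low completeness scores into messages."""
    return [
        f"completeness of {col} is {score:.2f}"
        for col, score in report["col_complete"].items()
        if score < threshold
    ]


rows = [{"id": 1, "qty": 3}, {"id": 2, "qty": None}, {"id": 3, "qty": -1}]
report = measure(rows)
messages = notify_if_bad(report)
```

Keeping `measure` free of side effects is the point of the split: the same report could feed a dashboard, a notifier, or a process that intervenes upstream, without the measuring code knowing about any of them.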
stackoverflow
https://stackoverflow.com/questions/47599865/how-do-i-combine-varying-input-variables-and-varying-functions-in-dplyr-summaris
apache griffin
https://griffin.incubator.apache.org/
https://github.com/apache/incubator-griffin
Using Hadoop to build a Data Quality Service for both real-time and batch data https://www.slideshare.net/HadoopSummit/using-hadoop-to-build-a-data-quality-service-for-both-realtime-and-batch-data
Griffin - Model-driven Data Quality Service on the Cloud for Both Real-time and Batch Data https://www.ebayinc.com/stories/blogs/tech/griffin-model-driven-data-quality-service-on-cloud-for-both-real-time-and-batch-data/
https://www.ebayinc.com/stories/blogs/tech/monitoring-anomalies-in-the-experimentation-platform/
https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-65?filter=allopenissues
https://cwiki.apache.org/confluence/display/GRIFFIN/2.+Griffin+Job+Flow
https://mvnrepository.com/artifact/org.apache.griffin
visual representation of quality
https://stackoverflow.com/questions/27545423/visual-structure-of-a-data-frame-locations-of-nas-and-much-more
https://stackoverflow.com/questions/28813057/inspecting-and-visualizing-gaps-blanks-and-structure-in-large-dataframes
https://cran.r-project.org/web/packages/VIM/index.html
https://github.com/ropensci/visdat
data.table way
https://stackoverflow.com/questions/11872499/create-an-expression-from-a-function-for-data-table-to-eval
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r/8509301?stw=2#8509301
metaprogramming
https://github.com/mailund/meta-programming-in-r/blob/master/chapters/06_quotes_and_substitution.md