Triamus / play

play repo for experiments (mainly with git)

Data quality #16

Open Triamus opened 6 years ago

Triamus commented 6 years ago

docs

https://incisive.com/wp-content/uploads/downloads/whitepapers/Incisive_Automating_Data_Governance_WP.pdf
https://github.com/FRosner/drunken-data-quality
https://github.com/yandexdataschool/cms-dqm
https://github.com/minkymorgan/bytefreq
https://github.com/IQuOD/AutoQC
https://github.com/eBay/griffin/blob/master/griffin-doc/proposal.md
https://github.com/datacleaner/DataCleaner
https://github.com/poldracklab/mriqc
https://github.com/okfn/okfn.github.com/blob/master/blog/_posts/2016-05-17-automated-data-validation.md
https://github.com/KaveIO/Eskapade
https://github.com/OHDSI/Achilles
https://github.com/agile-lab-dev/DataQuality
Development Workflows for Data Scientists - Enabling Fast, Efficient, and Reproducible Results for Data Science Teams https://resources.github.com/downloads/development-workflows-data-scientists.pdf
https://github.com/sabman/data-validation-ideas
https://github.com/kenfar/DataGristle
https://github.com/alecthomas/voluptuous
https://github.com/data-cleaning/validate
https://github.com/data-cleaning
DATA VALIDATION IN SCADA SYSTEM https://www.theseus.fi/bitstream/handle/10024/17033/Opinnaytetyo_Puromaki_Toni.pdf?sequence=1
Literature Review of Data Validation Methods http://www.prepared-fp7.eu/viewer/file.aspx?fileinfoID=215
Methodology for data validation 1.0, Revised edition June 2016, Essnet Validat Foundation https://ec.europa.eu/eurostat/cros/system/files/methodology_for_data_validation_v1.0_rev-2016-06_final.pdf
Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment https://idl.cs.washington.edu/files/2012-Profiler-AVI.pdf
Data Quality Management - The Most Critical Initiative You Can Implement (SAS) http://www2.sas.com/proceedings/sugi29/098-29.pdf
How-to: Do Data Quality Checks using Apache Spark DataFrames http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
http://www.questionflow.org/2017/11/20/store-data-about-rows/
http://www.questionflow.org/2017/11/28/rule-your-data-with-tidy-validation-reports-design/
https://tjmahr.github.io/nonstandard-eval-register-machines/

Idea: look at the data from three perspectives: row-wise, column-wise, and the whole dataset. The quality tool only measures quality, but it provides an API for dashboards and apps. For example, an app may take a measurement outcome and act on it, such as sending a notification or intervening in the data-generating process.

https://www.ebayinc.com/stories/blogs/tech/griffin-model-driven-data-quality-service-on-cloud-for-both-real-time-and-batch-data/
https://de.slideshare.net/mobile/HadoopSummit/using-hadoop-to-build-a-data-quality-service-for-both-realtime-and-batch-data
http://colinfay.me/tidyeval-1/
https://edwinth.github.io/blog/nse/
https://maraaverick.rbind.io/2017/08/tidyeval-resource-roundup/
https://edwinth.github.io/blog/dplyr-recipes/
https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Promise-objects
http://dplyr.tidyverse.org/articles/programming.html
http://www.brodrigues.co/blog/2017-06-19-dplyr-0-70-tutorial/
https://romain.rbind.io/blog/2017/07/01/excluding-rows
https://tjmahr.github.io/set-na-where-nonstandard-evaluation-use-case/
https://adv-r.hadley.nz/meta
https://cran.r-project.org/web/packages/datacheckr/index.html
http://www.win-vector.com/blog/2017/06/non-standard-evaluation-and-function-composition-in-r/
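
Rough sketch of the idea noted among the links above (three perspectives, with measurement kept separate from action), in Python using only the standard library. Everything here is an assumption made for illustration: the names `Measurement`, `measure` and `failures_to_notify`, the sample rows, and the rules are hypothetical and not taken from any of the linked tools.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class Measurement:
    """One rule outcome (hypothetical record type for this sketch)."""
    perspective: str  # "row", "column" or "dataset"
    target: str       # row index, column name, or "dataset"
    rule: str         # name of the rule that was evaluated
    passed: bool


def measure(
    rows: List[Row],
    row_rules: Dict[str, Callable[[Row], bool]],
    column_rules: Dict[str, Tuple[str, Callable[[List[Any]], bool]]],
    dataset_rules: Dict[str, Callable[[List[Row]], bool]],
) -> List[Measurement]:
    """Evaluate all rules and return plain measurements; take no action."""
    results: List[Measurement] = []
    # Row-wise: every rule sees one row at a time.
    for index, row in enumerate(rows):
        for name, check in row_rules.items():
            results.append(Measurement("row", str(index), name, bool(check(row))))
    # Column-wise: every rule sees all values of one column.
    for name, (column, check) in column_rules.items():
        values = [row.get(column) for row in rows]
        results.append(Measurement("column", column, name, bool(check(values))))
    # Dataset-wise: every rule sees the whole dataset.
    for name, check in dataset_rules.items():
        results.append(Measurement("dataset", "dataset", name, bool(check(rows))))
    return results


def failures_to_notify(measurements: List[Measurement]) -> List[str]:
    """Example consumer: an app turning failed measurements into messages."""
    return [
        f"{m.perspective}:{m.target} failed {m.rule}"
        for m in measurements
        if not m.passed
    ]


# Made-up sample data and rules.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},
    {"id": 2, "amount": None},
]
row_rules = {
    "amount_non_negative": lambda r: r["amount"] is None or r["amount"] >= 0,
}
column_rules = {
    "id_unique": ("id", lambda v: len(v) == len(set(v))),
    "amount_complete": ("amount", lambda v: all(x is not None for x in v)),
}
dataset_rules = {"not_empty": lambda rs: len(rs) > 0}

results = measure(rows, row_rules, column_rules, dataset_rules)
messages = failures_to_notify(results)
```

`measure` only returns data; deciding whether to notify someone or to intervene in the data-generating process is left to a consumer such as `failures_to_notify`, which could just as well feed a dashboard.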

stackoverflow

https://stackoverflow.com/questions/47599865/how-do-i-combine-varying-input-variables-and-varying-functions-in-dplyr-summaris

apache griffin

https://griffin.incubator.apache.org/
https://github.com/apache/incubator-griffin
Using Hadoop to build a Data Quality Service for both real-time and batch data https://www.slideshare.net/HadoopSummit/using-hadoop-to-build-a-data-quality-service-for-both-realtime-and-batch-data
Griffin - Model-driven Data Quality Service on the Cloud for Both Real-time and Batch Data https://www.ebayinc.com/stories/blogs/tech/griffin-model-driven-data-quality-service-on-cloud-for-both-real-time-and-batch-data/
https://www.ebayinc.com/stories/blogs/tech/monitoring-anomalies-in-the-experimentation-platform/
https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-65?filter=allopenissues
https://cwiki.apache.org/confluence/display/GRIFFIN/2.+Griffin+Job+Flow
https://mvnrepository.com/artifact/org.apache.griffin

visual representation of quality

https://stackoverflow.com/questions/27545423/visual-structure-of-a-data-frame-locations-of-nas-and-much-more
https://stackoverflow.com/questions/28813057/inspecting-and-visualizing-gaps-blanks-and-structure-in-large-dataframes
https://cran.r-project.org/web/packages/VIM/index.html
https://github.com/ropensci/visdat

data.table way

https://stackoverflow.com/questions/11872499/create-an-expression-from-a-function-for-data-table-to-eval
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r/8509301?stw=2#8509301

metaprogramming

https://github.com/mailund/meta-programming-in-r/blob/master/chapters/06_quotes_and_substitution.md