westurner opened this issue 5 years ago
"ColNet: Embedding the Semantics of Web Tables for Column Type Prediction" https://arxiv.org/pdf/1811.01304.pdf :
Synthetic Columns: In training, ColNet automatically extracts labeled samples from the KB. A training sample s := (e, c) is composed of a synthetic column e and a class c ∈ C, while a synthetic column is constructed by concatenating a specific number of entities.
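For intuition, here's a minimal sketch of that sampling step in Python (the toy KB and function names are hypothetical, not ColNet's actual code):

```python
import random

# Hypothetical toy KB: class -> entities known to belong to that class.
kb = {
    "dbo:City": ["London", "Paris", "Tokyo", "Oslo", "Lagos"],
    "dbo:River": ["Nile", "Danube", "Amazon", "Mekong"],
}

def synthetic_column(entities, size):
    """Build one synthetic column by concatenating `size` entities."""
    return random.sample(entities, size)

def training_samples(kb, size=3, n_per_class=2):
    """Yield (synthetic_column, class) pairs, i.e. s := (e, c)."""
    for cls, entities in kb.items():
        for _ in range(n_per_class):
            yield synthetic_column(entities, size), cls

for column, cls in training_samples(kb):
    print(cls, column)
```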
https://westurner.github.io/hnlog/#comment-18957269 :
Featuretools https://github.com/Featuretools/featuretools
Featuretools is a python library for automated feature engineering. [using DFS: Deep Feature Synthesis]
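For example, a minimal DFS run over the demo EntitySet that ships with Featuretools (note: the keyword naming the target table is `target_dataframe_name` in Featuretools 1.x; older releases called it `target_entity`):

```python
import featuretools as ft

# Bundled demo data: customers -> sessions -> transactions.
es = ft.demo.load_mock_customer(return_entityset=True)

# Deep Feature Synthesis stacks aggregation/transform primitives across
# the table relationships to generate features per customer.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",  # `target_entity` in pre-1.0 releases
    max_depth=2,
)
print(feature_matrix.head())
```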
auto-sklearn does feature selection (with e.g. PCA) in a "preprocessing" step, as well as "One-Hot encoding of categorical features, imputation of missing values and the normalization of features or samples" https://automl.github.io/auto-sklearn/master/manual.html#turning-off-preprocessing
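A minimal auto-sklearn fit, sketched under the assumption of a recent API (the linked manual section shows disabling that preprocessing search; the keyword was `include_preprocessors=['no_preprocessing']` in older releases and later moved into `include={...}`):

```python
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# auto-sklearn searches over preprocessors (PCA, one-hot encoding,
# imputation, scaling, ...) as part of each candidate pipeline;
# `no_preprocessing` turns that search off, per the linked manual section.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    include={"feature_preprocessor": ["no_preprocessing"]},
)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))
```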
auto_ml uses "Deep Learning [with Keras and TensorFlow] to learn features for us, and Gradient Boosting [with XGBoost] to turn those features into accurate predictions" https://auto-ml.readthedocs.io/en/latest/deep_learning.html#feature-learning
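Roughly, per auto_ml's docs, feature learning is switched on at train time with `feature_learning=True` plus a separate `fl_data` frame; this is a from-memory sketch with a toy DataFrame, so verify against the project's README:

```python
import pandas as pd
from auto_ml import Predictor

df_train = pd.DataFrame({
    "age": [22, 38, 26, 35, 28, 54],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "survived": [0, 1, 1, 1, 0, 1],
})

# Mark the label column; unlisted columns default to numeric features.
ml_predictor = Predictor(
    type_of_estimator="classifier",
    column_descriptions={"survived": "output"},
)

# A Keras/TensorFlow net learns features on fl_data, then XGBoost
# gradient boosting is trained on top of them. In practice fl_data
# should be a held-out frame; reusing df_train here keeps the sketch short.
ml_predictor.train(df_train, feature_learning=True, fl_data=df_train.copy())
```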
... "Ask HN: Data analysis workflow?" https://westurner.github.io/hnlog/#comment-18798244 :
Dask-ML works with {scikit-learn, xgboost, tensorflow, TPOT}. ETL is your responsibility. Loading data into Parquet format affords a lot of flexibility in terms of (non-SQL) datastores, or just efficiently packed files on disk that can be paged into RAM as needed. http://ml.dask.org/examples/scale-scikit-learn.html
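The linked example's pattern: scikit-learn parallelizes via joblib, and the `dask` joblib backend fans those tasks out to a Dask cluster (for the Parquet side, `dask.dataframe.read_parquet` / `to_parquet` handle the lazy, chunked I/O):

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # local workers; pass a scheduler address to scale out

X, y = load_digits(return_X_y=True)
search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "poly"]},
    cv=3,
)

# All of the cross-validation fits are dispatched to the Dask workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)
print(search.best_params_)
```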
A few maybe-useful resources to share. Interesting paper!
CSVW: CSV on the Web https://wrdrd.github.io/docs/consulting/knowledge-engineering#csvw
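A CSVW metadata document is a JSON-LD sidecar (conventionally `<file>.csv-metadata.json`) describing a CSV's columns; here's a minimal example for a hypothetical cities.csv, written as a Python dict:

```python
import json

# Hypothetical cities.csv with "city" and "population" columns:
# CSVW maps each column to a datatype and (optionally) a property URI.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "cities.csv",
    "tableSchema": {
        "columns": [
            {"name": "city", "titles": "City", "datatype": "string",
             "propertyUrl": "http://schema.org/name"},
            {"name": "population", "titles": "Population", "datatype": "integer",
             "propertyUrl": "http://dbpedia.org/ontology/populationTotal"},
        ],
        "primaryKey": "city",
    },
}

with open("cities.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```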
"#LinkedReproducibility" https://wrdrd.github.io/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows
https://twitter.com/westurner/status/1032432593084534784