westurner opened this issue 5 years ago
"ColNet: Embedding the Semantics of Web Tables for Column Type Prediction" https://arxiv.org/pdf/1811.01304.pdf :
Synthetic Columns: In training, ColNet automatically extracts labeled samples from the KB. A training sample s := (e, c) is composed of a synthetic column e and a class c ∈ C, while a synthetic column is constructed by concatenating a specific number of entities.
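For intuition, here's a minimal sketch of that sampling step in Python (the toy KB and function names are hypothetical, not ColNet's actual code):

```python
import random

# Hypothetical toy KB: class -> entities known to belong to that class.
kb = {
    "dbo:City": ["London", "Paris", "Tokyo", "Oslo", "Lagos"],
    "dbo:River": ["Nile", "Danube", "Amazon", "Mekong"],
}

def synthetic_column(entities, size):
    """Build one synthetic column by concatenating `size` entities."""
    return random.sample(entities, size)

def training_samples(kb, size=3, n_per_class=2):
    """Yield (synthetic_column, class) pairs, i.e. s := (e, c)."""
    for cls, entities in kb.items():
        for _ in range(n_per_class):
            yield synthetic_column(entities, size), cls

for column, cls in training_samples(kb):
    print(cls, column)
```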
https://westurner.github.io/hnlog/#comment-18957269 :
Featuretools https://github.com/Featuretools/featuretools
Featuretools is a python library for automated feature engineering. [using DFS: Deep Feature Synthesis]
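For example, a minimal DFS run over the demo EntitySet that ships with Featuretools (note: the keyword naming the target table is `target_dataframe_name` in Featuretools 1.x; older releases called it `target_entity`):

```python
import featuretools as ft

# Bundled demo data: customers -> sessions -> transactions.
es = ft.demo.load_mock_customer(return_entityset=True)

# Deep Feature Synthesis stacks aggregation/transform primitives across
# the table relationships to generate features per customer.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",  # `target_entity` in pre-1.0 releases
    max_depth=2,
)
print(feature_matrix.head())
```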
auto-sklearn does feature selection (with e.g. PCA) in a "preprocessing" step, as well as "One-Hot encoding of categorical features, imputation of missing values and the normalization of features or samples" https://automl.github.io/auto-sklearn/master/manual.html#turning-off-preprocessing
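A minimal auto-sklearn fit, sketched under the assumption of a recent API (the linked manual section shows disabling that preprocessing search; the keyword was `include_preprocessors=['no_preprocessing']` in older releases and later moved into `include={...}`):

```python
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# auto-sklearn searches over preprocessors (PCA, one-hot encoding,
# imputation, scaling, ...) as part of each candidate pipeline;
# `no_preprocessing` turns that search off, per the linked manual section.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    include={"feature_preprocessor": ["no_preprocessing"]},
)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))
```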
auto_ml uses "Deep Learning [with Keras and TensorFlow] to learn features for us, and Gradient Boosting [with XGBoost] to turn those features into accurate predictions" https://auto-ml.readthedocs.io/en/latest/deep_learning.html#feature-learning
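Roughly, per auto_ml's docs, feature learning is switched on at train time with `feature_learning=True` plus a separate `fl_data` frame; this is a from-memory sketch with a toy DataFrame, so verify against the project's README:

```python
import pandas as pd
from auto_ml import Predictor

df_train = pd.DataFrame({
    "age": [22, 38, 26, 35, 28, 54],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "survived": [0, 1, 1, 1, 0, 1],
})

# Mark the label column; unlisted columns default to numeric features.
ml_predictor = Predictor(
    type_of_estimator="classifier",
    column_descriptions={"survived": "output"},
)

# A Keras/TensorFlow net learns features on fl_data, then XGBoost
# gradient boosting is trained on top of them. In practice fl_data
# should be a held-out frame; reusing df_train here keeps the sketch short.
ml_predictor.train(df_train, feature_learning=True, fl_data=df_train.copy())
```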
... "Ask HN: Data analysis workflow?" https://westurner.github.io/hnlog/#comment-18798244 :
Dask-ML works with {scikit-learn, xgboost, tensorflow, TPOT}. ETL is your responsibility. Loading data into Parquet format affords a lot of flexibility in terms of (non-SQL) datastores, or just efficiently packed files on disk that can be paged into RAM as needed. http://ml.dask.org/examples/scale-scikit-learn.html
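The linked example's pattern: scikit-learn parallelizes via joblib, and the `dask` joblib backend fans those tasks out to a Dask cluster (for the Parquet side, `dask.dataframe.read_parquet` / `to_parquet` handle the lazy, chunked I/O):

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # local workers; pass a scheduler address to scale out

X, y = load_digits(return_X_y=True)
search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "poly"]},
    cv=3,
)

# All of the cross-validation fits are dispatched to the Dask workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)
print(search.best_params_)
```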
A few maybe-useful resources to share. Interesting paper!
CSVW: CSV on the Web https://wrdrd.github.io/docs/consulting/knowledge-engineering#csvw
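A CSVW metadata document is a JSON-LD sidecar (conventionally `<file>.csv-metadata.json`) describing a CSV's columns; here's a minimal example for a hypothetical cities.csv, written as a Python dict:

```python
import json

# Hypothetical cities.csv with "city" and "population" columns:
# CSVW maps each column to a datatype and (optionally) a property URI.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "cities.csv",
    "tableSchema": {
        "columns": [
            {"name": "city", "titles": "City", "datatype": "string",
             "propertyUrl": "http://schema.org/name"},
            {"name": "population", "titles": "Population", "datatype": "integer",
             "propertyUrl": "http://dbpedia.org/ontology/populationTotal"},
        ],
        "primaryKey": "city",
    },
}

with open("cities.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```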
"#LinkedReproducibility" https://wrdrd.github.io/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows
https://twitter.com/westurner/status/1032432593084534784