Takes in a .csv file and formats it to be ready for machine learning using scikit-learn or machineJS.
data-formatter is designed to take care of the chores of machine learning to let you focus on the fun stuff!
Each column must have a label. Your options are few; this is designed to be easy!
Column label options:
For more detailed info, see below.
The first row holds information describing each column. Specifically, it must specify:
All other columns must be labeled as holding either Categorical or Continuous data:
This comes pre-bundled with machineJS. To use it for other projects:
To include as a dependency for a specific repo:
npm install data-formatter
To use from the command line anywhere in your system:
npm install -g data-formatter
scikit-learn, scikit-neuralnetwork and brainjs.
data-formatter formats both your training and your testing data in one fell swoop, so you never need to worry about re-formatting your data, or having incompatible features, once you start making predictions.
data-formatter will go through and replace missing values for you! Missing values in Continuous data columns will get replaced by the median value. Missing data points in Categorical data columns will get replaced by the mode of that column (most frequently occurring value).
data-formatter will remove all categorical values that are present for only one observation, which makes them useless for making predictions. This speeds up training time and fights against overfitting.
Does some of that make your head spin? Have no idea what one (or more) of those bullet points means? No worries, that's the entire point of letting a library do this work for you!
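The cleanup steps described above (median and mode imputation, and spotting categorical values that appear only once) can be sketched in plain JavaScript. This is a minimal illustration under stated assumptions, not data-formatter's actual implementation: the function names, the convention that a missing cell is null, undefined, or an empty string, and the sample columns are all hypothetical.

```javascript
// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Count how many times each value occurs in a column.
function countValues(values) {
  const counts = new Map();
  for (const v of values) counts.set(v, (counts.get(v) || 0) + 1);
  return counts;
}

// Most frequent value in a list (ties go to the value seen first).
function mode(values) {
  let best = null;
  let bestCount = 0;
  for (const [value, count] of countValues(values)) {
    if (count > bestCount) {
      best = value;
      bestCount = count;
    }
  }
  return best;
}

// Assumed convention: a cell is missing if it is null, undefined, or ''.
const isMissing = (v) => v === null || v === undefined || v === '';

// Fill missing cells with the median (Continuous) or the mode (Categorical).
function imputeColumn(column, type) {
  const present = column.filter((v) => !isMissing(v));
  if (type === 'Continuous') {
    const fill = median(present.map(Number));
    return column.map((v) => (isMissing(v) ? fill : Number(v)));
  }
  const fill = mode(present);
  return column.map((v) => (isMissing(v) ? fill : v));
}

// Categorical values that occur in exactly one row.
function singletonValues(column) {
  return [...countValues(column)]
    .filter(([, count]) => count === 1)
    .map(([value]) => value);
}

const ages = imputeColumn([22, '', 35, 58, null], 'Continuous');
const cities = imputeColumn(['Oslo', 'Lima', '', 'Oslo', 'Pune'], 'Categorical');
const rareCities = singletonValues(cities);

console.log(ages);
console.log(cities);
console.log(rareCities);
```

In this sketch the one-off values are only reported. How they are actually dropped or re-encoded is up to the library.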
Did any of the above get your heart racing and make you want to dive in and customize for your own project or Kaggle competition? Awesome, follow along with mainPythonProcess.py and customize to your heart's content, while still having in place a structure to automate the process for you!
The formatted data will be broken out into a number of different files, to be compatible with scikit-learn's API:
X_train_: All of the X (non-output-column) features in the training set.
X_test_: All of the X features in the testing (predicting) set.
y_train_: All of the output columns for the training set. By definition, the testing/predicting data set has no output columns (they have to be predicted!).
id_train_: The ID column for the rows in the training data set. This prevents the ID column from being included as a feature when training a machine learning algorithm.
id_test_: The ID column for the rows in the testing data set.
X_train_nn_: All of the X features in the training data set, min-max normalized to have only values between -1 and 1.
X_test_nn_: All of the X features in the testing data set, min-max normalized to have only values between -1 and 1.
Again, this is baked into machineJS, but if you're using it in a different project:
var df = require('data-formatter');
Invoke df with an object that has trainingData and testingData properties, and an optional callback:
df({
  trainingData: 'full/absolute/path/to/training/data.csv',
  testingData: 'full/absolute/path/to/testing/data.csv'
}, callbackFunc);
The optional callback will be called once all data formatting has completed.
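Since the trainingData and testingData properties are expected to be absolute paths, it can help to resolve relative paths before building the args object. The sketch below uses only Node's built-in path module. The helper name buildArgs, the sample file names, and the callback parameter name results are hypothetical and not part of data-formatter.

```javascript
const path = require('path');

// Hypothetical helper: turn relative (or already absolute) paths into the
// absolute paths that the args object expects.
function buildArgs(trainingPath, testingPath, baseDir = process.cwd()) {
  return {
    trainingData: path.resolve(baseDir, trainingPath),
    testingData: path.resolve(baseDir, testingPath)
  };
}

const args = buildArgs('data/train.csv', 'data/test.csv');
console.log(args);

// With data-formatter installed, the object could then be passed along:
// var df = require('data-formatter');
// df(args, function (results) {
//   console.log(results); // object holding the formatted file paths
// });
```

path.resolve leaves an already absolute path unchanged, so the helper is safe to call on either kind of input.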
data-formatter relative/path/to/training/data.csv relative/path/to/testing/data.csv
Make sure that you have used the -g flag when installing using npm if you want to use data-formatter from the command line.
An args object with the following properties:
trainingData
A full, absolute path to a .csv file. See above for more info on adding an additional dataDescription row to the .csv file itself above the header row.
testingData
The testing data. This file is assumed to only have a header row, not a dataDescription row. The columns must be in the same order as they are for the trainingData file. This is almost always the case anyway.
joinFileName
[OPTIONAL] A full, absolute path to a .csv file that you would like to join in with the testing and training datasets. This file must have both a dataDescription and a header row. By default, it will be joined on any value in the headerRow that is shared between the training/testing datasets and the join file.
outputFolder
[OPTIONAL] If included, all formatted files will be written to this folder. This folder will be created if it does not already exist.
DEFAULT: If a value is not passed in, this will default to creating a folder called data-formatterResults in whichever directory this library is invoked in. This is designed to make files easy to find if, say, you invoke this library from a directory where you are already working on a machine learning project.
callback
[OPTIONAL] After the args object, you may pass in a callback function that will be invoked once formatting is done. If provided, the callback function will be invoked with an object containing the file paths to all of the formatted data files created.
keepAllFeatures
[OPTIONAL] If you do not want to perform any feature selection, and want to keep all the features (both the ones in the original training data and the ones created by data-formatter), pass in true for this flag.
allFeatureCombinations
[OPTIONAL] This is still a beta feature. If you want to try adding all possible combinations of continuous features together, set this flag to true. Since it creates all possible combinations of all the continuous features, this can rapidly create a memory problem, and should only be used on small datasets, or if you have a ton of RAM.
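To get a feel for why allFeatureCombinations can exhaust memory, the sketch below adds one summed column per pair of continuous features and counts how many such columns appear as the feature count grows. It assumes that "combinations" means pairwise sums, which may not match what data-formatter actually does. The function names and the sample row are hypothetical.

```javascript
// All unordered pairs of feature names.
function pairs(names) {
  const out = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      out.push([names[i], names[j]]);
    }
  }
  return out;
}

// Return a copy of the row with one extra column per pair of continuous
// features, holding the sum of the two values (assumed interpretation).
function addPairwiseSums(row, continuousNames) {
  const extended = { ...row };
  for (const [a, b] of pairs(continuousNames)) {
    extended[`${a}+${b}`] = row[a] + row[b];
  }
  return extended;
}

// Number of extra columns created for n continuous features: n * (n - 1) / 2.
const pairCount = (n) => (n * (n - 1)) / 2;

const sampleRow = { age: 30, income: 50, height: 170 };
const extendedRow = addPairwiseSums(sampleRow, ['age', 'income', 'height']);

console.log(extendedRow);
console.log(pairCount(10), pairCount(100), pairCount(1000));
```

Even in this pairwise-only reading, the number of added columns grows quadratically with the number of continuous features, and every row has to carry all of them.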
As of the 1.2 release, data-formatter can be invoked right from the command line.
npm install -g data-formatter
Note the "-g" flag directing npm to install the module globally. This makes it available from the command line in any directory on your system.
data-formatter path/to/training/data.csv path/to/testing/data.csv
The formatted data files will be written into whichever directory you invoke data-formatter from.
If you find this library useful, you might want to check out machineJS, which helps reduce the drudge work of other parts of the machine learning process!
There are few things that make me as happy as reading through Pull Requests over a morning espresso :)
I've had a great time building this out so far. If you find it useful too, let me know by starring it!