cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0

How to train keras features on non-redundant/infinite set of labels #68

Open anishsharma opened 6 years ago

anishsharma commented 6 years ago

I am developing a neural network to classify time series data. I know that an LSTM would be the right approach for time series, but in dist-keras the data has to be in Spark DataFrame format before it is passed to a trainer.

I am following this LSTM example, and the task is to port it to dist-keras. The timestep is 50, which means the model takes values 0-49 and predicts value 50, and so on. In the example, the data is pre-processed with numpy before being fed to Keras. Since dist-keras requires the data to be in a Spark DataFrame, I have to take a different approach, which is as follows:

I created the DataFrame directly:

    X_train = train[:, :]
    y_train = train[:, -1]
    raw_dataset_train = sc.createDataFrame(X_train.tolist())

The code above creates a DataFrame with 50 columns (the timestep is 50), the last of which is named _50.
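For reference, the windowing that produces an array like train can be sketched in plain Python. This is a minimal sketch, assuming each row holds `timestep` consecutive inputs followed by the next value as the target, as described above; the helper name make_windows and the toy series are hypothetical and not taken from the example or from dist-keras.

```python
def make_windows(series, timestep):
    """Split a 1-D sequence into rows of `timestep` inputs plus one target."""
    rows = []
    # The last usable start index leaves room for `timestep` inputs and one target.
    for start in range(len(series) - timestep):
        window = series[start:start + timestep + 1]
        rows.append(list(window))
    return rows


# Toy series and a small timestep so the result is easy to inspect.
series = [float(v) for v in range(10)]
rows = make_windows(series, timestep=3)

# The last value of each row is the target; the rest are the inputs.
features = [row[:-1] for row in rows]
labels = [row[-1] for row in rows]
```

Each element of rows is a plain list, which is the same shape of input that X_train.tolist() hands to createDataFrame.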

Then I remove the _50 column, which is the label in our case, from the list of feature columns and apply the VectorAssembler to the remaining ones:

    features = raw_dataset_train.columns
    features.remove('_50')
    vector_assembler = VectorAssembler(inputCols=features, outputCol="features")
    dataset_train = vector_assembler.transform(raw_dataset_train)

Now each row of the DataFrame has two columns of interest: the first contains the features and the second contains the label (the _50 column, which I want to train on and later predict). As I see it, this becomes a classification problem. My questions are below:

If my approach is right, how do I define the output labels for my data? There is no finite set of values for the output column; the number of distinct values could be as large as the number of rows in the DataFrame.
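To make the concern concrete, here is a small check in plain Python that compares the number of distinct label values with the number of rows. The helper name label_cardinality and the sample values are made up for illustration. When the ratio is close to 1, as tends to happen with real-valued targets, a one-hot class encoding would need nearly one class per row.

```python
def label_cardinality(labels):
    """Return (distinct label count, row count, distinct/rows ratio)."""
    if not labels:
        raise ValueError("labels must not be empty")
    distinct = len(set(labels))
    total = len(labels)
    return distinct, total, distinct / total


# Hypothetical real-valued targets, standing in for the _50 column.
labels = [0.12, 0.57, 0.33, 0.91, 0.57, 0.08]
distinct, total, ratio = label_cardinality(labels)
```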

Do I still need LSTM layers in my model? I ask because I have processed the data in a non-LSTM way. (At least that is what I think; I might be wrong.)
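As context for this question: Keras recurrent layers expect three-dimensional input of the form (samples, timesteps, features), so a flat feature vector would not feed an LSTM layer directly. The following is a minimal sketch in plain Python, independent of dist-keras, of what viewing each flat row as a sequence of single-feature timesteps looks like. The helper name to_lstm_shape and the toy rows are hypothetical.

```python
def to_lstm_shape(flat_rows):
    """Turn rows of N scalars into N timesteps with one feature each."""
    return [[[value] for value in row] for row in flat_rows]


# Two toy rows of three values each, standing in for the assembled features.
flat_rows = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]
sequences = to_lstm_shape(flat_rows)

# (samples, timesteps, features)
shape = (len(sequences), len(sequences[0]), len(sequences[0][0]))
```

How this reshaping would be applied to a Spark DataFrame column before training is not shown here and would depend on the transformers dist-keras provides.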

Please advise, and let me know if you need more clarification or information.