cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0
623 stars 169 forks

Running out of HDD #43

Closed: raviolli closed this issue 6 years ago

raviolli commented 6 years ago

Hi, I am running out of HDD space when I use the trainer. The command from the Spark UI is:

```
javaToPython at NativeMethodAccessorImpl.java:0
+details RDD:
*Project [features_1hot#55, labels_1hot#56]
+- Scan ExistingRDD[label_index#53L,features#54,features_1hot#55,labels_1hot#56]
```

```
org.apache.spark.sql.Dataset.javaToPython(Dataset.scala:2794)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:280)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
```

It runs until all the nodes run out of HDD space. What's going on here? Thanks,

P.S. This happened when I set EPOCH to 5000.

raviolli commented 6 years ago

I believe it happens because the large epoch count creates a LARGE dataset (via unionAll).

There must be a better way to handle epochs.
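To illustrate the blow-up being described: if each epoch is implemented by appending another full copy of the dataset (analogous to `DataFrame.unionAll`), storage grows linearly with the epoch count. This is a hypothetical pure-Python sketch, not dist-keras code:

```python
# Hypothetical sketch (not dist-keras code): emulating "epochs via unionAll".
# Each epoch appends a full copy of the dataset, so the materialized data
# grows linearly with the epoch count -- at 5000 epochs the union is
# 5000x the size of the input, which is what fills up the disks.

def epochs_by_union(dataset, num_epochs):
    """Materialize the dataset num_epochs times (the costly approach)."""
    union = []
    for _ in range(num_epochs):
        union.extend(dataset)  # analogous to DataFrame.unionAll(dataset)
    return union

data = list(range(1000))
expanded = epochs_by_union(data, 5000)
print(len(expanded))  # 5,000,000 rows materialized from a 1,000-row input
```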

JoeriHermans commented 6 years ago

Hi,

In the current version (if you pull the most recent commit from master), epoch handling is done by copying the Spark iterators, so this issue is resolved. However, as noted in Issue #35, this approach has other problems as well.
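The iterator-copying idea can be sketched in pure Python with `itertools.tee`: a worker replays the same partition for several epochs without duplicating the data on disk. This is a hedged illustration of the concept, not the actual dist-keras implementation, and `train_partition` is a hypothetical name:

```python
# Hypothetical sketch (not the actual dist-keras code): replaying a
# partition for several epochs by copying the iterator, roughly what a
# worker inside mapPartitions could do instead of unioning the dataset.
import itertools

def train_partition(partition_iter, num_epochs):
    """Consume the same partition num_epochs times via iterator copies."""
    seen = 0
    epochs = itertools.tee(partition_iter, num_epochs)
    for epoch in epochs:
        for row in epoch:
            seen += 1  # stand-in for one gradient step on `row`
    return seen

print(train_partition(iter(range(100)), 5))  # 500 rows seen, stored once on disk
```

Note that `itertools.tee` still buffers rows in memory while the copies lag behind, so this trades the disk-space problem for other costs, in line with the caveat about Issue #35.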

Joeri

raviolli commented 6 years ago

I think I installed via pip. Thanks.

JoeriHermans commented 6 years ago

Will update the pip repo, thanks for reminding me.

Joeri