deeplearning4j / deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning...
http://deeplearning4j.konduit.ai
Apache License 2.0

OutOfMemoryError when fitting models in Thread or ExecutorService #4098

Closed: copeg closed this issue 6 years ago

copeg commented 7 years ago

Issue Description

I am attempting to train models in parallel using threading or an ExecutorService. In doing so, process memory seems to grow continually until an OutOfMemoryError is thrown. Below are the code (adapted from the deeplearning4j regression example), VM arguments, pom, and exception. (Note: for demonstration, this example does not necessarily train models in parallel, but it reproduces the problem on my system nonetheless.)

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.deeplearning4j.datasets.iterator.impl.ListDataSetIterator;
import org.deeplearning4j.nn.api.OptimizationAlgorithm;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.deeplearning4j.optimize.listeners.ScoreIterationListener;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.api.ops.impl.transforms.Sin;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.learning.config.Nesterovs;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class Dl4jThreadingExample {

    //Random number generator seed, for reproducibility
    public static final int seed = 12345;
    //Number of iterations per minibatch
    public static final int iterations = 1;
    //Number of epochs (full passes of the data)
    public static final int nEpochs = 100;
    //How frequently should we plot the network output?
    public static final int plotFrequency = 500;
    //Number of data points
    public static final int nSamples = 1000;
    //Batch size: i.e., each epoch has nSamples/batchSize parameter updates
    public static final int batchSize = 100;
    //Network learning rate
    public static final double learningRate = 0.01;
    public static final Random rng = new Random(seed);
    public static final int numInputs = 1;
    public static final int numOutputs = 1;

    public static final int SYNCHRONOUS = 1;
    public static final int ASYNCHRONOUS = 2;
    public static final int THREAD_POOLED = 3;

    public static int THREADING_TYPE = ASYNCHRONOUS;

    public static final int numberFits = 50;

    public static void main(final String[] args) throws Exception{

        ExecutorService pool = Executors.newFixedThreadPool(1);

        List<Future<?>> futures = new LinkedList<Future<?>>();
        for ( int i = 0; i < numberFits; i++ ){
            final Fitter fitter = new Fitter(i);
            if ( THREADING_TYPE == SYNCHRONOUS ){
                fitter.call();
            }else if ( THREADING_TYPE == ASYNCHRONOUS ){
                Thread t = new Thread(new Runnable(){

                    @Override
                    public void run() {
                        try {
                            fitter.call();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }

                });
                t.setDaemon(true);
                t.start();
                // Wait for the fitter thread to finish. Waiting on the Thread
                // object works because thread termination calls notifyAll(),
                // though t.join() is the conventional way to do this.
                synchronized(t){
                    t.wait();
                }

            }else{
                futures.add(pool.submit(fitter));
            }

        }

        pool.shutdown();

        while ( futures.size() > 0 ){
            Future<?> f = futures.remove(0);
            Object o = f.get();
        }

        Thread.sleep(1000);

    }

    private static class Fitter implements Callable<Object>{

        private int i;
        public Fitter(int i){
            this.i = i;
        }

        @Override
        public Object call() throws Exception {
            //Switch these two options to do different functions with different networks
            final MathFunction fn = new SinXDivXMathFunction();

            //Generate the training data
            final INDArray x = Nd4j.linspace(-10,10,nSamples).reshape(nSamples, 1);
            final DataSetIterator iterator = getTrainingData(x,fn,batchSize,rng);

            final MultiLayerConfiguration conf = getDeepDenseLayerNetworkConfiguration();
            //Create the network
            final MultiLayerNetwork net = new MultiLayerNetwork(conf);
            net.init();
            net.setListeners(new ScoreIterationListener(1));

            //Train the network on the full data set, and evaluate it periodically
            //(note: with nEpochs=100 and plotFrequency=500 the array below has
            //length 0 and the if-branch never fires, which is harmless here)
            final INDArray[] networkPredictions = new INDArray[nEpochs/ plotFrequency];
            for( int i=0; i<nEpochs; i++ ){
                iterator.reset();
                net.fit(iterator);
                if((i+1) % plotFrequency == 0) networkPredictions[i/ plotFrequency] = net.output(x, false);
            }
            //x.cleanup();

            System.out.println("Completed " + i);
            return null;
        }

    }

    /** Returns the network configuration, 2 hidden DenseLayers of size 50.
     */
    private static MultiLayerConfiguration getDeepDenseLayerNetworkConfiguration() {
        final int numHiddenNodes = 50;
        return new NeuralNetConfiguration.Builder()
                .seed(seed)
                .iterations(1)
                .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
                .learningRate(learningRate)
                .weightInit(WeightInit.XAVIER)
                .updater(new Nesterovs(0.9))
                .list()
                .layer(0, new DenseLayer.Builder().nIn(numInputs).nOut(numHiddenNodes)
                        .activation(Activation.TANH).build())
                .layer(1, new DenseLayer.Builder().nIn(numHiddenNodes).nOut(numHiddenNodes)
                        .activation(Activation.TANH).build())
                .layer(2, new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
                        .activation(Activation.IDENTITY)
                        .nIn(numHiddenNodes).nOut(numOutputs).build())
                .pretrain(false).backprop(true).build();
    }

    /** Create a DataSetIterator for training
     * @param x X values
     * @param function Function to evaluate
     * @param batchSize Batch size (number of examples for every call of DataSetIterator.next())
     * @param rng Random number generator (for repeatability)
     */
    private static DataSetIterator getTrainingData(final INDArray x, final MathFunction function, final int batchSize, final Random rng) {
        final INDArray y = function.getFunctionValues(x);
        final DataSet allData = new DataSet(x,y);

        final List<DataSet> list = allData.asList();
        Collections.shuffle(list,rng);
        return new ListDataSetIterator(list,batchSize);
    }

    public static class SinXDivXMathFunction implements MathFunction {

        @Override
        public INDArray getFunctionValues(final INDArray x) {
            return Nd4j.getExecutioner().execAndReturn(new Sin(x.dup())).div(x);
        }

        @Override
        public String getName() {
            return "SinXDivX";
        }
    }

    public static interface MathFunction {

        INDArray getFunctionValues(INDArray x);

        String getName();
    }
}

The following VM arguments are passed to the JVM (low memory values, so that the problem shows up in a reasonable amount of time):

-Xms256m -Xmx512M -Dorg.bytedeco.javacpp.maxbytes=1G -Dorg.bytedeco.javacpp.maxPhysicalBytes=1G

pom.xml

    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>0.9.1</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>0.9.1</version>
    </dependency> 

The code can be run in different modes by changing THREADING_TYPE to one of SYNCHRONOUS, ASYNCHRONOUS, or THREAD_POOLED. The first option behaves fine, while the latter two throw the following exception:

Exception in thread "Thread-21" java.lang.OutOfMemoryError: Cannot allocate new FloatPointer(1000): totalBytes = 76M, physicalBytes = 912M
    at org.bytedeco.javacpp.FloatPointer.<init>(FloatPointer.java:76)
    at org.nd4j.linalg.api.buffer.BaseDataBuffer.<init>(BaseDataBuffer.java:541)
    at org.nd4j.linalg.api.buffer.FloatBuffer.<init>(FloatBuffer.java:61)
    at org.nd4j.linalg.api.buffer.factory.DefaultDataBufferFactory.createFloat(DefaultDataBufferFactory.java:255)
    at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1468)
    at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:260)
    at org.nd4j.linalg.cpu.nativecpu.NDArray.<init>(NDArray.java:122)
    at org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory.createUninitialized(CpuNDArrayFactory.java:267)
    at org.nd4j.linalg.factory.Nd4j.createUninitialized(Nd4j.java:5054)
    at org.nd4j.linalg.api.shape.Shape.toOffsetZeroCopyHelper(Shape.java:247)
    at org.nd4j.linalg.api.shape.Shape.toOffsetZeroCopy(Shape.java:204)
    at org.nd4j.linalg.api.ndarray.BaseNDArray.dup(BaseNDArray.java:1702)
    at com.codexis.mosaic.deeplearning.DLExecutors$SinXDivXMathFunction.getFunctionValues(DLExecutors.java:189)
    at DLExecutors.getTrainingData(DLExecutors.java:177)
    at DLExecutors.access$0(DLExecutors.java:176)
    at DLExecutors$Fitter.call(DLExecutors.java:121)
    at DLExecutors$1.run(DLExecutors.java:76)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes = 912M > maxPhysicalBytes = 911M
    at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:576)
    at org.bytedeco.javacpp.Pointer.init(Pointer.java:121)
    at org.bytedeco.javacpp.FloatPointer.allocateArray(Native Method)
    at org.bytedeco.javacpp.FloatPointer.<init>(FloatPointer.java:68)
    ... 17 more

Version Information

deeplearning4j and Nd4j 0.9.1; Java 1.7; CPU only; Windows 7, 64-bit; 40 GB RAM; Intel Xeon CPU E5-2620 @ 2.0GHz (x2)

Visual VM for SYNCHRONOUS: (screenshot)

Visual VM for ASYNCHRONOUS: (screenshot)

Process (off-heap) memory just seems to grow continually when models are fit in a separate Thread.

raver119 commented 7 years ago

What happens if you enable workspaces?

copeg commented 7 years ago

Adding either .trainingWorkspaceMode(WorkspaceMode.SINGLE) or .trainingWorkspaceMode(WorkspaceMode.SEPARATE) results in the same behavior.
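(As an aside, for anyone reproducing this: a minimal sketch of where that call sits in the builder chain of the example above, assuming the 0.9.1 builder exposes trainingWorkspaceMode; WorkspaceMode is org.deeplearning4j.nn.conf.WorkspaceMode.)

import org.deeplearning4j.nn.conf.WorkspaceMode;

// Sketch: same builder as getDeepDenseLayerNetworkConfiguration() above,
// with a training workspace mode enabled; everything else unchanged.
final MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(seed)
        .trainingWorkspaceMode(WorkspaceMode.SINGLE) // or WorkspaceMode.SEPARATE
        .iterations(1)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .learningRate(learningRate)
        .weightInit(WeightInit.XAVIER)
        .updater(new Nesterovs(0.9))
        .list()
        .layer(0, new DenseLayer.Builder().nIn(numInputs).nOut(50)
                .activation(Activation.TANH).build())
        .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
                .activation(Activation.IDENTITY).nIn(50).nOut(numOutputs).build())
        .pretrain(false).backprop(true).build();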

copeg commented 7 years ago

At the moment, OOM errors go away if I manually set the OMP_NUM_THREADS environment variable to a low value (e.g. 1). Larger values, or not setting it at all, result in application memory usage that climbs until the OOM error is thrown. There is an added benefit: setting it to 1 gives a several-fold improvement in efficiency relative to not setting it at all. The downside is that this is not an ideal fix for deployment in some of our settings (a client-side application, where this environment variable would need to be set on each client; no way to do so via Java AFAIK).
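(One possible workaround for the client-side case: a running JVM cannot change its own environment, but a small launcher can re-exec the application in a child JVM with the variable set. A rough sketch, with the launcher class name hypothetical:)

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: if OMP_NUM_THREADS is not set, re-launch the app in a child JVM
// with the variable set, so the native libraries in the child pick it up.
public class OmpRelauncher {
    public static void main(String[] args) throws Exception {
        if (System.getenv("OMP_NUM_THREADS") != null) {
            Dl4jThreadingExample.main(args); // already set: run directly
            return;
        }
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<String>();
        cmd.add(javaBin);
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add(Dl4jThreadingExample.class.getName());
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("OMP_NUM_THREADS", "1"); // visible to the child's native libs
        pb.inheritIO();
        System.exit(pb.start().waitFor());
    }
}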

raver119 commented 7 years ago

That makes no sense.

What's your OS and OpenMP implementation?

copeg commented 7 years ago

OS and specs are in the original post, copied here: Windows 7, 64-bit, 40 GB RAM, Intel Xeon CPU E5-2620 @ 2.0GHz (x2). As for OpenMP... forgive me, I'm a bit in the dark here. I'm just adding the deeplearning4j and nd4j dependencies to my pom and running through Eclipse.

cinqs commented 7 years ago

Hi, the same happens here with my app.

I have a binary-output classification problem with 36 features.

The training source files have 250,000 lines of double values.

public static MultiLayerConfiguration getClassificationMLPConf(double learningRate, int inputNum, int outputNum, int[] hiddenLayers) {
        ListBuilder lb = new NeuralNetConfiguration.Builder().seed(123)
                .learningRate(learningRate)
                .iterations(1)
                .updater(Updater.NESTEROVS)
                .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
                .list();

        lb.layer(0, new DenseLayer.Builder()
                .nIn(inputNum)
                .nOut(hiddenLayers[0])
                .activation(Activation.RELU)
                .weightInit(WeightInit.XAVIER)
                .build());

        for(int i = 0; i < hiddenLayers.length - 1; i++) {
            int nIn = hiddenLayers[i];
            int nOut = hiddenLayers[i + 1];
            int layerInt = i + 1;

            lb.layer(layerInt, new DenseLayer.Builder()
                    .nIn(nIn)
                    .nOut(nOut)
                    .activation(Activation.RELU)
                    .weightInit(WeightInit.XAVIER)
                    .build());
        }

        lb.layer(hiddenLayers.length, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nIn(hiddenLayers[hiddenLayers.length - 1])
                .nOut(outputNum)
                .activation(Activation.SOFTMAX)
                .weightInit(WeightInit.XAVIER)
                .build());

        return lb.backprop(true).pretrain(false).build();
    }

Invoked as Dl4jUtils.getClassificationMLPConf(0.01, 36, 2, new int[] {75, 75, 75}).

When the fitting process runs in a loop 600 times in a newly created thread, memory consumption keeps increasing until out of memory.

My machine configuration: (screenshot)

cinqs commented 7 years ago

Hi, I also confirm that when running the fitting process in the main thread, memory consumption is quite normal; it took about 3 GB in my situation.

Here below is my POM:

    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>0.9.1</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>0.9.1</version>
    </dependency>
raver119 commented 7 years ago

So, in other words: everything works fine in a single thread, and goes mad in parallel fitting threads IF OpenMP is used.

I'll try to profile it.

cinqs commented 7 years ago

Hi, I just tried using these two configurations:

.inferenceWorkspaceMode(WorkspaceMode.
.trainingWorkspaceMode(WorkspaceMode.SEPARATE)

but this doesn't solve my OOM problem.

raver119 commented 7 years ago

Can you confirm my statement in the previous comment?

I wonder if "Java single thread + multiple OpenMP threads" works fine, while "Java multiple threads + multiple OpenMP threads" causes OOM.

cinqs commented 7 years ago

I don't know much about OpenMP or the implementation behind DL4J.

In my situation:

  1. WORKS FINE: running the fit process in the Java main thread ("main" is also the thread name):
for(int i = 0; i < epocNum; i++) {
    theNet.fit(dsi);
}

i.e., running this code in the main thread.

  2. OOM PROBLEM: running the fit process in a newly created thread, i.e., running the loop above inside the thread created below:

Thread t = new Thread(new Executor());
t.setName("C-Executor-" + getAndUpdateExecutorCount());
t.start();

private class Executor implements Runnable {

        private String folderPath;
        private int classIndex;
        private int classCount;
        private MultiLayerNetwork theNet;

        Executor() {
            folderPath = resources.get("folderPath").toString();
            classIndex = Integer.parseInt(resources.get("classIndex").toString());
            classCount = Integer.parseInt(resources.get("classCount").toString());
            theNet = new MultiLayerNetwork(
                    Dl4jUtils.getClassificationMLPConf(learningRate, inputNum, outputNum, hiddenLayers));
            theNet.init();
            theNet.setListeners(new ScoreIterationListener(1000));
        }

        @Override
        public void run() {
            Random rand = new Random(123L);
            int trainFileIndex = rand.nextInt(3);
            DataSetIterator dsi = null;
            try {
                dsi = Dl4jUtils.getClassificationDataSetIter(
                        folderPath + trainFileIndex + ".csv", 
                        500, classIndex, classCount);
            } catch (IOException | InterruptedException e) {
                logger.warn("", e);
            }

            for(int i = 0; i < epocNum; i++) {
                theNet.fit(dsi);
            }

            try {
                ModelSerializer.writeModel(theNet, "....\\theNet.net", true);
            } catch (IOException e1) {
                e1.printStackTrace();
            }

            for(int i = 0; i < 2; i++) { // note: visits files 0 and 1 only, though trainFileIndex may be 2
                if(i == trainFileIndex)
                    continue;

                try {
                    dsi = Dl4jUtils.getClassificationDataSetIter(
                            folderPath + i + ".csv", 1000, classIndex, classCount);
                } catch (IOException | InterruptedException e) {
                    logger.warn("", e);
                }

                analyzeAndSaveResult(dsi, i);
            }
        }

        private void analyzeAndSaveResult(DataSetIterator dsi, int i) {
            Evaluation eval = new Evaluation(outputNum);
            while(dsi.hasNext()) {
                DataSet ds = dsi.next();

                INDArray features = ds.getFeatures();
                INDArray labels = ds.getLabels();
                INDArray pred = theNet.output(features, false);

                eval.eval(labels, pred);
            }

            logger.info(eval.stats());

            try(BufferedWriter bw = new BufferedWriter(new FileWriter("....." + i + ".csv"))) {
                bw.write(eval.stats());
            } catch (IOException e) {
                logger.warn("", e);
            }
        }
    }

I didn't set up OpenMP myself; I assume it's part of the implementation behind DL4J, isn't it?

raver119 commented 7 years ago

You modify OpenMP behavior with the OMP_NUM_THREADS environment variable. If you don't set it, the app uses the default settings and reports a number of threads equal to your number of cores at startup. If you set this variable, that value will be used instead.
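(A quick plain-JDK sketch to see what that default would be on a given machine; class name hypothetical:)

public class OmpDefaultCheck {
    public static void main(String[] args) {
        // Core count OpenMP would default to, and the current override (if any)
        int cores = Runtime.getRuntime().availableProcessors();
        String omp = System.getenv("OMP_NUM_THREADS");
        System.out.println("cores=" + cores + ", OMP_NUM_THREADS="
                + (omp == null ? "(unset, defaults to core count)" : omp));
    }
}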

cinqs commented 7 years ago

Hi, I just tried setting the OS-wide environment variable OMP_NUM_THREADS=1, and so far so good.

But is that the only way to solve it? I mean, touching an OS-wide environment variable?

agibsonccc commented 7 years ago

@cinqs you can set it for your app too. No environment variable has to be permanently set; that's Unix 101. Depending on whether you're running in IntelliJ or a server app, just run things within your environment with that config. Please remove the global setting if you can; it's silly to set an env variable for a specific app at a global level.

OpenMP (please do NOT ignore what it is and what it does; that's actually your problem) has some issues with multithreading because it is itself multithreaded for loops internally. What you're running into here is the for loops in OpenMP clashing with the behavior of the ones in Java, since the ones in Java are actual threads. Setting the number of threads to 1 is very common when running Spark applications and other kinds of multithreading.
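(It may also be possible to cap the OpenMP thread count programmatically, which would sidestep the environment variable entirely. A hedged sketch, assuming the ND4J native bindings in this version expose setOmpNumThreads on NativeOps; verify against your nd4j-native before relying on it:)

import org.nd4j.nativeblas.NativeOpsHolder;

public class OmpThreadCap {
    public static void main(String[] args) {
        // Assumption: NativeOps exposes setOmpNumThreads(int) in this version.
        // Call it once, before any training work, so OpenMP loops inside
        // native ops run single-threaded.
        NativeOpsHolder.getInstance().getDeviceNativeOps().setOmpNumThreads(1);
        // ... create and fit networks as usual ...
    }
}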

cinqs commented 7 years ago

@agibsonccc Hi, yes, absolutely I will remove the OS-wide env variable, thanks for the reminder. I will try to set it inside the app.

Besides, I read some of OpenMP's docs, but it's too much for me to digest quickly. Do you have any documents in your website's doc centre that explain your use of OpenMP, i.e. docs specific to DL4J? It would also be nice if you could recommend some docs to narrow down the scope of OpenMP. Thanks.

Just a reminder: you can remove the bug tag, since this is absolutely not a bug... maybe something called an enhancement...

cinqs commented 7 years ago

Ok, I found it.

https://deeplearning4j.org/native

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.