amidst / toolbox

A Java Toolbox for Scalable Probabilistic Machine Learning
http://www.amidsttoolbox.com
Apache License 2.0
119 stars 35 forks

Memory usage is too high #86

Closed gowthamnatarajan closed 6 years ago

gowthamnatarajan commented 6 years ago

The memory usage seems very high. My dataset has 100,000 sequences with an average length of 2 to 3 TIME_IDs per sequence, though some sequences have over 150 TIME_IDs. The ARFF file has 1.5 million rows in total, with 13 multinomial variables, 2 continuous variables, and 45 parent-child relationships. Learning takes over 60 GB of RAM. Any reason why? The final BN model file is less than 700 MB on disk, so why does learning take so much memory? And it keeps taking more and more memory as the dataset size increases (the structure of the DAG remains the same).

Update: Does using parallel mode use more memory? Is it proportional to the number of cores in the system?

andresmasegosa commented 6 years ago

Please answer the following questions, then I can give you more information.

a) What is the maximum number of states among the 13 multinomial variables?

b) What is the cardinality of the parent sets? Remember that the size of the network grows exponentially with the cardinality of the parent set (see the sketch after these questions).

c) You were using ParallelML, weren't you? How many cores does your computer have?

d) What mini-batch size are you using?
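
To illustrate point b) above: a multinomial node's conditional probability table has one entry per joint configuration of the node and its parents, so its size grows exponentially with the number of discrete parents. A back-of-the-envelope sketch in plain Java (the state counts below are made-up examples, not figures from this issue):

    // Size of one node's CPT: own state count times the product of the
    // parents' state counts (all counts here are hypothetical).
    int ownStates = 41;
    int[] parentStates = {41, 41};   // two discrete parents, 41 states each
    long entries = ownStates;
    for (int states : parentStates) {
        entries *= states;
    }
    // 41 * 41 * 41 = 68,921 entries; each additional 41-state parent
    // multiplies the table size by 41 again.
    System.out.println("CPT entries: " + entries);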

gowthamnatarajan commented 6 years ago

a) 41

b) The biggest table is 41*41.

c) Yes, parallel mode, on a 20-core dual-socket CPU. Is there any way I can change this to just 10 threads instead of 20?

d) 1000. Does it take more memory with a bigger batch size?
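
On the thread-count question: assuming AMIDST's parallel mode runs on Java 8 parallel streams over the common ForkJoinPool (the Streams remark at the end of this thread suggests it does), a standard JVM-level way to cap the worker count is to lower the common pool's parallelism before any parallel work starts; a minimal sketch:

    // Cap the common ForkJoinPool (used by Java 8 parallel streams) at 10
    // worker threads; this must run before the pool is first used.
    System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "10");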

andresmasegosa commented 6 years ago

It is strange. Try running this piece of code: it generates a temporal naive Bayes classifier whose tables are 2x41x41 and samples 10^5 sequences of length 3. It runs within 1 GB on my computer with 4 cores.

    //Generate a dynamic Naive Bayes with only multinomial variables
    //(the generator's setters are static, so no instance is needed)

    //Set the number of discrete variables, their number of states,
    //and the number of continuous variables
    DynamicBayesianNetworkGenerator.setNumberOfContinuousVars(0);
    DynamicBayesianNetworkGenerator.setNumberOfDiscreteVars(13);
    DynamicBayesianNetworkGenerator.setNumberOfStates(41);

    //The number of states for the class variable is equal to 2
    DynamicBayesianNetwork dynamicNB = DynamicBayesianNetworkGenerator.generateDynamicNaiveBayes(new Random(0), 2, true);

    System.out.println(dynamicNB.getDynamicDAG().toString());
    System.out.println(dynamicNB.toString());

    //Sampling from the generated Dynamic NB
    DynamicBayesianNetworkSampler sampler = new DynamicBayesianNetworkSampler(dynamicNB);
    sampler.setSeed(0);

    //Sample from the dynamic NB: 100,000 sequences of length 3
    DataStream<DynamicDataInstance> data = sampler.sampleToDataBase(100000,3);

    //Structure learning is excluded from the test, i.e., we use the initial
    //dynamic Naive Bayes structure directly and only test parameter learning

    //Parameter Learning
    Stopwatch watch = Stopwatch.createStarted();

    ParameterLearningAlgorithm parallelMaximumLikelihood = new ParallelMaximumLikelihood();
    parallelMaximumLikelihood.setParallelMode(true);
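    //The window size is the mini-batch size used when processing the data stream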
    parallelMaximumLikelihood.setWindowsSize(1000);
    parallelMaximumLikelihood.setDynamicDAG(dynamicNB.getDynamicDAG());
    parallelMaximumLikelihood.initLearning();
    parallelMaximumLikelihood.updateModel(data);

    DynamicBayesianNetwork bnet = parallelMaximumLikelihood.getLearntDBN();

    System.out.println(watch.stop());
    System.out.println();

    //Check that, for each variable, the learned distribution at time T matches the true one
    for (Variable var : dynamicNB.getDynamicVariables()) {
        System.out.println("\n---------- Variable " + var.getName() + " -----------");
        // time T
        System.out.println("\nTrue distribution at time T:\n"+ dynamicNB.getConditionalDistributionTimeT(var));
        System.out.println("\nLearned distribution at time T:\n"+ bnet.getConditionalDistributionTimeT(var));
        Assert.assertTrue(bnet.getConditionalDistributionTimeT(var).equalDist(dynamicNB.getConditionalDistributionTimeT(var), 0.05));
    }
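
For reference, the snippet above is not self-contained; the imports below are a best guess at the AMIDST and helper classes it relies on (exact package paths may differ across toolbox versions, so verify against your version):

    // Assumed imports; package paths are a guess based on the AMIDST layout
    import com.google.common.base.Stopwatch;
    import eu.amidst.core.datastream.DataStream;
    import eu.amidst.core.variables.Variable;
    import eu.amidst.dynamic.datastream.DynamicDataInstance;
    import eu.amidst.dynamic.learning.parametric.ParallelMaximumLikelihood;
    import eu.amidst.dynamic.learning.parametric.ParameterLearningAlgorithm;
    import eu.amidst.dynamic.models.DynamicBayesianNetwork;
    import eu.amidst.dynamic.utils.DynamicBayesianNetworkGenerator;
    import eu.amidst.dynamic.utils.DynamicBayesianNetworkSampler;
    import java.util.Random;
    import org.junit.Assert;
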
gowthamnatarajan commented 6 years ago

I ran the same test and it passed, but I have 20 cores and it took 35 GB of memory!!! What could be wrong? I used 1.6.1. I reran it and it took 19 GB; then again and it took 10 GB; then again and it went up to 17 GB. Why is it so variable? There is this variability even with my own dataset.

gowthamnatarajan commented 6 years ago

Also, the memory usage constantly increases as the program runs.

andresmasegosa commented 6 years ago

Could you please run the test with parallel mode set to false? And also run the test with the maximum Java heap size limited (-Xmx1g) and report the results. Thanks.

gowthamnatarajan commented 6 years ago

I had set Xmx to 99 GB (-Xms2048m -Xmx99000m) and it was taking a LOT of memory, as mentioned above. Then I set both settings to 2 GB and it never took more than 700 MB, even with parallel mode set to true and all 20 cores!!! Is the garbage collector getting lazy when I give it too much memory? I had never heard of this.

andresmasegosa commented 6 years ago

I think so. This is an issue with Java's garbage collector: with a very large -Xmx, the JVM prefers growing the heap over collecting, so resident memory climbs far beyond the live data. I guess the new Java Streams API has something to do with it too.
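
A quick way to see this effect is to compare how much heap the JVM has reserved with how much is actually in use; a minimal sketch using the standard Runtime API (not AMIDST-specific):

    // With a large -Xmx, totalMemory() (the reserved heap) can grow far
    // beyond the live data before the GC bothers to collect.
    Runtime rt = Runtime.getRuntime();
    long usedBytes = rt.totalMemory() - rt.freeMemory();
    System.out.printf("max=%d MB, reserved=%d MB, used=%d MB%n",
            rt.maxMemory() >> 20, rt.totalMemory() >> 20, usedBytes >> 20);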