deeplearning4j / deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a modular, tiny C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
http://deeplearning4j.konduit.ai
Apache License 2.0

Null Pointer Exception Spark Word2Vec #2348

Closed: dmmiller612 closed this issue 7 years ago

dmmiller612 commented 7 years ago

Currently I am getting this exception when running Spark's Word2Vec job:

    Exception in thread "main" java.lang.NullPointerException
        at org.nd4j.linalg.api.shape.Shape.elementWiseStride(Shape.java:1647)
        at org.nd4j.linalg.api.ndarray.BaseNDArray.elementWiseStride(BaseNDArray.java:755)
        at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(NativeOpExecutioner.java:502)
        at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.exec(NativeOpExecutioner.java:55)
        at org.nd4j.linalg.api.ndarray.BaseNDArray.addi(BaseNDArray.java:2876)
        at org.nd4j.linalg.api.ndarray.BaseNDArray.addi(BaseNDArray.java:2853)
        at org.deeplearning4j.spark.models.embeddings.word2vec.Word2Vec.train(Word2Vec.java:217)
        at com.inin.sparkscripts.jobs.Word2VecJob.performJob(Word2VecJob.java:54)
        at com.inin.sparkscripts.jobs.Word2VecJob.main(Word2VecJob.java:61)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The code that I am using is:

        JavaRDD<String> corpus = SparkConfig.getSparkContext()
                .textFile("corpus/*.txt")
                // note: replaceAll, not replace - " +" is a regex, and String.replace would match it literally
                .map(item -> item.trim().replaceAll(" +", " "))
                .filter(item -> item != null && !item.equals(""));
        TokenizerFactory t = new DefaultTokenizerFactory();
        t.setTokenPreProcessor(new CommonPreprocessor());
        Word2Vec word2Vec = new Word2Vec.Builder()
                .minWordFrequency(5)
                .iterations(1)
                .layerSize(100)
                .seed(42)
                .windowSize(5)
                .tokenizerFactory(t)
                .build();
        word2Vec.train(corpus);

When doing some debugging, it looks like shapeInformation is null in the BaseNDArray. I looked at the JavaRDD, and every element has at least one String. If the bug is on my end, it might be nice to add a custom exception for this (I could even do that myself) so it is easier to diagnose in the future.

raver119 commented 7 years ago

Are you using spark-submit or something like that? Probably yes.

It looks like a ser/de issue, but at the moment I'm not sure how you got it; nobody has reported it before.

dmmiller612 commented 7 years ago

Ah, yes:

    spark-submit --class com.inin.sparkscripts.jobs.Word2VecJob --master local[2] --driver-memory 6g --executor-memory 4g target/qmScript-jar-with-dependencies.jar

dmmiller612 commented 7 years ago

I am wondering if it could be some unusual symbols. Here is a list that, when parallelized, causes the issue.

    List<String> items = Lists.newArrayList(
            "December 1998",
            "Your contribution to Goodwill will mean more than you may know.",
            "To help you see how much your contribution means, I'm sharing with you",
            "The words of people who have lived Goodwill's mission. We want you to",
            "Know why your support of Goodwill is so important.",
            "Your gift to Goodwill is important because people with physical and",
            "Mental disabilities sometimes need an extra hand to know the pride that",
            "comes with work.",
            "\"I was sad when I couldn't go to the snack bar to buy a soda. Now I",
            "can buy a soda and spend money. I like working and making money. I have a",
            "savings account. I can write my name on the deposit slip. If I wasn't",
            "working here ...I would be sad ...\" -- Maureen",
            "Because turning welfare recipients into tax payers just makes",
            "sense.",
            "\"When I first came to Goodwill I was a single parent with little or no",
            "self-esteem. I was on welfare and without my diploma. Coming to Goodwill",
            "was the first step toward my becoming totally independent. I am now",
            "...totally off of welfare. I really like my job.\" -- Sherry",
            "Because people want to work.",
            "\"I'd never finished high school. I had no experience or skills ... The",
            "only thing I did know for sure was here's a chance to change things for",
            "me and my children... I rode a bike to Goodwill in the rain and snow. I",
            "wanted to be there ...I had my second chance to change my life.\" --",
            "Donna",
            "Because teaching a man to fish will keep him fed for his entire"
    );

Ha, also worth noting: these are sentences from a very large public corpus (MASC), so there is nothing personal in here.

EDIT:

It looks like the problem still exists even when punctuation is removed. I really suspect it is related to the other issue I opened here: https://github.com/deeplearning4j/deeplearning4j/issues/2347. I may check whether the new version fixes it, once it's out.

Here is the log trace right before the failure, which makes me think the issues may be related:

    16/11/21 12:41:55 INFO NativeOps: Number of threads used for NativeOps: 4
    Unable to guess runtime. Please set OMP_NUM_THREADS or equivalent manually.
    16/11/21 12:41:55 INFO Nd4jBlas: Number of threads used for BLAS: 4
    16/11/21 12:41:56 INFO Reflections: Reflections took 607 ms to scan 2 urls, producing 123 keys and 419 values
    16/11/21 12:41:56 INFO Executor: Finished task 0.0 in stage 5.0 (TID 17). 3656 bytes result sent to driver
    16/11/21 12:41:56 INFO Executor: Finished task 3.0 in stage 5.0 (TID 20). 3921 bytes result sent to driver
    16/11/21 12:41:56 INFO Executor: Finished task 1.0 in stage 5.0 (TID 18). 3846 bytes result sent to driver
    16/11/21 12:41:56 INFO Executor: Finished task 2.0 in stage 5.0 (TID 19). 3921 bytes result sent to driver
    16/11/21 12:41:56 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 17) in 1381 ms on localhost (1/4)
    16/11/21 12:41:56 INFO TaskSetManager: Finished task 3.0 in stage 5.0 (TID 20) in 1376 ms on localhost (2/4)
    16/11/21 12:41:56 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 18) in 1381 ms on localhost (3/4)
    16/11/21 12:41:56 INFO TaskSetManager: Finished task 2.0 in stage 5.0 (TID 19) in 1379 ms on localhost (4/4)
    16/11/21 12:41:56 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
    16/11/21 12:41:56 INFO DAGScheduler: ResultStage 5 (collect at Word2Vec.java:207) finished in 1.385 s
    16/11/21 12:41:56 INFO DAGScheduler: Job 5 finished: collect at Word2Vec.java:207, took 1.399808 s
    16/11/21 12:41:56 INFO Word2Vec: Averaging results...

raver119 commented 7 years ago

That's an INDArray ser/de issue; I said that before. Probably the Kryo module is required in your environment? I can't think of any other way to get null as the shape buffer.

dmmiller612 commented 7 years ago

I'm not sure what you mean about Kryo being required in the environment. I tried adding sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); with no luck. My SparkConf and context are bare bones at the moment (along with the arguments from spark-submit).

        SparkConf sparkConf = new SparkConf().setAppName("qm-predict");
        sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        sparkContext = new JavaSparkContext(sparkConf);

Do you have to create an environment variable for Kryo? I have never explicitly used it before.

raver119 commented 7 years ago

Please check out https://deeplearning4j.org/spark#kryo and let me know your results after you try that module.
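
For reference, the fix that page describes is, roughly, to put the ND4J Kryo module on the classpath (the org.nd4j:nd4j-kryo artifact matching your Scala version, at the time _2.10 or _2.11) and register its Kryo registrator in addition to enabling the Kryo serializer. A minimal sketch of the driver setup under those assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf sparkConf = new SparkConf().setAppName("qm-predict");
    // Enabling Kryo alone is not enough: ND4J's INDArrays hold off-heap
    // buffers, so a custom registrator is needed for their shape
    // information to survive serialization/deserialization.
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    sparkConf.set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator");
    JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);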

dmmiller612 commented 7 years ago

@raver119 Ugh, rtfm. Totally my fault. Thanks for your patience. I'll close the ticket, but first, would you guys maybe want a custom exception for that scenario? I'd be willing to add it, if so.
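
For what it's worth, a guard along these lines is roughly what that custom exception could look like. This is only a hypothetical sketch (requireShapeInfo is my name, not actual ND4J code), assuming INDArray.shapeInfoDataBuffer() exposes the underlying shape buffer:

    import org.nd4j.linalg.api.ndarray.INDArray;

    // Hypothetical fail-fast check an op executioner could run before using an
    // array, turning the opaque NullPointerException into an actionable error.
    public static void requireShapeInfo(INDArray arr) {
        if (arr == null || arr.shapeInfoDataBuffer() == null) {
            throw new IllegalStateException(
                    "INDArray shape information is null. On Spark this usually means "
                  + "the array was serialized without the ND4J Kryo registrator; see "
                  + "https://deeplearning4j.org/spark#kryo");
        }
    }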

raver119 commented 7 years ago

The problem is that there are countless possible serializer mechanisms out there. But if you have any ideas there, we'd be happy to get a PR :)

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.