CSIRT-MU / Stream4Flow

A framework for the real-time network traffic analysis based on world-leading technologies for distributed stream processing, network traffic monitoring, and visualization.
https://csirt.muni.cz/?lang=en
MIT License
101 stars 36 forks source link

Update to Latest Spark #3

Closed tomjirsa closed 7 years ago

tomjirsa commented 7 years ago

Update Stream4Flow to Spark Streaming latest version 2.1.0

tomjirsa commented 7 years ago

First prototype is solved in diploma thesis supervised by @xdanos

@xdanos - We need to start a transition to Latest stable Spark.

xdanos commented 7 years ago

Od Filipa moc informací nemám, resp. žádné zásadní zjištění.

Přechod na 2.1.0 by měl být bez problémů, co se týká clusteru jako takového. Aplikace asi fungovat nebudou, je potřeba něco málo přepsat, minimálně u Javy to platilo. Ovšem v Python aplikacích pro Spark se nevyznám a nedokážu to posoudit.

tomjirsa commented 7 years ago
tomjirsa commented 7 years ago

@xdanos / @cermmik:

Spark Download page

xdanos commented 7 years ago

Building Spark /w Hadoop, at Ansible runtime? Good luck :-D

There are million ways to cache the tar archive locally.

tomjirsa commented 7 years ago

When running application, following error occurs:

 File "/home/spark/applications/./protocols-statistics/protocols_statistics.py", line 165, in <module>
    input_stream = KafkaUtils.createStream(ssc, args.input_zookeeper, "spark-consumer-" + application_name, {args.input_topic: kafka_partitions})
  File "/opt/spark/spark-bin/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 70, in createStream
  File "/opt/spark/spark-bin/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/spark/spark-bin/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.createStream.
: java.lang.NoClassDefFoundError: org/apache/spark/Logging
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91)
        at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:168)
        at org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper.createStream(KafkaUtils.scala:632)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 25 more
xdanos commented 7 years ago

org/apache/spark/Logging was removed after spark 1.5.2.

probably use another library?

tomjirsa commented 7 years ago

spark-streaming-kafka-0-8-assembly_2.11 does the trick. Maven repo

xdanos commented 7 years ago

We use Kafka 0.10.x -- spark-streaming-kafka-0-10-assembly_2.11 would be better.

tomjirsa commented 7 years ago

We prefer stable version 0.8 with python support - see kafka integration guide

xdanos commented 7 years ago

Jo.

tomjirsa commented 7 years ago

Application protocol statistics tested - no need for mondification. Merged with master #27 .

Stream4Flow is now running Spark 2.1.0.