1086-Maria-Big-Data / JobAdAnalytics


Kryo Serializer Buffer Overflow #51

Closed willtsoft closed 3 years ago

willtsoft commented 3 years ago

When doing a mapping and splitting strings, there is a limit (of about 319 on my machine for the take number):

import cc.warc._
import spark.session.AppSparkSession
import org.archive.archivespark._
import org.archive.archivespark.functions._

val rdd = WarcUtil.load("s3a://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/1471982290442.1/warc/CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz")

val xx = rdd.take(100).map(x1 => SuperWarc(x1)).map { r => r.payload(textOnly = true).split(" ").mkString("Array(", ", ", ")").contains("Comments") }

println(xx.count(_==true))

// Similarly, one can use ArchiveSpark in a different way, ultimately implementing the same thing:

val xxt = rdd.enrich(HtmlText.ofEach(Html.all("body"))).toJsonStrings.take(5000).map { r => r.split(" ").mkString("Array(", ", ", ")").contains("Comments") }

println(xxt.count(_==true))

// This way I can at least take 5000 before hitting the exception: org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 51. To avoid this, increase spark.kryoserializer.buffer.max value.
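For what it's worth, the overflow happens while Kryo serializes the records that take ships back to the driver, so keeping the predicate on the executors and returning only a count sidesteps that buffer entirely. A minimal sketch under that assumption, reusing WarcUtil/SuperWarc from above (note it counts over the whole file rather than the first N records):

val matches = rdd
  .map(x1 => SuperWarc(x1))                                  // wrap each record on the executors
  .map(r => r.payload(textOnly = true).contains("Comments")) // evaluate the predicate remotely
  .filter(identity)                                          // keep only matching records
  .count()                                                   // only a Long crosses back to the driver

println(matches)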

willtsoft commented 3 years ago

Seem to have fixed it with .set("spark.kryoserializer.buffer.max.mb", "512"), but will continue testing.
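In case it helps others, a minimal sketch of setting this at session-construction time, assuming a plain SparkSession rather than the repo's AppSparkSession wrapper (the app name is just a placeholder). Note that spark.kryoserializer.buffer.max.mb has been deprecated since Spark 1.4; on newer versions the equivalent key is spark.kryoserializer.buffer.max with a size string:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("JobAdAnalytics")                                          // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // make sure Kryo is active
  .set("spark.kryoserializer.buffer.max", "512m")                        // non-deprecated form of buffer.max.mb = 512

val spark = SparkSession.builder().config(conf).getOrCreate()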