val rdd = WarcUtil.load("s3a://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/1471982290442.1/warc/CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz")
val xx=rdd.take(100).map(x1=>SuperWarc(x1)).map{r =>(r.payload(textOnly = true).split(" ").mkString("Array(", ", ", ")").contains("Comments"))}
println(xx.count(_==true))
//Similarly one can try using Archive Spark in a different way, but ultimately implementing the same thing:
val xxt=rdd.enrich(HtmlText.ofEach(Html.all("body"))).toJsonStrings.take(5000).map{r=>r.split(" ").mkString("Array(", ", ", ")").contains("Comments")}
println(xxt.count(_==true))
//this way I can atleast take 5000 before the exception: org.apache.spark.SparkException: Kryo serialization failed: Buffer //overflow.
//Available: 0, required: 51. To avoid this, increase spark.kryoserializer.buffer.max value.
When mapping and splitting strings there is a limit (on my machine, take fails above roughly 319 records):
import cc.warc._
import spark.session.AppSparkSession
import org.archive.archivespark._
import org.archive.archivespark.functions._
val rdd = WarcUtil.load("s3a://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/1471982290442.1/warc/CC-MAIN-20160823195810-00000-ip-10-153-172-175.ec2.internal.warc.gz")
// Collect 100 records to the driver, wrap each in a SuperWarc, and check
// whether the plain-text payload mentions "Comments".
val xx = rdd.take(100)
  .map(x1 => SuperWarc(x1))
  .map(r => r.payload(textOnly = true).split(" ").mkString("Array(", ", ", ")").contains("Comments"))
println(xx.count(_ == true))
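The limit seems to come from take(n) materializing the raw records on the driver. A minimal sketch of the same count done entirely on the executors instead, assuming SuperWarc and its payload call work inside a task closure:

// Filter and count on the cluster; only the final Long travels back to the
// driver, so no per-record buffers are collected there.
val hits = rdd
  .map(x1 => SuperWarc(x1))
  .filter(_.payload(textOnly = true).contains("Comments"))
  .count()
println(hits)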
// Similarly, one can use ArchiveSpark in a different way, ultimately implementing the same check:
// Enrich each record with the text of its <body>, serialize to JSON strings,
// and check the first 5000 for the word "Comments".
val xxt = rdd.enrich(HtmlText.ofEach(Html.all("body")))
  .toJsonStrings
  .take(5000)
  .map(r => r.split(" ").mkString("Array(", ", ", ")").contains("Comments"))
println(xxt.count(_ == true))
// This way I can take at least 5000 records before hitting the exception:
// org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow.
// Available: 0, required: 51. To avoid this, increase spark.kryoserializer.buffer.max value.
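Since the message itself suggests raising spark.kryoserializer.buffer.max, a minimal sketch of doing that when building the session; the config keys are standard Spark settings, but the builder setup, app name, and the 512m value are assumptions (I don't know what AppSparkSession configures internally):

import org.apache.spark.sql.SparkSession

// Hypothetical session setup: enable Kryo and raise its max buffer
// (Spark's default for spark.kryoserializer.buffer.max is 64m).
val spark = SparkSession.builder()
  .appName("warc-comments") // illustrative app name
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer.max", "512m")
  .getOrCreate()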