1086-Maria-Big-Data / JobAdAnalytics

Use contains to filter by #45

Closed willtsoft closed 3 years ago

willtsoft commented 3 years ago

First build an array of strings from SuperWarc, then use contains to filter. This corrects the code from last week and reuses already-written code to improve it. It also fixes the bug of calling contains without first splitting the payload into elements.
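To illustrate why splitting matters, here is a minimal sketch in plain Scala. The payload string and the object name `ContainsDemo` are made up for illustration and are not from the project. On a `String`, `contains` is a substring test; on an `Array[String]`, it is an element test.

```scala
object ContainsDemo {
  // Hypothetical payload text standing in for a record's text-only payload.
  val payload: String = "Leave your Comments below"

  // String.contains is a substring test, so partial words also match.
  val substringHit: Boolean = payload.contains("Comment")

  // Array.contains is an element test, so only whole words match.
  val words: Array[String] = payload.split(" ")
  val wholeWordHit: Boolean = words.contains("Comments")
  val partialWordHit: Boolean = words.contains("Comment")

  def main(args: Array[String]): Unit =
    println(s"substring: $substringHit, whole word: $wholeWordHit, partial: $partialWordHit")
}
```

Here `substringHit` and `wholeWordHit` are true, while `partialWordHit` is false because no element equals "Comment" exactly.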

willtsoft commented 3 years ago

It seems to be fixed if I convert the payload to a string, split it on spaces, and then use the words to count how many payloads contain the word (at most 1 per record, since contains returns true or false for each record). This created a new problem, a buffer overflow, so I am trying repartitioning. We will see if that works.

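// Load one Common Crawl WARC file from S3.
val rdd = WarcUtil.load(path = "s3a://commoncrawl/crawl-data/CC-MAIN-2014-23/segments/1405997885796.93/warc/CC-MAIN-20140722025805-00016-ip-10-33-131-23.ec2.internal.warc.gz")

// rdd2 is not defined in this snippet; presumably it is derived from rdd (for example, by repartitioning).
val xx2 = rdd2.take(350).map(x1 => SuperWarc(x1)).map { r => r.payload(textOnly = true).split(" ").mkString("Array(", ", ", ")").contains("Comments") }

// Each record contributes at most 1 to the count.
println(xx2.count(_ == true))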
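The per-record counting described above can be sketched without Spark. This is only an illustration under assumptions: the sample payloads, the helper `matches`, and the object name `RecordCountDemo` are hypothetical, and plain strings stand in for what `r.payload(textOnly = true)` returns.

```scala
object RecordCountDemo {
  // Hypothetical payloads standing in for the text-only payload of each record.
  val payloads: Seq[String] = Seq(
    "Comments Comments Comments",
    "No match here",
    "Read the Comments"
  )

  // One Boolean per record: does any whole word equal the target?
  def matches(payload: String, word: String): Boolean =
    payload.split(" ").contains(word)

  // Number of records containing the word (each record counts at most once).
  val recordsWithWord: Int = payloads.count(p => matches(p, "Comments"))

  // Total occurrences across all records, for comparison.
  val totalOccurrences: Int =
    payloads.map(_.split(" ").count(_ == "Comments")).sum

  def main(args: Array[String]): Unit =
    println(s"records: $recordsWithWord, occurrences: $totalOccurrences")
}
```

With these sample payloads, `recordsWithWord` is 2 and `totalOccurrences` is 4. Note that calling `mkString` on the split array yields a `String` again, so a `contains` applied after it is a substring test rather than an element test.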