KadekM / scrawler

Scala web crawling and scraping using fs2 streams
MIT License

parallelCrawl emits visits/stream-results in one chunk #34

Open visox opened 7 years ago

visox commented 7 years ago

Code to reproduce:

import com.marekkadek.scraper.Document
import com.marekkadek.scraper.jsoup.JsoupBrowser
import com.marekkadek.scrawler.crawlers.{Visit, YieldData, Yield, Crawler}
import fs2.{Strategy, Stream, Task}
import scala.concurrent.duration._

class BadCrawler extends Crawler[Task, Int](Seq(JsoupBrowser[Task](
  connectionTimeout = 20 seconds
))) {

  var visited = 0

  override protected def onDocument(document: Document): Stream[Task, Yield[Int]] = {
    val visit = (1 to 10).map{_ =>
      Visit("http://example.com/")
    }

    visited = visited + 1

    println(s"visited: $visited")

    Stream.emit(YieldData(visited)) ++ Stream.emits(visit)
  }
}

object BadCrawler extends App {
  implicit val strategy: Strategy = Strategy.fromFixedDaemonPool(100)

  val crawler = new BadCrawler()

  val stream: Stream[Task, Int] = crawler.parallelCrawl("http://example.com/", maxConnections = 10)

  stream
    .map{result =>
      println(s"result: $result")
      result
    }
    .runLog
    .unsafeRun()

}

Once run, the output looks like this:

visited: 1
result: 1
visited: 2
visited: 3
visited: 4
visited: 5
visited: 6
visited: 7
visited: 8
visited: 9
visited: 10
visited: 11
result: 2
result: 3
result: 4
result: 5
result: 6
result: 7
result: 8
result: 9
result: 10
result: 11
visited: 12
...
visited: 111
result: 12
...
result: 111
visited: 112
// FOR SOME TIME NOTHING 
...
visited: 1111
result: 112
...
result: 1111
// NOTHING HAPPENS (only after quite some time)

I don't mind that 10 visits need to happen before I get 10 results, but when there are more pages to be visited than maxConnections, the behavior lags: both the visited and result output appear suddenly, after a long stretch of evaluation with no output at all.
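The growing pauses follow from the branching factor: every visited page yields 10 new Visits, so round k of the breadth-first crawl contains 10^k pages, and a round's results only appear once the whole round has been processed. A quick sketch of the round boundaries (plain Scala arithmetic, no scrawler code; the names are illustrative):

```scala
object RoundSizes {
  // pages visited in round k (round 0 is the seed URL)
  def roundSize(k: Int): Int = math.pow(10, k).toInt

  // total pages visited once round k has completed
  def cumulativeVisited(k: Int): Int = (0 to k).map(roundSize).sum

  def main(args: Array[String]): Unit = {
    // matches the stalls in the log above: 1, 11, 111, 1111, ...
    (0 to 3).foreach(k => println(s"after round $k: ${cumulativeVisited(k)} visited"))
  }
}
```

This matches the log: results stall at 1, 11, 111 and 1111, and each silent gap is roughly ten times longer than the previous one.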

It would be desirable to emit results as soon as they are available.

Right now, as a private workaround, I store the to-visit URL collection myself and feed onDocument the URLs in batches of a managed size; that way I only have to wait for the next 10 results.
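The workaround described above can be sketched as a small pending-URL buffer that releases at most one batch of Visits per onDocument call, so the crawler only ever blocks on one small batch. This is a hedged sketch, not scrawler API: `PendingUrls` and `batchSize` are invented names, and the buffer is not thread-safe, so a crawler running onDocument concurrently would need to add synchronization.

```scala
import scala.collection.mutable

// Buffer of discovered-but-not-yet-visited URLs.
// NOTE: not thread-safe; guard with synchronization if onDocument
// can run concurrently.
final class PendingUrls(batchSize: Int) {
  private val queue = mutable.Queue.empty[String]

  // Record newly discovered URLs instead of yielding them all as Visits.
  def enqueue(urls: Seq[String]): Unit = queue ++= urls

  // Take the next batch to turn into Visit(...) values inside onDocument.
  def nextBatch(): Seq[String] = {
    val n = math.min(batchSize, queue.size)
    (1 to n).map(_ => queue.dequeue())
  }
}
```

With `batchSize` matching `maxConnections = 10`, onDocument would `enqueue` the links it finds and emit `Stream.emits(pending.nextBatch().map(Visit(_)))` alongside its `YieldData`, keeping the in-flight frontier at 10 instead of letting it grow tenfold per round.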