apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0
883 stars 260 forks

Migrate Storm-Crawler to Apache Flink #568

Closed IvanBiv closed 4 years ago

IvanBiv commented 6 years ago

@jnioche have you thought about migrating this SDK to the Apache Flink platform? I find Flink better than Storm. What do you think?

jnioche commented 6 years ago

I haven't, to be honest

I see that my friend @kkrugler recently gave a talk on the subject so it's definitely worth exploring

https://sf-2018.flink-forward.org/kb_sessions/building-a-scalable-focused-web-crawler-with-flink/

@IvanBiv what would the benefits be? why not go for Apache Beam which is more generic?

IvanBiv commented 6 years ago

@jnioche thanks for the link.

For me, Flink is better than Storm for two reasons: 1) easy deployment on a Docker cluster (Kubernetes). I could not run a Storm cluster on Docker Swarm, and searching for a Kubernetes option showed that there is no good solution there either. 2) community

Yep, Apache Beam could be better as middleware between the user's processing topology and the underlying processing platform.

Julien, you have a lot of work invested in StormCrawler, so it is worth considering the prospects, I mean those of the processing platform.

jnioche commented 6 years ago

Thanks @IvanBiv

I don't really see deployment on Docker as a reason to move away from Storm. As for the community, there's nothing wrong with the Apache Storm one: it is certainly not the largest, that's true, but the project is alive and doing well.

> Julien, you have a lot of work invested in StormCrawler, so it is worth considering the prospects, I mean those of the processing platform.

Sure, but it is also because I invested loads of time in Storm that I won't dump it without very good reasons. There are loads of competing frameworks for stream processing and new ones emerging all the time but as things stand I am happy with Storm. That does not mean that I am not open minded and will never consider anything else though, it's just that I'd need more compelling arguments.

I'd be curious to hear what @kkrugler thinks.

sebastian-nagel commented 6 years ago

> I could not run a Storm cluster on Docker Swarm, and searching for a Kubernetes option showed that there is no good solution there either.

Really? There are solutions maintained by both the Storm and Kubernetes teams/projects. Wouldn't the time be better invested in improving these than in porting a crawler? Especially given that there is already flink-crawler.

jorgelbg commented 6 years ago

I agree with @sebastian-nagel and @jnioche: at the moment there is no good (enough) reason to rewrite everything in Apache Flink. A lot of effort has already been put into creating and maintaining storm-crawler.

Moreover, difficulty deploying this project in a specific environment does not justify the investment of a full rewrite into a different streaming framework, IMHO.

kkrugler commented 6 years ago

I've been following this discussion, and thought I'd chime in with a few thoughts:

  1. Most of the work in developing a continuous crawler lies outside of the actual streaming environment, though being able to leverage bits from the crawler-commons project reduces that burden. The bottom line is that much of the heavy lifting is separate from the exact details of how various functions run in the streaming environment.
  2. I started on flink-crawler not because I thought that the storm-crawler project was unsuccessful, but because I wanted to explore using Flink as a platform for a continuous crawler (especially given its support for iterations), and as a test for whether it was possible to create a very simple (no other infrastructure) crawler that still was scalable and efficient.
  3. I haven't tried either Storm or Flink with Docker (or Kubernetes), so I can't speak about the level of effort or quality of integration.
  4. I've been dealing with the issue of "platform aging" in the bixo crawler project, as (a) it's become clear that continuous crawling has many advantages over batch, especially for focused crawls, and (b) the Cascading/Map-reduce platform is quickly becoming less interesting. So that's the other reason why doing something with flink-crawler was interesting.
  5. Net-net, I don't think there's a compelling reason to port storm-crawler to a different streaming environment at this point (but see below).

To be honest, I do worry a bit about Storm, as I've watched the community of streaming users transition to Spark, Samza, Flink, Heron, Kafka Streams and other options over time. One metric I use is tracking activity on the user mailing list for a project. Here's that graph for Storm, from the Apache mail archives:

[Screenshot, 2018-05-17: user mailing list activity graph for Storm]

and the same result from Flink:

[Screenshot, 2018-05-17: user mailing list activity graph for Flink]

and also for Spark:

[Screenshot, 2018-05-17: user mailing list activity graph for Spark]

All projects go through a maturity phase where the level of user activity drops off, so I don't think Storm is dead, but I do think in a year it could be time to revisit this discussion.

jnioche commented 5 years ago

See the Sematext trends for Flink, Storm and Samza (I excluded Spark because a lot of its activity would not be about its streaming capabilities):

https://sematext.com/opensee/report/project/trend?q=Flink,Storm,Samza

jnioche commented 4 years ago

Not actionable, closing for now. Feel free to reopen if relevant.