USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Develop #62

Closed: arelaxend closed this 7 years ago

arelaxend commented 7 years ago

Hi! HNY!

I have finished a complete rework of the project because there were several issues related to:

A.

thammegowda commented 7 years ago

Hi @arelaxend HNY to you too..

This is great. I appreciate the time you spent making these changes. Thanks for starting to optimize, simplify, and fix bugs.

I will give you comments in the code review and ask you for more details when required.

As of now, it would be great to have a brief description of the bugs, optimizations, simplifications, etc. (for the sake of record keeping and to inform the team).

Thanks :-) T.G.

thammegowda commented 7 years ago

@arelaxend I totally appreciate your great work.

However, with my comments I am seeking your help to understand these changes in the PR:

  1. Need documentation/a reference for your code style.
  2. How does the grouping work? By optimizing, are you trying to maximize throughput without hitting the same host too many times? (Sparkler has to offer fair crawling, so that it doesn't overload web servers even if we have a large cluster.)
  3. What was the reason behind taking out the JBrowser/JavaScript rendering, and what are you achieving with Jsoup? How does it work when we have non-HTML/XML resources like PDFs, Office docs, etc.?
arelaxend commented 7 years ago

Thank you for the review, @thammegowda! 👍 🥇

You noticed good points! At the beginning of the review I said that I removed some parts in order to (re)start from a fresh version, since those parts were slowing down the otherwise fast code 👍 That does not mean they are removed forever, of course. So I fully agree with you on point 3: a separate module should be created to handle other resource types, potentially using JBrowser, while the proposed Jsoup-based one becomes the default module, since it works efficiently for HTML resources. I have not tried it with other types of resources yet!
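For illustration, here is a minimal sketch of the kind of Jsoup-based outlink extraction discussed above (the object and method names are my own assumptions, not Sparkler's actual code). It handles HTML well, which is also why non-HTML resources such as PDFs or Office docs would still need a separate module:

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

// Illustrative sketch only: extract absolute outlinks from fetched HTML.
// This works for HTML/XML content; PDFs, Office docs, etc. would still need
// a content-type aware parser or a browser-based fetcher.
object OutlinkExtractor {
  def extract(html: String, baseUrl: String): Set[String] = {
    val doc = Jsoup.parse(html, baseUrl)
    doc.select("a[href]").asScala
      .map(_.attr("abs:href"))   // resolve relative links against baseUrl
      .filter(_.nonEmpty)
      .toSet
  }
}
```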

As for the code style, we should stick with what is used in the open-source community. 👍

As for the fairness, I now understand why you added a delay! The efficiency improvements were not related to removing the delay, so we should bring the delay back!

So, what is the correct way to proceed? Do we close the pull request, then I make a new commit and open a new pull request?

Best, A.

thammegowda commented 7 years ago

@arelaxend This comment is for your optimization suggestions:

I am sorry to be very critical, but we can't accept these changes.

The way grouping is currently done solves a crucial requirement for this crawler: fair crawling. I do not know your background in crawling, so let me define the requirement we are addressing here. A fair crawler shouldn't make too many requests to any web server. Why? Imagine you crawl Wikipedia and it goes down for other users because your crawler is consuming all of its capacity. That is not only morally and ethically unfair, but web admins will also ban IPs bursting such requests.
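To make the requirement concrete, here is a rough sketch of the per-host grouping idea described above (purely illustrative, with an assumed `fetch` function; this is not Sparkler's actual implementation):

```scala
import java.net.URL
import org.apache.spark.rdd.RDD

// Illustrative sketch only: group URLs by host so that requests to the same
// host are made sequentially with a politeness delay, while different hosts
// can still be fetched in parallel across the cluster.
def politeFetch(urls: RDD[String], delayMs: Long)(fetch: String => String): RDD[(String, String)] = {
  urls
    .groupBy(u => new URL(u).getHost)          // all URLs of one host land in the same group
    .flatMap { case (_, hostUrls) =>
      hostUrls.map { u =>
        val content = fetch(u)                 // one request at a time per host
        Thread.sleep(delayMs)                  // politeness delay between requests to that host
        (u, content)
      }
    }
}
```

Grouping by host keeps requests to the same server sequential and spaced out, while the cluster as a whole still crawls many hosts in parallel.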

Do's and Don'ts for a fair crawler:

thammegowda commented 7 years ago

As for the code style, we should stick with what is used in the open-source community.

Great :+1:

So, what is the correct way to proceed? Do we close the pull request, then I make a new commit and open a new pull request?

  1. Close this PR
  2. Create issues and describe your proposed changes/suggestions. Hear feedback from the other team members. No feedback == silence == all good, you may proceed.
  3. Raise new pull requests. We would rather have many smaller PRs than one large PR.
arelaxend commented 7 years ago

Your first comment is very interesting. Actually, we could improve efficiency and still keep a high level of fairness, since the two do not seem to be contradictory. For example, we can widen the domain range (by selecting more domains in the queries), thus keeping the overall throughput high while reducing the throughput per domain. In the limit, if we ensure that we take only one URL per domain, we can remove the need for a sleep(). I will create an issue on that topic.
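A rough sketch of that idea (purely illustrative; the function and names below are my own assumptions, not the project's code): pick at most one pending URL per host in each crawl round, so no per-host delay is needed within the round while overall throughput stays high across many hosts.

```scala
import java.net.URL
import org.apache.spark.rdd.RDD

// Illustrative sketch only: select at most one pending URL per host for this
// crawl round, so no per-host delay is needed within the batch.
def selectOnePerHost(pending: RDD[String]): RDD[String] = {
  pending
    .map(u => (new URL(u).getHost, u))   // key each URL by its host
    .reduceByKey((first, _) => first)    // keep a single URL per host (arbitrary choice)
    .values
}
```

With this kind of selection, fairness comes from never issuing more than one request per host per round, rather than from sleeping between requests.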