USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Develop #62

Closed: arelaxend closed this 7 years ago

arelaxend commented 7 years ago

Hi! HNY!

I have finished a complete rework of the project because there were several issues related to:

A.

thammegowda commented 7 years ago

Hi @arelaxend HNY to you too..

This is great. I appreciate the time you spent making these changes. Thanks for starting to optimize, simplify, and fix bugs.

I will give you comments in the code review and ask you for more details when required.

As of now, it would be great to have a brief description of the bugs, optimizations, simplifications, etc. (for the sake of record keeping and to inform the team).

Thanks :-) T.G.

thammegowda commented 7 years ago

@arelaxend I totally appreciate your great work.

However, with my comments I am seeking your help to understand these changes in the PR:

  1. Need documentation/a reference for your code style.
  2. How does the grouping work? By optimizing, are you trying to maximize throughput without hitting the same host too many times? (Sparkler has to offer fair crawling, so that it doesn't overload web servers even if we have a large cluster.)
  3. What was the reason behind taking out the JBrowser/JavaScript rendering, and what are you achieving with Jsoup? How does it work when we have non-HTML/XML resources like PDFs, Office docs, etc.?
arelaxend commented 7 years ago

Thank you for the review, @thammegowda! 👍 🥇

You noticed good points! At the beginning of the review I said that I removed some parts in order to (re)start from a fresh version, since those parts were slowing down the otherwise fast code 👍 That does not mean they are removed forever, of course. So I fully agree with you on point 3: a separate module should be created to handle other resource types, potentially using JBrowser, while the proposed Jsoup-based one becomes the default module, since it works efficiently for HTML resources. I have not tried it with other types of resources yet!
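For illustration, here is a minimal sketch of the kind of Jsoup-based outlink extraction discussed above (the object and method names are my own assumptions, not Sparkler's actual code). It handles HTML well, which is also why non-HTML resources such as PDFs or Office docs would still need a separate module:

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

// Illustrative sketch only: extract absolute outlinks from fetched HTML.
// This works for HTML/XML content; PDFs, Office docs, etc. would still need
// a content-type aware parser or a browser-based fetcher.
object OutlinkExtractor {
  def extract(html: String, baseUrl: String): Set[String] = {
    val doc = Jsoup.parse(html, baseUrl)
    doc.select("a[href]").asScala
      .map(_.attr("abs:href"))   // resolve relative links against baseUrl
      .filter(_.nonEmpty)
      .toSet
  }
}
```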

As for the code style, we should stick with what is used in the open-source community. 👍

As for the fairness, I now understand why you added a delay! The efficiency improvements were not related to removing the delay, so we should bring the delay back!

So, what is the correct way to proceed? Do we close the pull request, then I make a new commit and open a new pull request?

Best, A.

thammegowda commented 7 years ago

@arelaxend This comment is for your optimization suggestions:

I am sorry to be very critical, but we can't accept these changes.

The way grouping is currently done solves a crucial requirement for this crawler: fair crawling. I do not know your background in crawling, so let me define the requirement we are addressing here. A fair crawler shouldn't make too many requests to any web server. Why? Imagine you crawl Wikipedia and it goes down for other users because your crawler is consuming all of its capacity. That is not only morally and ethically unfair, but web admins will also ban IPs bursting such requests.
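To make the requirement concrete, here is a rough sketch of the per-host grouping idea described above (purely illustrative, with an assumed `fetch` function; this is not Sparkler's actual implementation):

```scala
import java.net.URL
import org.apache.spark.rdd.RDD

// Illustrative sketch only: group URLs by host so that requests to the same
// host are made sequentially with a politeness delay, while different hosts
// can still be fetched in parallel across the cluster.
def politeFetch(urls: RDD[String], delayMs: Long)(fetch: String => String): RDD[(String, String)] = {
  urls
    .groupBy(u => new URL(u).getHost)          // all URLs of one host land in the same group
    .flatMap { case (_, hostUrls) =>
      hostUrls.map { u =>
        val content = fetch(u)                 // one request at a time per host
        Thread.sleep(delayMs)                  // politeness delay between requests to that host
        (u, content)
      }
    }
}
```

Grouping by host keeps requests to the same server sequential and spaced out, while the cluster as a whole still crawls many hosts in parallel.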

Do's and Don'ts for a fair crawler:

thammegowda commented 7 years ago

As for the code style, we should stick with what is used in the open-source community.

Great :+1:

So, what is the correct way to proceed? Do we close the pull request, then I make a new commit and open a new pull request?

  1. Close this PR
  2. Create issues and describe your proposed changes/suggestions. Hear feedback from the other team members. No feedback == silence == all good, you may proceed.
  3. Raise new pull requests. We would rather have many smaller PRs than one large PR.
arelaxend commented 7 years ago

Your first comment is very interesting. Actually, we could improve efficiency and still keep a high level of fairness, since the two do not seem to be contradictory. For example, we can widen the domain range (by selecting more domains in the queries), thus keeping the overall throughput high while reducing the throughput per domain. In the limit, if we ensure that we take only one URL per domain, we can remove the need for a sleep(). I will create an issue on that topic.
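A rough sketch of that idea (purely illustrative; the function and names below are my own assumptions, not the project's code): pick at most one pending URL per host in each crawl round, so no per-host delay is needed within the round while overall throughput stays high across many hosts.

```scala
import java.net.URL
import org.apache.spark.rdd.RDD

// Illustrative sketch only: select at most one pending URL per host for this
// crawl round, so no per-host delay is needed within the batch.
def selectOnePerHost(pending: RDD[String]): RDD[String] = {
  pending
    .map(u => (new URL(u).getHost, u))   // key each URL by its host
    .reduceByKey((first, _) => first)    // keep a single URL per host (arbitrary choice)
    .values
}
```

With this kind of selection, fairness comes from never issuing more than one request per host per round, rather than from sleeping between requests.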