arelaxend closed this 7 years ago
Hi @arelaxend, HNY to you too!
This is great; I appreciate the time you spent making these changes. Thanks for starting to optimize, simplify, and fix bugs.
I will give you comments in the code review and ask you more details when required.
As of now, it would be great to have a brief description of the bugs, optimizations, simplifications, etc. (for the sake of record keeping and of informing the team).
Thanks :-) T.G.
@arelaxend I totally appreciate your great work.
However, with my comments I am seeking your help to understand these changes in the PR:
Thank you for the review @thammegowda ! 👍 🥇
You noticed some good points! At the beginning of the review I said that I removed some parts for the purpose of (re)starting from a fresh version, since some parts were slowing down the otherwise fast code 👍 This does not mean these parts are removed forever, of course. That's why I totally agree with you that, for point 3, a new module must be created separately to handle other resources (potentially using jbbrowser), and the proposed one should be kept as a "default" module, as it works efficiently for HTML resources. I have not tried with other types of resources yet!
As for the code style, we should keep with what is used in the open source community. 👍
As for the fairness, I now understand why you added a delay! The improvements in terms of efficiency were not related to the removal of the delay. We should use the delay again, then!
So, what is the correct process here? Do we close the pull request, then I make a new commit and open a new PR?
Best, A.
@arelaxend This comment is for your optimization suggestions:
I am sorry to be very critical, but we can't accept these changes.
The way grouping is currently done solves a very crucial requirement for this crawler: fair crawling. I do not know your background in crawling, so let me define the requirement we are addressing here. A fair crawler shouldn't make too many requests to a web server. Why? Imagine you try to crawl Wikipedia and it goes down for other users because your crawler is consuming its full capacity. Not only is it morally and ethically unfair, but the web admins will also ban the IPs bursting such requests.
Do's and Don'ts for a fair crawler:

- Don't burst requests at a web server; add a delay between two successive requests to the same host. `Thread.sleep()` is used for this purpose.
- URLs are grouped by `hostname`, and thus all requests sent to a particular host are handled by a single worker node in the cluster. The URLs we obtain may not be equally distributed across hosts, as you pointed out. Fair crawling and maximizing crawl throughput conflict with each other. I hope you now understand the choice we have made. If you aren't clear on the choice, or disagree with it, let me know! We can have a chat to discuss it in detail. That being said, the current grouping is a first version, and we are hoping to improve it without crossing the fairness boundary.
- A fair crawler obeys `robots.txt` rules, i.e., it does not access resources which the owners disallow. We have an issue, #45, which is currently open for this. Why is it crucial? Legality! If you are curious about what a robots file is, check http://www.robotstxt.org/robotstxt.html

> As for the code style, we should keep with what is used in the open source community.
Great :+1:
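To make the fairness mechanics above concrete, here is a minimal sketch of the per-host politeness policy being described: URLs are grouped by `hostname` so that each host is served by a single worker, and `Thread.sleep()` spaces out successive requests to that host. The class name `FairFetcher` and the 300 ms delay value are illustrative assumptions, not the project's actual code:

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: partition a URL frontier by hostname so that each
// host is served by exactly one worker, and pause between successive fetches.
public class FairFetcher {
    // Delay between two requests to the same host; value is illustrative.
    static final long POLITENESS_DELAY_MS = 300;

    // Group URLs by host: all URLs of one host go to a single worker.
    static Map<String, List<String>> groupByHost(List<String> urls) {
        return urls.stream()
                   .collect(Collectors.groupingBy(u -> URI.create(u).getHost()));
    }

    // One worker drains its host's queue sequentially, sleeping in between.
    static void crawlHost(List<String> urls) throws InterruptedException {
        for (String url : urls) {
            System.out.println("fetching " + url);  // the real fetch goes here
            Thread.sleep(POLITENESS_DELAY_MS);      // fairness delay per host
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> frontier = List.of(
            "https://en.wikipedia.org/wiki/Web_crawler",
            "https://example.com/a",
            "https://en.wikipedia.org/wiki/Politeness_policy");
        // In the cluster, each map entry would be handed to a different worker;
        // here we just process the per-host queues sequentially.
        for (List<String> queue : groupByHost(frontier).values()) {
            crawlHost(queue);
        }
    }
}
```

In the real crawler each entry of the map would be shipped to a separate worker node; the sequential loop here only illustrates that requests to one host never overlap.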
> So, what is the correct process of doing? We close the pull request, I make a new commit and repull?
- Close this PR
- Create issues describing your proposed changes/suggestions, and hear the feedback of other team members. No feedback == silence == all good, you may proceed
- Raise new pull requests. We would rather have many smaller PRs than one large PR
Your first comment is very interesting. Actually, we could manage both to improve efficiency and to keep a high level of fairness, since the two do not seem to be contradictory. For example, we can widen the domain range (by selecting more domains in the queries), thus keeping the overall throughput high while reducing the throughput per domain. In the limit, if we ensure that we take only one URL per domain, we can remove the need for a `sleep()`. I will create an issue on that topic.
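A hedged sketch of this one-URL-per-domain idea (class and method names are hypothetical, not from the project): each batch takes at most one URL per host, so within a batch no host is hit twice and the per-request `sleep()` becomes unnecessary, while URLs repeating a host are deferred to later batches:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposal: in each batch, pick at most one URL
// per host. Within a batch no host is contacted twice, so no Thread.sleep()
// is needed inside the batch; throughput stays high across many hosts.
public class OnePerHostBatcher {
    // Take at most one URL per host from the frontier; requeue the rest.
    static List<String> nextBatch(Deque<String> frontier) {
        Set<String> seenHosts = new HashSet<>();
        List<String> batch = new ArrayList<>();
        List<String> deferred = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String host = URI.create(url).getHost();
            if (seenHosts.add(host)) {
                batch.add(url);        // first URL for this host in the batch
            } else {
                deferred.add(url);     // same host again: wait for a later batch
            }
        }
        frontier.addAll(deferred);     // requeue deferred URLs for later rounds
        return batch;
    }

    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>(List.of(
            "https://a.com/1", "https://a.com/2", "https://b.com/1"));
        System.out.println(nextBatch(frontier)); // [https://a.com/1, https://b.com/1]
        System.out.println(nextBatch(frontier)); // [https://a.com/2]
    }
}
```

The time between two batches then plays the role of the politeness delay for any single host, which is exactly why the explicit `sleep()` could be dropped under this scheme.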
Hi! HNY!
I have finished a complete rework of the project because there were several issues related to:
A.