dungtruongtien / kewe-crawler


[Question] Why does the application need to be architected in a complex way? #20

Open longnd opened 7 months ago

longnd commented 7 months ago

Issue

The requirements for this code challenge do not require a sophisticated architecture, so I would like to understand your rationale for the application design, which is quite complex to me.

  1. Why did you decide to use RabbitMQ instead of a simple task queue library like BullJS? Given that both the API service and the Crawler service are Express applications, and you are already using Redis for caching, isn't a task queue library like https://github.com/OptimalBits/bull enough for the need? (See the Bull sketch at the end of this comment.)

  2. Why is it necessary to separate the crawler into another service (standalone application)? This decision requires more effort to maintain the codebase, as some parts are duplicated in both the API and crawler services (e.g. the models, the connections to the queue). Why aren't they combined into a single service (e.g. the API service alone would be enough)?

Similar to the file service, it has limited benefits. Isn't storing the HTML content in a column of the keyword table enough, given that the HTML of a first search results page is not that large?
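
For illustration, this is roughly what I had in mind with Bull, reusing the existing Redis instance. This is only a sketch, not code from this repo; the queue name, Redis URL, and the `crawlKeyword` helper are placeholders:

```typescript
// Rough sketch only: Bull reuses the Redis instance already used for caching,
// so no separate message broker needs to be operated.
import Queue from 'bull';

// Hypothetical queue name and Redis URL.
const crawlQueue = new Queue('keyword-crawl', 'redis://127.0.0.1:6379');

// Producer side (API layer): enqueue a keyword to be crawled.
export async function enqueueKeyword(keywordId: number, keyword: string): Promise<void> {
  await crawlQueue.add(
    { keywordId, keyword },
    { attempts: 3, backoff: { type: 'exponential', delay: 5000 } },
  );
}

// Consumer side (could live in the same Express app): process crawl jobs.
crawlQueue.process(async (job) => {
  const { keywordId, keyword } = job.data;
  await crawlKeyword(keywordId, keyword); // placeholder for the actual crawling logic
});

// Placeholder so the sketch is self-contained.
async function crawlKeyword(keywordId: number, keyword: string): Promise<void> {
  console.log(`crawling #${keywordId}: ${keyword}`);
}
```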

dungtruongtien commented 7 months ago

> Similar to the file service, it has limited benefits...

Yes, if we only use the HTML for the listing API and limit the rows per page to a small number (around 10, maximum 20), then storing the raw HTML directly in the database would be fine.

But I think in the future we may want reporting or analytics on the crawled keywords, and HTML stored directly in the database might affect performance. That's why I stored the HTML content in another service and only kept the file path in the database.
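
Roughly, the write path looks like the sketch below. It is simplified, not the exact code in this repo: a local directory stands in for the File Service, and `keywordRepository` stands in for the real model layer.

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';

// Hypothetical storage location standing in for the File Service.
const HTML_DIR = '/var/data/crawled-html';

// Placeholder data-access object standing in for the real model layer.
const keywordRepository = {
  async update(id: number, fields: { htmlFilePath: string }): Promise<void> {
    // e.g. UPDATE keywords SET html_file_path = $1 WHERE id = $2
  },
};

export async function saveCrawledHtml(keywordId: number, html: string): Promise<string> {
  const filePath = path.join(HTML_DIR, `${keywordId}.html`);
  await fs.mkdir(HTML_DIR, { recursive: true });
  await fs.writeFile(filePath, html, 'utf8');

  // Only the path is stored on the keyword row, so the table stays small.
  await keywordRepository.update(keywordId, { htmlFilePath: filePath });
  return filePath;
}
```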

dungtruongtien commented 7 months ago

> 2. Why is it necessary to separate the crawler into another service (standalone application)?

I think Google detects spamming requests based on IP. So my solution is meant to increase the throughput of the crawling process: we can scale to multiple crawler instances and deploy them onto other machines with different IP addresses (or use a VPN, etc.). Separating the crawler lets us focus only on how to scale the crawling function.

This might not be the best solution; in my opinion it is still too complicated. I should take some time to research how Google detects a spamming request (based on IP, requests per unit of time, or something else) and develop another crawling algorithm for the Crawler Service before thinking about scaling it up to multiple instances.

dungtruongtien commented 7 months ago

> 1. Why did you decide to use RabbitMQ instead of a simple task queue...

Thanks for this question. I'd love to share why I use RabbitMQ.

Firstly, I'll share with you the problem and why I used a message queue to solve it.

Requirement: crawl search results from Google (without using the Google API or any third-party service).

Solution 1: I used Axios in the NodeJS server to fetch the HTML response from the Google search page, and Cheerio to process the HTML response as a DOM and extract the needed information.
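
In simplified form, Solution 1 looked roughly like this (a sketch, not the exact code in this repo; the User-Agent value and the `h3` selector are illustrative and may not match Google's current markup):

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

// Sketch of the first approach: fetch the result page and parse it as static HTML.
export async function crawlWithAxios(keyword: string): Promise<{ titles: string[]; rawHtml: string }> {
  const { data: rawHtml } = await axios.get<string>('https://www.google.com/search', {
    params: { q: keyword },
    // A real browser User-Agent string would go here.
    headers: { 'User-Agent': 'Mozilla/5.0' },
  });

  const $ = cheerio.load(rawHtml);
  // Result titles are usually rendered as <h3> elements in the static HTML.
  const titles = $('h3')
    .map((_, el) => $(el).text())
    .get();

  return { titles, rawHtml };
}
```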

Problem with Solution 1:

  * After crawling about 10 times in a row without any delay, Google marked my service as spam and responded with a Captcha page.
  * With Axios and Cheerio I only receive raw HTML, so I can't get the statistical results (e.g. "About 1,880,000 results (0.51 seconds)"), because I think these statistics are appended to the DOM with JavaScript after the DOM has loaded.

That is why I chose another approach for this requirement:

Solution 2: The API service publishes the keyword to a RabbitMQ queue, and a separate Crawler Service consumes it. The Crawler Service uses Puppeteer to act like a browser visiting the Google page; after the page has loaded, I use plain JavaScript to get all the needed information and store it.
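
A simplified sketch of the Crawler Service side is below. It is not the exact code in this repo: the queue name, the `#result-stats` selector, and the `storeResult` helper are placeholders.

```typescript
import * as amqp from 'amqplib';
import puppeteer from 'puppeteer';

const QUEUE_NAME = 'keyword-crawl'; // hypothetical queue name

// Simplified consumer: the API service publishes keywords to RabbitMQ and the
// Crawler Service consumes them, crawling with a real (headless) browser.
export async function startCrawlerConsumer(): Promise<void> {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue(QUEUE_NAME, { durable: true });

  await channel.consume(QUEUE_NAME, async (msg) => {
    if (!msg) return;
    const { keyword } = JSON.parse(msg.content.toString());

    const browser = await puppeteer.launch({ headless: true });
    try {
      const page = await browser.newPage();
      await page.goto(`https://www.google.com/search?q=${encodeURIComponent(keyword)}`, {
        waitUntil: 'networkidle2',
      });

      // '#result-stats' is illustrative; Google's markup changes over time.
      const stats = await page.evaluate(
        () => document.querySelector('#result-stats')?.textContent ?? '',
      );
      const html = await page.content();

      await storeResult(keyword, stats, html); // placeholder for persistence
      channel.ack(msg);
    } catch (err) {
      channel.nack(msg, false, false); // drop the message on failure in this sketch
    } finally {
      await browser.close();
    }
  });
}

// Placeholder so the sketch is self-contained.
async function storeResult(keyword: string, stats: string, html: string): Promise<void> {
  console.log(`crawled "${keyword}": ${stats} (${html.length} bytes of HTML)`);
}
```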

Pros and cons of this solution:

Pros

Cons

longnd commented 7 months ago

Thank you for the answers, let me address them one by one.

> Yes, if we only use the HTML for the listing API and limit the rows per page to a small number (around 10, maximum 20), then storing the raw HTML directly in the database would be fine.

Paging shouldn't be the blocker for storing HTML content inside the table, as we don't need to show the entire HTML content on each row; it can be fetched when the user decides to view it, e.g. by opening it on another page, rendering it in a modal with an iframe, etc., each of which will request the content from the upstream.
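
For example, something along these lines would be enough (a minimal sketch; the route and the data-access helper are hypothetical):

```typescript
import express from 'express';

const app = express();

// Hypothetical route: the listing endpoint never returns raw HTML; the content
// is only fetched when the user opens a single keyword.
app.get('/keywords/:id/html', async (req, res) => {
  const html = await getKeywordHtml(Number(req.params.id));
  if (html === null) {
    res.status(404).send('Not found');
    return;
  }
  res.type('html').send(html);
});

// Placeholder data access, standing in for the real model layer,
// e.g. SELECT html FROM keywords WHERE id = $1.
async function getKeywordHtml(id: number): Promise<string | null> {
  return null;
}

app.listen(3000);
```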

> But I think in the future we may want reporting or analytics on the crawled keywords, and HTML stored directly in the database might affect performance.

The rationale makes sense; however, there are better approaches, e.g. storing the file on cloud storage (S3, for example). Running a dedicated file service has an operational cost (maintaining the service and paying for the server).
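
For instance, an upload along these lines (a sketch; the region, bucket name, and key layout are made up):

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

// Sketch only: region, bucket name, and key layout are hypothetical.
const s3 = new S3Client({ region: 'ap-southeast-1' });
const BUCKET = 'my-crawler-html';

export async function uploadHtmlToS3(keywordId: number, html: string): Promise<string> {
  const key = `keywords/${keywordId}.html`;
  await s3.send(
    new PutObjectCommand({
      Bucket: BUCKET,
      Key: key,
      Body: html,
      ContentType: 'text/html',
    }),
  );
  // Persist only this key (or the resulting URL) on the keyword row.
  return key;
}
```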

> I think Google detects spamming requests based on IP. So my solution is meant to increase the throughput of the crawling process: we can scale to multiple crawler instances and deploy them onto other machines with different IP addresses (or use a VPN, etc.). Separating the crawler lets us focus only on how to scale the crawling function.

If the crawler is part of the API service, it can still be deployed as multiple instances, which is aligned with your expectations. As mentioned above, having more services increases the maintenance cost and also requires duplicating some code. And to work around Google's rate limiting, besides rotating the IPs there are other solutions, e.g. rotating the user agents, which is simpler to implement; I hope you consider that :)
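
A minimal sketch of what rotating user agents could look like (the UA strings are just examples; a larger, regularly refreshed pool works better):

```typescript
import axios from 'axios';

// Illustrative sketch of rotating the User-Agent header between requests.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

let counter = 0;
function nextUserAgent(): string {
  const ua = USER_AGENTS[counter % USER_AGENTS.length];
  counter += 1;
  return ua;
}

export async function fetchSearchPage(keyword: string): Promise<string> {
  const { data } = await axios.get<string>('https://www.google.com/search', {
    params: { q: keyword },
    headers: { 'User-Agent': nextUserAgent() },
  });
  return data;
}
```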

> After crawling about 10 times in a row without any delay, Google marked my service as spam and responded with a Captcha page.

This is the core part of the requirements and requires candidates (like you) to overcome these challenges from Google. As mentioned, rotating the User-Agent is one of the possible ways.

> With Axios and Cheerio I only receive raw HTML, so I can't get the statistical results (e.g. "About 1,880,000 results (0.51 seconds)"), because I think these statistics are appended to the DOM with JavaScript after the DOM has loaded.

I suspect you did not set it up correctly. I tried running a curl command with a proper User-Agent to send the request to Google and I received the expected result. You can even try it with Postman to see :)

> The Crawler Service uses Puppeteer to act like a browser visiting the Google page; after the page has loaded, I use plain JavaScript to get all the needed information and store it.

Using Puppeteer isn't wrong, it just isn't as performant as making a plain HTTP request with a library like Axios, and it consumes more resources (as it requires running a headless Chromium browser).

longnd commented 7 months ago

Thank you for all of the replies :)