21jake / nimble-scraper


[Question] Why not use a background job for the scraping process? #35

Open longnd opened 1 year ago

longnd commented 1 year ago

Issue

Delegating the scraping of the search results to another service so it runs asynchronously is a good technical decision and provides a good user experience: https://github.com/21jake/nimble-scraper/blob/f0673eb1420deabbd48ad2491bc10633254e908a/backend/src/services/file.service.ts#L51-L56

https://github.com/21jake/nimble-scraper/blob/f0673eb1420deabbd48ad2491bc10633254e908a/backend/src/services/scraper.service.ts#L28-L29

But I would like to know why you decided to use Events to handle it (1):

Instead of using a background job (2) by:

Because the 1st solution has some limitations:

21jake commented 1 year ago

The main reason I picked EventEmitter is that it's easier to set up, with just one emitter and one subscriber.

In comparison, a queue mechanism would require another DB-like service (such as Redis) and more configuration for queue management. IMO, the extra setup overhead doesn't justify the potential benefits it provides, not to mention that it might not be the better solution.

1.

if the current process crash, all events are gone

If a problem arises and causes the process to crash and kill all the events, it's poor error handling and it needs to be fixed whether we use Event Emitter or not.

  1. The Cron service shouldn't be considered a workaround but a main feature. There's always a small chance of failure however we build the scraping process, and we can't assume that every search request will be successful. The cron is there to pick up and retry until every keyword is scraped.

  2. The sleep() function is bad practice, but I believe in this [situation](https://github.com/Cuadrix/puppeteer-page-proxy/issues/76#issue-1213419824) it's acceptable. Given such an issue, I had to either:

    • Fork the library and fix the issue
    • Wait for the issue to be resolved
    • Use a hacky workaround

    Until recently I believed the last option was the best one. I'm now considering trying out a different library.

  3. the job execution can be delayed, thus achieving the same purpose but in a cleaner way.

    I didn't think of this while developing the scraper logic. And again, do you feel like the overhead costs are justified by a bit more syntactic sugar?

  4. You're right that we can add more workers to process multiple jobs at a time when using a background job. But the current scraping process already does the same job (if not better). Both CHUNK_SIZE and WORKER_COUNT are constrained by the number of proxies. In both implementations we need to keep the proxies healthy and not exhausted, and in both the way to increase performance is to add more proxies rather than to overuse them.

    UX is the reason I think the current implementation is better. Suppose we have two users, A and B, each uploading a file of 100 keywords, with B sending the upload request about 10 seconds after A.

    • Using a queue, the workers will try to finish the jobs in FIFO order, which means user B will see their scraped data only after all of user A's 100 keywords have been scraped.
    • With the current implementation, both requests are handled concurrently, which increases overall user satisfaction.
longnd commented 1 year ago

Thank you for the detailed answers. I want to add that:

In comparison, a queue mechanism would require another DB-like service (such as Redis) and more configuration for queue management. IMO, the extra setup overhead doesn't justify the potential benefits it provides

I am not sure how much effort you anticipated for setting it up. Since the application already uses Docker, adding a Redis container should be fairly easy, and NestJS has built-in support for queues such as Bull, so the integration should be seamless.
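For illustration, a minimal sketch of that wiring, assuming the @nestjs/bull package and a Redis container reachable as `redis` inside the Docker network (module and queue names here are illustrative, not taken from the repo):

```typescript
// app.module.ts (sketch) — wire Bull into the existing NestJS app.
import { Module } from '@nestjs/common';
import { BullModule } from '@nestjs/bull';

@Module({
  imports: [
    // Point Bull at the Redis container added via docker-compose.
    BullModule.forRoot({
      redis: { host: 'redis', port: 6379 },
    }),
    // Register a queue that will hold one job per keyword.
    BullModule.registerQueue({ name: 'scraper' }),
  ],
})
export class AppModule {}
```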

1 & 2 & 4. > If a problem arises and causes the process to crash and kill all the events, it's poor error handling and it needs to be fixed whether we use Event Emitter or not.

I would not agree with this. While an unexpected error does need to be fixed when it occurs, the current solution itself isn't reliable. Unlike a queue with background jobs, where the job data is persisted in the queue, the current solution relies on Event Emitter to fire a "New Batch" event: a single event handler then has to handle all keywords in an uploaded file, and those tasks are not persisted anywhere. And if anything goes wrong when processing a chunk, the remaining chunks will not be processed. That is not a concern when using a queue to enqueue each keyword as a separate job, because the jobs are independent of each other. The worker can also retry a failed job without the need for another dedicated service as a backup, like the CronService.
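As a rough sketch of that idea, assuming the Bull queue registered above (service, job, and field names here are hypothetical):

```typescript
// keyword-queue.service.ts (sketch) — one independent, persisted job per keyword.
import { Injectable } from '@nestjs/common';
import { InjectQueue, Processor, Process } from '@nestjs/bull';
import { Queue, Job } from 'bull';

@Injectable()
export class KeywordQueueService {
  constructor(@InjectQueue('scraper') private readonly scraperQueue: Queue) {}

  async enqueueKeywords(keywordIds: number[]): Promise<void> {
    // Each keyword becomes its own job, persisted in Redis; a failure in one
    // does not prevent the others from being processed, and Bull retries it.
    await Promise.all(
      keywordIds.map((id) =>
        this.scraperQueue.add(
          'scrape-keyword',
          { keywordId: id },
          { attempts: 3, backoff: { type: 'exponential', delay: 5000 } },
        ),
      ),
    );
  }
}

@Processor('scraper')
export class ScraperProcessor {
  @Process('scrape-keyword')
  async handle(job: Job<{ keywordId: number }>): Promise<void> {
    // Scrape a single keyword here; an uncaught error marks only this job as
    // failed and schedules a retry per the attempts/backoff options above.
  }
}
```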

I believe Events are used to decouple different parts of an application, allowing them to communicate with each other without being tightly coupled. They are useful when a system has many different components that need to interact with each other asynchronously. In this context, on the other hand, the scraping process consists of long-running and resource-intensive tasks, which is exactly what a queue is appropriate for. There are other benefits that are explained clearly in the NestJS queue documentation.

  1. As suggested in https://github.com/21jake/nimble-scraper/issues/37, using another approach is more efficient and can avoid the need for sleep().

  2. With the current implementation, both requests are handled concurrently, which increases overall user satisfaction.

I don't think so. IMO, the current solution is also a weakness in the application architecture and does not provide good UX: by using EventEmitter and an event handler to execute multiple Promises at the same time, the application can only handle a certain number of concurrent requests. This is also pointed out in the README:

Keywords are divided into chunks of three, which means 3 keywords are scraped at a time. Increasing this chunk size leads to more frequent errors

The application also rejects uploads when there are too many in flight:

https://github.com/21jake/nimble-scraper/blob/d6415a4a5e773084269a52c32a58a4a067a7bf46/backend/src/services/file.service.ts#L38-L41

That means the application cannot serve several uploads at the same time. New uploads will be rejected (how can the user feel good about that?).

With a background job this is not a problem: the keywords are enqueued and wait to be processed. Even when the number of uploads increases, the application can still serve them all. Scraping is a long-running process, so the status of each keyword ("Processing", "Completed") gives the users enough information.

Note that, by moving the handling of that long-running process into a background job, the application can have a dedicated service instance for the worker (with more hardware resources), isolating it from the handling of normal web requests, so the architecture is more sustainable.
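For example, a minimal sketch of a separate worker entry point, assuming the queue setup above (WorkerModule is a hypothetical module that imports BullModule and the processor, not something in the repo):

```typescript
// worker.ts (sketch) — run the queue consumer as its own process/instance,
// so scraping does not compete with the HTTP server for resources.
import { NestFactory } from '@nestjs/core';
import { WorkerModule } from './worker.module';

async function bootstrap() {
  // Standalone application context: no HTTP listener, only the Bull processors.
  const app = await NestFactory.createApplicationContext(WorkerModule);
  app.enableShutdownHooks(); // let in-flight jobs finish cleanly on shutdown
}
bootstrap();
```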

21jake commented 1 year ago

Thank you for the detailed response.

I am not sure how much effort you anticipated for setting it up.

Extra components

The best code is no code. Extra components mean extra maintenance effort (concurrency, disk management, extra attack surface, a larger troubleshooting zone). In all fairness, setting up Queues won't take that much labour. Every tool has its pros and cons, and the queue pattern is not a Swiss army knife. The question is: is it the solution we're looking for?

You can read all about the benefits of Queues in the NestJS docs or elsewhere on the Internet. The point is, it's the inferior solution for our problem, and I'll explain that again in the last section.

And if anything goes wrong when processing a chunk, the remaining chunks will not be processed.

This is plainly wrong. Every piece of file processing is handled with proper error handling. If anything goes wrong during the scraping of a keyword, the error is guaranteed to be caught and stored in the DB. I understand that a background job somewhat feels "safer" because it's more decoupled/independent and won't interfere with the main thread. However, it all comes down to how we write the code and the actual implementation.
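Roughly the shape of the per-keyword error handling being described (a simplified sketch with hypothetical helpers, not the repo's actual code; see scraper.service.ts for the real implementation):

```typescript
// Sketch: each keyword is scraped inside its own try/catch, so a failure is
// recorded for that keyword only and the loop continues with the next one.
declare function scrapeOne(keyword: string): Promise<string>;               // hypothetical
declare function saveResult(keyword: string, html: string): Promise<void>;  // hypothetical
declare function saveError(keyword: string, error: unknown): Promise<void>; // hypothetical

async function processBatch(keywords: string[]): Promise<void> {
  for (const keyword of keywords) {
    try {
      const html = await scrapeOne(keyword);
      await saveResult(keyword, html);
    } catch (error) {
      // The error is caught and persisted; the remaining keywords still run.
      await saveError(keyword, error);
    }
  }
}
```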

The application can only handle a certain number of concurrent requests ... The application also rejects uploads when there are too many in flight

It's just the trade-off we have to make. You either have:

  1. Limited concurrent uploads, but real-time data feedback. Or
  2. Unlimited concurrent file uploads, but you leave your users hanging.

IMO, the current solution is also a weakness in the application architecture and does not provide good UX ... New uploads will be rejected (how can the user feel good about that?)

I mean, how is that good UX if your user makes an upload and waits a long while before seeing any feedback? Should you give them a warning first and tell them to come back later, or should you let them upload a file - almost like taking a shot in the dark - and stare at a blank screen? And if you were to say, "Well, they can make an upload and come back later to see the feedback", let's get back to the application requirements:

Each of these keywords will be used to search on http://www.google.com and will start running as soon as they are uploaded

I really hope we've received the same technical document. Are we trying to achieve the same goal? Because I feel like this requirement puts much emphasis on being responsive and real-time. For now, we can set aside the extra work of introducing Queues into our system. One question: with just two concurrent uploads, how exactly do we feed back near-immediate scraped data using Queues? Queues do not deliver instant results. Of all the benefits Queues have to offer, I don't see being real-time as one of them.

longnd commented 1 year ago

Thank you for the follow-up :)

Every piece of file processing is handled with proper error handling. If anything goes wrong during the scraping of a keyword, the error is guaranteed to be caught and stored in the DB

IMO, even with proper error handling, there is no guarantee that the application will not crash; one of the most common reasons is running out of memory (or memory leaks). With the updated code, I believe the proper error handling is this part of the code, where the scraping process is wrapped in a try/catch block:

https://github.com/21jake/nimble-scraper/blob/0687d4759b8b7d8440c41f4040bc9c9381102305/backend/src/services/scraper.service.ts#L39-L54

but try/catch is not a silver bullet against the application crashing, and once a crash occurs, new "NEW_BATCH" events will not be handled - i.e. newly uploaded CSVs will not be parsed and scraped. Additionally, with the latest update, the current approach processes keywords in a loop, making the process brittle and inefficient. Failure related to one keyword will stop the processing of the other keywords, so separate asynchronous processes should be used for each keyword, as sketched below. https://github.com/21jake/nimble-scraper/blob/0687d4759b8b7d8440c41f4040bc9c9381102305/backend/src/services/scraper.service.ts#L33-L34
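For example, a simplified sketch (with hypothetical scrapeOne/markFailed helpers) of giving each keyword its own promise so that one failure cannot stop the others:

```typescript
// Sketch: each keyword is scraped in its own promise; Promise.allSettled waits
// for all of them and reports failures individually instead of aborting the chunk.
declare function scrapeOne(keyword: string): Promise<void>;                   // hypothetical
declare function markFailed(keyword: string, reason: unknown): Promise<void>; // hypothetical

async function scrapeChunk(keywords: string[]): Promise<void> {
  const results = await Promise.allSettled(keywords.map((k) => scrapeOne(k)));
  await Promise.all(
    results.map((result, i) =>
      result.status === 'rejected'
        ? markFailed(keywords[i], result.reason) // record this keyword's failure only
        : Promise.resolve(),
    ),
  );
}
```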

It's just the trade-off we have to make. You either have ... Should you give them a warning first and tell them to come back later, or should you let them upload a file

I believe informing the user "The file has been successfully uploaded and is being processed, please check the keyword list to get the processing status" will set the right expectations. They can always check the list to see which keywords are "waiting to process", "processing", or "completed". It is more sustainable and can serve more users, instead of serving only a few and rejecting the rest while the uploaded files are being processed. To improve the UX, the processing status and scraping results of each keyword can always be pushed to the user side in real time, for example as sketched below.
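One possible way to do that real-time push, assuming @nestjs/websockets with socket.io (gateway and event names are illustrative):

```typescript
// keyword-status.gateway.ts (sketch) — push per-keyword status changes to the browser.
import { WebSocketGateway, WebSocketServer } from '@nestjs/websockets';
import { Server } from 'socket.io';

@WebSocketGateway({ cors: true })
export class KeywordStatusGateway {
  @WebSocketServer()
  server: Server;

  // Called by the worker whenever a keyword changes state.
  notifyStatus(keywordId: number, status: 'waiting' | 'processing' | 'completed' | 'failed') {
    this.server.emit('keyword-status', { keywordId, status });
  }
}
```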

I really hope we've received the same technical document

Yes, we're referring to the same technical requirement:

The uploaded file contains keywords. Each of these keywords will be used to search on http://www.google.com/ and will start running as soon as they are uploaded.

The idea is that when a user uploads a CSV file containing keywords, the application should be designed to start processing those keywords as soon as possible, in a non-blocking manner - hence the responsiveness - such that the user can continue to interact with the application while the keyword processing (parsing, searching on Google, and updating the report) is in progress, and that processing should start as soon as the file is uploaded.

-- Given the limited time that should be spent on the code review, we can wrap up this issue. Although we have different opinions on each approach, I respect your technical solutions and decisions. Thank you for openly sharing your thoughts 🙂

21jake commented 1 year ago

Thank you for your response. I know the whole thing is wrapped up but there's something quite crucial I want to add:

1. Failure related to one keyword will not stop the processing of the other keywords.

I agree that try/catch is not a silver bullet, but good error handling is the best we can do, given that a background job wasn't the best option at the time of implementation. As you can see in this test, the batch is handled progressively and any error is stored for further inspection.

https://github.com/21jake/nimble-scraper/blob/3ff525f6bb5b4762e7b689f59d883249ee23ef85/backend/src/utils/scraper.utils.ts#L107-L112

Results:

[Screenshot: scraping results (Screen Shot 2023-03-12 at 09 48 27)]

current approach processes keywords in a loop, making the process brittle and inefficient

A simple for loop helps reduce the risk of traffic spikes and of getting detected. We could always fire all the search requests at the same time, but that would make the proxies more prone to detection. Please understand that in a stealth job, being fast does not equal being efficient.

2. The question of "Does the app require instant responses or not?".

Again, in the requirements:

Each of these keywords [...] will start running as soon as they are uploaded.

And this is how we're trying to interpret it:

When a user uploads [...] the application should be designed to start processing those keywords as soon as possible.

I hope we both see the discrepancy I'm trying to point out, as it explains the difference in our solutions. Please understand that I'm trying to stay as close to the requirements as possible.

If it turns out that the application doesn't require immediate responses, then EventEmitter is the inferior solution compared to Queues.

This is the lengthiest issue so far, and there has been both effort and disagreement here. I'm glad that we have mutual respect for each other's approaches and opinions.