luca-c-xcv opened 2 months ago
For Python, we should decide whether to use virtualenvs in our Docker images. Some useful thoughts:
https://potiuk.com/to-virtualenv-or-not-to-virtualenv-for-docker-this-is-the-question-6f980d753b46
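For context, here is a minimal sketch of the venv-in-image pattern the article weighs. The Python version, venv path, and file names (`requirements.txt`, `main.py`) are placeholders, not decisions:

```dockerfile
# Build stage: create the venv and install dependencies into it.
FROM python:3.11-slim AS builder
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Runtime stage: copy only the populated venv, keeping the image small.
FROM python:3.11-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
CMD ["python", "main.py"]
```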
Still on the topic of Python: for the crawlers we could use a structured tool such as Scrapy, or a combination of requests for the HTTP interaction and BeautifulSoup for the HTML parsing.
In my experience the latter is the more flexible approach, since it lets us handle the few HTTP interactions directly and focus on the HTML parsing, but we should discuss which option fits this project better.
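To make the comparison concrete, a minimal sketch of the requests + BeautifulSoup combination; the URL and the link-extraction logic are illustrative placeholders, not part of the project:

```python
import requests
from bs4 import BeautifulSoup

def fetch_links(url: str) -> list[str]:
    """Fetch a page and return the href of every anchor on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    for link in fetch_links("https://example.com"):
        print(link)
```

The appeal is that both steps stay explicit: we control the HTTP call (headers, timeouts, retries) and the parsing is plain BeautifulSoup code, with no framework lifecycle around it.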
In my opinion, the overall architecture could be built around a shared message queue, exposed as a service from which the other services fetch their work and data.
Digging deeper: the crawler could be implemented as a service with a customizable number of nodes. Each node would parse a specific URL retrieved from the queue. While parsing, if a node identifies a URL that can be used to extract additional data, it should enqueue it for further processing. The abstract concept of the crawler is that of a consumer/producer, where a node can serve both roles.
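An in-process sketch of that consumer/producer node, just to pin down the idea: `queue.Queue` stands in for the shared message queue (a real deployment would use a broker so multiple nodes can share it), and `fetch`/`extract_urls` are hypothetical placeholders for the HTTP and parsing steps above:

```python
import queue
import threading

url_queue: "queue.Queue[str]" = queue.Queue()
seen: set[str] = set()
seen_lock = threading.Lock()

def fetch(url: str) -> str:
    """Placeholder for the HTTP step (e.g. requests.get)."""
    return ""

def extract_urls(html: str) -> list[str]:
    """Placeholder for the parsing step (e.g. BeautifulSoup)."""
    return []

def node(worker_id: int) -> None:
    # Each node is both a consumer (takes a URL from the queue)
    # and a producer (enqueues new URLs found while parsing).
    while True:
        try:
            url = url_queue.get(timeout=5)
        except queue.Empty:
            return  # no work left; a real node would keep waiting
        html = fetch(url)
        for found in extract_urls(html):
            with seen_lock:
                if found in seen:
                    continue
                seen.add(found)
            url_queue.put(found)
        url_queue.task_done()

if __name__ == "__main__":
    url_queue.put("https://example.com")
    workers = [threading.Thread(target=node, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Scaling the crawler would then just mean changing the number of nodes attached to the queue, with the `seen` set replaced by a shared store to deduplicate across nodes.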