luca-c-xcv opened 2 months ago
For Python, we should decide whether to use virtualenvs in our Docker images. Some useful thoughts:
https://potiuk.com/to-virtualenv-or-not-to-virtualenv-for-docker-this-is-the-question-6f980d753b46
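For context, here is a minimal sketch of the venv-in-image pattern the article weighs. The Python version, venv path, and file names (`requirements.txt`, `main.py`) are placeholders, not decisions:

```dockerfile
# Build stage: create the venv and install dependencies into it.
FROM python:3.11-slim AS builder
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Runtime stage: copy only the populated venv, keeping the image small.
FROM python:3.11-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
CMD ["python", "main.py"]
```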
Still on the topic of Python: for the crawlers we could use a structured tool such as Scrapy, or a combination of requests for the HTTP interaction and BeautifulSoup for the HTML parsing.
In my experience the latter is the more flexible approach, since it lets us handle the few HTTP interactions directly and focus on the HTML parsing, but we should discuss which option fits this project better.
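To make the comparison concrete, a minimal sketch of the requests + BeautifulSoup combination; the URL and the link-extraction logic are illustrative placeholders, not part of the project:

```python
import requests
from bs4 import BeautifulSoup

def fetch_links(url: str) -> list[str]:
    """Fetch a page and return the href of every anchor on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    for link in fetch_links("https://example.com"):
        print(link)
```

The appeal is that both steps stay explicit: we control the HTTP call (headers, timeouts, retries) and the parsing is plain BeautifulSoup code, with no framework lifecycle around it.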
In my opinion, the overall architecture could be built around a shared message queue, exposed as a service from which the other services fetch their work and data.
Digging deeper: the crawler could be implemented as a service with a customizable number of nodes. Each node would parse a specific URL retrieved from the queue. While parsing, if a node identifies a URL that can be used to extract additional data, it should enqueue it for further processing. The abstract concept of the crawler is that of a consumer/producer, where a node can serve both roles.
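An in-process sketch of that consumer/producer node, just to pin down the idea: `queue.Queue` stands in for the shared message queue (a real deployment would use a broker so multiple nodes can share it), and `fetch`/`extract_urls` are hypothetical placeholders for the HTTP and parsing steps above:

```python
import queue
import threading

url_queue: "queue.Queue[str]" = queue.Queue()
seen: set[str] = set()
seen_lock = threading.Lock()

def fetch(url: str) -> str:
    """Placeholder for the HTTP step (e.g. requests.get)."""
    return ""

def extract_urls(html: str) -> list[str]:
    """Placeholder for the parsing step (e.g. BeautifulSoup)."""
    return []

def node(worker_id: int) -> None:
    # Each node is both a consumer (takes a URL from the queue)
    # and a producer (enqueues new URLs found while parsing).
    while True:
        try:
            url = url_queue.get(timeout=5)
        except queue.Empty:
            return  # no work left; a real node would keep waiting
        html = fetch(url)
        for found in extract_urls(html):
            with seen_lock:
                if found in seen:
                    continue
                seen.add(found)
            url_queue.put(found)
        url_queue.task_done()

if __name__ == "__main__":
    url_queue.put("https://example.com")
    workers = [threading.Thread(target=node, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Scaling the crawler would then just mean changing the number of nodes attached to the queue, with the `seen` set replaced by a shared store to deduplicate across nodes.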