Adeeshaj / Carvestor-Scraper

MIT License

Scraper - research technologies #1

Closed Adeeshaj closed 8 months ago

Adeeshaj commented 8 months ago

Choose the programming language and the libraries or frameworks for web scraping. Common choices include Python with BeautifulSoup, Scrapy, or Selenium. Set up the development environment, including installing the necessary libraries.

Adeeshaj commented 8 months ago

Programming Language - Python

Rich Ecosystem: Python has a rich ecosystem of libraries and frameworks that make web scraping easier and more efficient. Two widely used libraries are BeautifulSoup and Scrapy, which provide tools for parsing HTML and handling web scraping tasks.

Ease of Use: Python's clean and readable syntax makes it easy to write and understand code, which is especially helpful when working with web scraping, where you need to manipulate and parse HTML or other structured data formats.

Large Community: Python has a large and active community, which means you can find plenty of resources, tutorials, and support when you run into problems or have questions related to web scraping.

Third-Party Packages: Python offers a wide range of third-party packages for tasks like making HTTP requests (e.g., requests), handling data (e.g., pandas), and even for solving more complex problems using machine learning (e.g., scikit-learn).

Cross-Platform Compatibility: Python is available on various platforms (Windows, macOS, and Linux), making it a versatile choice for web scraping on different operating systems.

Web Framework Integration: Python web frameworks like Django and Flask can be used to build web applications that incorporate web scraping functionality, making it easier to interact with and display the scraped data.

Data Processing and Analysis: If your scraping project is part of a broader data analysis pipeline, Python is an excellent choice since it offers strong data processing and analysis libraries such as NumPy, pandas, and Jupyter for interactive data exploration.
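As a small illustration of that ecosystem, here is a minimal sketch that fetches a page with requests and hands records to pandas for exploration. The URL and the example record are placeholders for illustration, not the project's real targets.

```python
import requests
import pandas as pd

# Placeholder URL; a real listing page would be substituted here.
response = requests.get("https://example.com/listings", timeout=10)
response.raise_for_status()  # fail fast on HTTP errors

# Scraped records are typically collected as dicts and handed to pandas
# for cleaning and analysis; this single row is illustrative only.
rows = [{"title": "Example item", "price": 100}]
df = pd.DataFrame(rows)
print(df.head())
```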

Adeeshaj commented 8 months ago

Library to Scrape - BeautifulSoup

BS = BeautifulSoup, SP = Scrapy, NR = Neither

| Aspect | BeautifulSoup | Scrapy | Preferred |
| --- | --- | --- | --- |
| Ease of Use | Simple and easy for beginners | Requires a steeper learning curve | BS |
| HTML Parsing | Excels at parsing HTML documents | Supports parsing structured data | BS |
| Flexibility | Highly flexible, ideal for custom solutions | Offers a structured framework | BS |
| Simplicity | Suitable for single-page or simple scraping tasks | Built for large-scale projects | BS |
| Custom Parsing Logic | Allows custom parsing logic using Python | Provides various built-in features | BS |
| Scalability | Limited scalability for large-scale projects | Designed for large-scale scraping | SP |
| Efficiency | Single-threaded, suitable for smaller tasks | Offers performance optimizations | SP |
| Crawling | Not designed for web crawling | Ideal for crawling multiple pages and following links | SP |
| Middleware | Limited or no built-in middleware | Offers a middleware system for customization | NR |
| Item Pipelines | No built-in pipelines for processing and storing data | Provides item pipelines for data processing | SP |
| Built-In Features | Few built-in features | Provides solutions for common scraping challenges | SP |

Both have the same number of pros, but since BeautifulSoup is simpler and easier for beginners, I propose BeautifulSoup for the initial project. A minimal usage sketch is shown below.
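This sketch parses a single page with BeautifulSoup. The URL and the CSS selectors (`.listing`, `.title`, `.price`) are assumptions for illustration, not the real site's markup.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real listing pages would go here.
response = requests.get("https://example.com/listings", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical selectors: pull a title and price out of each listing card.
for card in soup.select(".listing"):
    title = card.select_one(".title")
    price = card.select_one(".price")
    if title and price:
        print(title.get_text(strip=True), price.get_text(strip=True))
```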

Adeeshaj commented 8 months ago

Data Storage - PostgreSQL

  1. Local Files - good
  2. Databases - good
  3. Cloud storage services like AWS S3, Google Cloud Storage, or Azure Blob Storage - good
  4. Data warehousing solutions like Amazon Redshift or Google BigQuery - good, but aimed at large scale; may consider later
  5. APIs or Web Services - not relevant; this is for real-time data extraction
  6. Data Processing Pipelines - not relevant; this is for real-time data extraction
  7. Web Scraping Frameworks' Storage - this applies to Scrapy

Cloud storage also stores data as files, so the comparison below is between local files and databases.

F = Files, DB = Database

| Aspect | Storing Data in Local Files | Storing Data in Databases | Preferred |
| --- | --- | --- | --- |
| Ease of Setup | Simple setup, minimal configuration required | More complex setup, database configuration | F |
| Querying Capabilities | Limited querying capabilities | Powerful querying with SQL | DB |
| Data Integrity | Prone to data integrity and consistency issues | Enforces data integrity and consistency | DB |
| Scalability | Not suitable for large-scale projects | Scalable for large datasets | DB |
| Resource Usage | Low resource usage | More resource-intensive, higher usage | F |
| Cost | Cost-effective (no additional hosting costs) | Associated hosting and maintenance costs | F |
| Concurrency Control | Concurrency issues with multiple processes | Concurrency control for multiple users | DB |

There are more pros for the database option, so I chose a database.

MySQL:

MySQL is a popular open-source relational database management system. It is known for its performance, stability, and ease of use. Suitable for projects with structured data that require SQL querying and ACID compliance.

PostgreSQL:

PostgreSQL is another open-source relational database system. It's highly extensible and offers advanced features like JSON support, full-text search, and spatial data types. Ideal for projects that need advanced data modeling and querying capabilities.

SQLite:

SQLite is a self-contained, serverless, and zero-configuration database engine. It's lightweight and easy to use, making it suitable for small-scale projects and embedded applications. Useful for prototyping and simple data storage needs.

MongoDB:

MongoDB is a popular NoSQL database that uses a document-oriented data model. It's ideal for unstructured or semi-structured data and allows for flexible schema design. Suitable for projects with rapidly evolving or complex data structures.

Redis:

Redis is an in-memory data store that excels at caching and high-speed data retrieval. It's often used for real-time data and session management. Useful for projects that require low-latency data access and high-throughput read operations.

Elasticsearch:

Elasticsearch is designed for full-text search and real-time analytics. It's often used to index and search large volumes of text-based data. Ideal for projects focused on searching and indexing textual data.

Amazon DynamoDB:

DynamoDB is a managed NoSQL database provided by AWS. It is designed for high scalability and can handle large datasets with high read and write throughput. Suitable for projects hosted on AWS that need scalability.

Google Cloud Bigtable:

Bigtable is a NoSQL database service by Google Cloud. It's designed for large analytical and operational workloads. Ideal for projects that require high performance and scalability on Google Cloud.

Cassandra:

Apache Cassandra is a distributed NoSQL database that is highly scalable and fault-tolerant. It's suitable for big data and time-series applications. Useful for projects with a large amount of data distributed across multiple nodes.

HBase:

HBase is an open-source, distributed NoSQL database modeled after Google Bigtable. It's designed for large-scale, sparse, structured data. Suitable for projects that require fast, random read/write access.

From the options above I shortlisted MongoDB and PostgreSQL. If the data is structured but its structure may change over time while remaining mostly stable, a document database like MongoDB or a relational database like PostgreSQL is a good choice. Both offer flexibility in data modeling and can adapt to changes in the data structure.

Use MongoDB if your data is predominantly unstructured or semi-structured and changes frequently. MongoDB is an excellent choice when your analysis involves handling diverse, evolving data formats.

Use PostgreSQL if your data, although sometimes changing, retains a significant degree of structure and consistency. PostgreSQL is a powerful choice for data analysis that involves complex querying and relational operations.

Because the data structure does not change frequently, I chose PostgreSQL. A minimal storage sketch follows.
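This sketch shows how scraped records could be written to PostgreSQL using the psycopg2 driver. The connection settings and the `listings` table schema are placeholders assumed for illustration, not the project's final design.

```python
import psycopg2

# Placeholder connection settings; real credentials would come from configuration.
conn = psycopg2.connect(
    host="localhost", dbname="carvestor", user="scraper", password="secret"
)

with conn, conn.cursor() as cur:
    # Hypothetical schema for scraped listings.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS listings (
            id SERIAL PRIMARY KEY,
            title TEXT NOT NULL,
            price NUMERIC,
            scraped_at TIMESTAMPTZ DEFAULT now()
        )
        """
    )
    # Insert one illustrative record; the `with conn` block commits on success.
    cur.execute(
        "INSERT INTO listings (title, price) VALUES (%s, %s)",
        ("Example item", 100),
    )

conn.close()
```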