Adeeshaj / Carvestor-Scraper

MIT License

Scraper - research technologies #1

Closed Adeeshaj closed 8 months ago

Adeeshaj commented 8 months ago

Choose the programming language and the libraries or frameworks for web scraping. Common choices include Python with BeautifulSoup, Scrapy, or Selenium. Set up the development environment, including installing the necessary libraries.

Adeeshaj commented 8 months ago

Programming Language - Python

Rich Ecosystem: Python has a rich ecosystem of libraries and frameworks that make web scraping easier and more efficient. Two widely used libraries are BeautifulSoup and Scrapy, which provide tools for parsing HTML and handling web scraping tasks.

Ease of Use: Python's clean and readable syntax makes it easy to write and understand code, which is especially helpful when working with web scraping, where you need to manipulate and parse HTML or other structured data formats.

Large Community: Python has a large and active community, which means you can find plenty of resources, tutorials, and support when you run into problems or have questions related to web scraping.

Third-Party Packages: Python offers a wide range of third-party packages for tasks like making HTTP requests (e.g., requests), handling data (e.g., pandas), and even for solving more complex problems using machine learning (e.g., scikit-learn).

Cross-Platform Compatibility: Python is available on various platforms (Windows, macOS, and Linux), making it a versatile choice for web scraping on different operating systems.

Web Framework Integration: Python web frameworks like Django and Flask can be used to build web applications that incorporate web scraping functionality, making it easier to interact with and display the scraped data.

Data Processing and Analysis: If your scraping project is part of a broader data analysis pipeline, Python is an excellent choice since it offers strong data processing and analysis libraries such as NumPy, pandas, and Jupyter for interactive data exploration.
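As a small illustration of that ecosystem, here is a minimal sketch that fetches a page with requests and hands records to pandas for exploration. The URL and the example record are placeholders for illustration, not the project's real targets.

```python
import requests
import pandas as pd

# Placeholder URL; a real listing page would be substituted here.
response = requests.get("https://example.com/listings", timeout=10)
response.raise_for_status()  # fail fast on HTTP errors

# Scraped records are typically collected as dicts and handed to pandas
# for cleaning and analysis; this single row is illustrative only.
rows = [{"title": "Example item", "price": 100}]
df = pd.DataFrame(rows)
print(df.head())
```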

Adeeshaj commented 8 months ago

Library to Scrape - BeautifulSoup

BS = BeautifulSoup, SP = Scrapy, NR = Neither

| Aspect | BeautifulSoup | Scrapy | Preferred |
| --- | --- | --- | --- |
| Ease of Use | Simple and easy for beginners | Requires a steeper learning curve | BS |
| HTML Parsing | Excels at parsing HTML documents | Supports parsing structured data | BS |
| Flexibility | Highly flexible, ideal for custom solutions | Offers a structured framework | BS |
| Simplicity | Suitable for single-page or simple scraping tasks | Built for large-scale projects | BS |
| Custom Parsing Logic | Allows custom parsing logic using Python | Provides various built-in features | BS |
| Scalability | Limited scalability for large-scale projects | Designed for large-scale scraping | SP |
| Efficiency | Single-threaded, suitable for smaller tasks | Offers performance optimizations | SP |
| Crawling | Not designed for web crawling | Ideal for crawling multiple pages and following links | SP |
| Middleware | Limited or no built-in middleware | Offers a middleware system for customization | NR |
| Item Pipelines | No built-in pipelines for processing and storing data | Provides item pipelines for data processing | SP |
| Built-In Features | Few built-in features | Provides solutions for common scraping challenges | SP |

Both have the same number of pros, but since BeautifulSoup is simpler and easier for beginners, I propose BeautifulSoup for the initial project. A minimal usage sketch is shown below.
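This sketch parses a single page with BeautifulSoup. The URL and the CSS selectors (`.listing`, `.title`, `.price`) are assumptions for illustration, not the real site's markup.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real listing pages would go here.
response = requests.get("https://example.com/listings", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical selectors: pull a title and price out of each listing card.
for card in soup.select(".listing"):
    title = card.select_one(".title")
    price = card.select_one(".price")
    if title and price:
        print(title.get_text(strip=True), price.get_text(strip=True))
```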

Adeeshaj commented 8 months ago

Data Storage - PostgreSQL

  1. Local Files - good
  2. Databases - good
  3. Cloud storage services like AWS S3, Google Cloud Storage, or Azure Blob Storage - good
  4. Data warehousing solutions like Amazon Redshift or Google BigQuery - good, but aimed at large scale; may consider later
  5. APIs or Web Services - not relevant; this is for real-time data extraction
  6. Data Processing Pipelines - not relevant; this is for real-time data extraction
  7. Web Scraping Frameworks' Storage - this applies to Scrapy

Cloud storage also stores data as files, so the comparison below is between local files and databases.

F = Files, DB = Database

| Aspect | Storing Data in Local Files | Storing Data in Databases | Preferred |
| --- | --- | --- | --- |
| Ease of Setup | Simple setup, minimal configuration required | More complex setup, database configuration | F |
| Querying Capabilities | Limited querying capabilities | Powerful querying with SQL | DB |
| Data Integrity | Prone to data integrity and consistency issues | Enforces data integrity and consistency | DB |
| Scalability | Not suitable for large-scale projects | Scalable for large datasets | DB |
| Resource Usage | Low resource usage | More resource-intensive, higher usage | F |
| Cost | Cost-effective (no additional hosting costs) | Associated hosting and maintenance costs | F |
| Concurrency Control | Concurrency issues with multiple processes | Concurrency control for multiple users | DB |

There are more pros for the database option, so I chose a database.

MySQL:

MySQL is a popular open-source relational database management system. It is known for its performance, stability, and ease of use. Suitable for projects with structured data that require SQL querying and ACID compliance.

PostgreSQL:

PostgreSQL is another open-source relational database system. It's highly extensible and offers advanced features like JSON support, full-text search, and spatial data types. Ideal for projects that need advanced data modeling and querying capabilities.

SQLite:

SQLite is a self-contained, serverless, and zero-configuration database engine. It's lightweight and easy to use, making it suitable for small-scale projects and embedded applications. Useful for prototyping and simple data storage needs.

MongoDB:

MongoDB is a popular NoSQL database that uses a document-oriented data model. It's ideal for unstructured or semi-structured data and allows for flexible schema design. Suitable for projects with rapidly evolving or complex data structures.

Redis:

Redis is an in-memory data store that excels at caching and high-speed data retrieval. It's often used for real-time data and session management. Useful for projects that require low-latency data access and high-throughput read operations.

Elasticsearch:

Elasticsearch is designed for full-text search and real-time analytics. It's often used to index and search large volumes of text-based data. Ideal for projects focused on searching and indexing textual data.

Amazon DynamoDB:

DynamoDB is a managed NoSQL database provided by AWS. It is designed for high scalability and can handle large datasets with high read and write throughput. Suitable for projects hosted on AWS that need scalability.

Google Cloud Bigtable:

Bigtable is a NoSQL database service by Google Cloud. It's designed for large analytical and operational workloads. Ideal for projects that require high performance and scalability on Google Cloud.

Cassandra:

Apache Cassandra is a distributed NoSQL database that is highly scalable and fault-tolerant. It's suitable for big data and time-series applications. Useful for projects with a large amount of data distributed across multiple nodes.

HBase:

HBase is an open-source, distributed NoSQL database modeled after Google Bigtable. It's designed for large-scale, sparse, structured data. Suitable for projects that require fast, random read/write access.

From the options above I shortlisted MongoDB and PostgreSQL. If the data is structured but its structure may change over time while remaining mostly stable, a document database like MongoDB or a relational database like PostgreSQL is a good choice. Both offer flexibility in data modeling and can adapt to changes in the data structure.

Use MongoDB if your data is predominantly unstructured or semi-structured and changes frequently. MongoDB is an excellent choice when your analysis involves handling diverse, evolving data formats.

Use PostgreSQL if your data, although sometimes changing, retains a significant degree of structure and consistency. PostgreSQL is a powerful choice for data analysis that involves complex querying and relational operations.

Because the data structure does not change frequently, I chose PostgreSQL. A minimal storage sketch follows.
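This sketch shows how scraped records could be written to PostgreSQL using the psycopg2 driver. The connection settings and the `listings` table schema are placeholders assumed for illustration, not the project's final design.

```python
import psycopg2

# Placeholder connection settings; real credentials would come from configuration.
conn = psycopg2.connect(
    host="localhost", dbname="carvestor", user="scraper", password="secret"
)

with conn, conn.cursor() as cur:
    # Hypothetical schema for scraped listings.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS listings (
            id SERIAL PRIMARY KEY,
            title TEXT NOT NULL,
            price NUMERIC,
            scraped_at TIMESTAMPTZ DEFAULT now()
        )
        """
    )
    # Insert one illustrative record; the `with conn` block commits on success.
    cur.execute(
        "INSERT INTO listings (title, price) VALUES (%s, %s)",
        ("Example item", 100),
    )

conn.close()
```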