dlubom / Polish-Cave-Data-Scraper

A Python scraper for the Central Geological Database of Polish Caves (CBDG), gathering detailed information on Polish caves, including geolocation, morphology, environmental data, and historical descriptions, along with graphic attachments, to support research and conservation.

Polish-Cave-Data-Scraper

Overview

Polish-Cave-Data-Scraper is a Python tool that scrapes data on Polish caves from the Central Geological Database of Polish Caves (CBDG), managed by the Polish Society for Friends of Earth Sciences (PTPNoZ). It gathers standardized information, including geolocation, morphology, environmental data, historical descriptions, and graphic attachments such as plans, sections, and photographs. The resulting dataset is intended for researchers, conservationists, and speleologists interested in the geological and environmental aspects of Polish caves.
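
To make the categories above concrete, here is a minimal sketch of how a single cave entry could be held in memory. The `CaveRecord` class and all of its field names are hypothetical and chosen only to mirror the categories listed in this overview; they are not the project's actual schema or output format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CaveRecord:
    """Hypothetical container for one cave entry (illustrative field names only)."""

    name: str
    # Geolocation; None when the source entry has no coordinates.
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    # Free-text sections, one per category named in the overview.
    morphology: str = ""
    environment: str = ""
    history: str = ""
    # Paths or URLs of graphic attachments (plans, sections, photographs).
    attachments: List[str] = field(default_factory=list)

    def has_location(self) -> bool:
        """Return True only when both coordinates are present."""
        return self.latitude is not None and self.longitude is not None
```

Keeping the attachments as a separate list, rather than embedding them in the text fields, is one possible design choice that would let the images be downloaded or skipped independently of the descriptive data.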

Requirements

Installation

  1. First, ensure you have Poetry installed on your system. If not, install it using:

    curl -sSL https://install.python-poetry.org | python3 -

  2. Clone the repository:

    git clone https://github.com/dlubom/Polish-Cave-Data-Scraper.git
    cd Polish-Cave-Data-Scraper

  3. Install project dependencies using Poetry:

    poetry install

Creating a Clean Environment

To ensure a clean environment for the project:

  1. Remove any existing virtual environment (if present):

    poetry env remove python

  2. Clear Poetry's cache (optional):

    poetry cache clear . --all

  3. Create a new virtual environment and install dependencies:

    poetry install

Usage

The scraper consists of two main scripts that should be run in sequence:

  1. First, run the data fetching script:

    poetry run python fetch.py

    This script collects raw data from the CBDG database.

  2. Then, run the parsing script:

    poetry run python parse.py

    This script processes the collected data into a structured format.
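
Because the two scripts must run in order, the sequence can be wrapped in a small helper that stops as soon as one step fails. The following is a minimal sketch, not part of the repository: the function names `build_commands` and `run_pipeline` are hypothetical, and it assumes only that `fetch.py` and `parse.py` sit in the current directory and signal failure with a non-zero exit code.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Script names taken from the Usage section; the order matters.
PIPELINE = ("fetch.py", "parse.py")


def build_commands(scripts: Sequence[str] = PIPELINE,
                   python: str = sys.executable) -> List[List[str]]:
    """Build one command per script, using the current interpreter by default."""
    return [[python, script] for script in scripts]


def run_pipeline(commands: Sequence[Sequence[str]],
                 runner: Callable = subprocess.run) -> int:
    """Run the commands in order and stop at the first non-zero exit code."""
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            # Do not parse if fetching failed (assumes non-zero means failure).
            return result.returncode
    return 0


if __name__ == "__main__":
    sys.exit(run_pipeline(build_commands()))
```

Saved under a name of your choice (for example `run_all.py`, a hypothetical filename), the helper can be started with `poetry run python run_all.py` so that both steps use the project's virtual environment.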