IgorRozani / bulbapedia-scraper

MIT License
cypher neo4j netcore pokemon scraper

Bulbapedia scraper

Scraper to get data from Bulbapedia and convert to a graph database.

The script generated by this project is available in the pokemon-graph project.

Pages utilized

This scraper gets its data from the Bulbapedia pages whose paths are set in the configuration described below.

Project architecture

The scraper is a console application written in C# on .NET Core.

Configuration

To run the project, appsetting.json needs two sections: bulbapediaConfiguration and fileExportConfiguration.

bulbapediaConfiguration

This configuration holds the Bulbapedia URLs and paths needed to read the data. It is mapped to the BulbapediaConfiguration class in the Configurations folder. Its properties are shown in the example below.

fileExportConfiguration

This configuration has the property fileFullPath, which gives the path and file name for the generated script. It is mapped to the FileExportConfiguration class in the Configurations folder.

Example:

{
  "bulbapediaConfiguration": {
    "baseUrl": "https://bulbapedia.bulbagarden.net/w/index.php?title=",
    "baseImageUrl": "https://",
    "pokemonListPath": "List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number",
    "evolutionListPath": "List_of_Pok%C3%A9mon_by_evolution_family",
    "megaEvolutionListPath": "Mega_Evolution",
    "formsListPath": "List_of_Pok%C3%A9mon_with_form_differences"
  },
  "fileExportConfiguration": {
    "fileFullPath": "C:\\temp\\pokemon.cypher"
  }
}
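
As a rough illustration of how the two sections could map onto the classes named above, here is a minimal sketch that parses the example JSON with System.Text.Json. The class names BulbapediaConfiguration and FileExportConfiguration come from this README. The property names (assumed to mirror the JSON keys), the AppSettings root type and the SettingsLoader helper are assumptions made for the example, and the project itself may load its configuration in a different way.

```csharp
using System;
using System.Text.Json;

// Assumption: property names mirror the JSON keys shown in the example above.
public class BulbapediaConfiguration
{
    public string BaseUrl { get; set; } = "";
    public string BaseImageUrl { get; set; } = "";
    public string PokemonListPath { get; set; } = "";
    public string EvolutionListPath { get; set; } = "";
    public string MegaEvolutionListPath { get; set; } = "";
    public string FormsListPath { get; set; } = "";
}

public class FileExportConfiguration
{
    public string FileFullPath { get; set; } = "";
}

// Hypothetical root type holding both sections.
public class AppSettings
{
    public BulbapediaConfiguration BulbapediaConfiguration { get; set; } = new BulbapediaConfiguration();
    public FileExportConfiguration FileExportConfiguration { get; set; } = new FileExportConfiguration();
}

// Hypothetical helper, not part of the project.
public static class SettingsLoader
{
    public static AppSettings Parse(string json)
    {
        // The JSON keys are camelCase, so match property names case-insensitively.
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        return JsonSerializer.Deserialize<AppSettings>(json, options)
               ?? throw new JsonException("Configuration JSON was empty.");
    }
}
```

With this sketch, SettingsLoader.Parse(File.ReadAllText(path)) would return an object whose BulbapediaConfiguration.BaseUrl plus PokemonListPath form the full URL of the Pokémon list page.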

Architecture

The project has three main folders that separate its concerns: Configurations, Models and Services.

Configurations

This folder holds the classes that map the configuration sections. They are used to read the file export and Bulbapedia URL settings, as explained in the previous section.

Models

This folder contains the domain models, which map all the data that goes into the database. The main class is Pokemon, which holds the lists of evolutions, mega evolutions, types and forms. A subfolder named Comparers contains the TypeEqualityComparer, which the project uses to compare types.
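
To give an idea of what a comparer such as TypeEqualityComparer does, here is a minimal sketch built on the standard IEqualityComparer<T> interface. The PokemonType class and its Name property are hypothetical stand-ins, and comparing by name while ignoring case is an assumption; the project's actual model and comparison rule are not shown in this README.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the project's type model.
public class PokemonType
{
    public string Name { get; set; } = "";
}

public class TypeEqualityComparer : IEqualityComparer<PokemonType>
{
    // Assumption: two types are equal when their names match, ignoring case.
    public bool Equals(PokemonType x, PokemonType y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return string.Equals(x.Name, y.Name, StringComparison.OrdinalIgnoreCase);
    }

    // The hash code must agree with Equals, so it uses the same case-insensitive rule.
    public int GetHashCode(PokemonType obj)
    {
        return StringComparer.OrdinalIgnoreCase.GetHashCode(obj.Name ?? "");
    }
}
```

A comparer like this lets LINQ collapse repeated types scraped from different rows, for example types.Distinct(new TypeEqualityComparer()).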

Services

This folder contains the project's logic, separated into three contexts: FileExport, Scrapers and ScriptGenerator.

FileExport

This service is responsible for exporting the script to a file at the configured location.
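
The export step can be pictured with the short sketch below, which writes a script string to the path taken from fileFullPath. The FileExportSketch class and its Export method are hypothetical names, and creating the target directory when it is missing is an assumption rather than documented behaviour of the project.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not the project's FileExport service.
public static class FileExportSketch
{
    public static void Export(string fileFullPath, string script)
    {
        // Assumption: create the target directory if it does not exist yet.
        var directory = Path.GetDirectoryName(Path.GetFullPath(fileFullPath));
        if (!string.IsNullOrEmpty(directory))
        {
            Directory.CreateDirectory(directory);
        }

        // Write UTF-8 without a byte order mark so names such as "Pokémon" survive intact.
        File.WriteAllText(fileFullPath, script, new UTF8Encoding(false));
    }
}
```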

Scrapers

This service is responsible for reading the data from Bulbapedia and converting it to the model objects. It has one scraper for each path in the configuration, and each scraper has logic specific to its page, because the lists share the same layout but have different structures.

The PokemonList scraper needs to run first: it creates the Pokémon list that the other scrapers use.
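
The ordering constraint above can be sketched as a small pipeline: one step builds the list, and every other scraper then fills in details on that same list. All names here (PokemonSketch, IDetailScraper, StubEvolutionScraper, ScrapePipeline) are hypothetical, and the stub does no HTML parsing or network access. It only illustrates why the list has to exist before the other scrapers run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down model; the project's Pokemon class holds more data.
public class PokemonSketch
{
    public string Name { get; set; } = "";
    public List<string> Evolutions { get; } = new List<string>();
}

// Hypothetical contract for the scrapers that run after the list exists.
public interface IDetailScraper
{
    void Apply(IList<PokemonSketch> pokemon);
}

// Stub standing in for an evolution scraper; a real one would parse the page.
public class StubEvolutionScraper : IDetailScraper
{
    public void Apply(IList<PokemonSketch> pokemon)
    {
        // It can only attach data to entries the list step already created.
        var bulbasaur = pokemon.FirstOrDefault(p => p.Name == "Bulbasaur");
        if (bulbasaur != null)
        {
            bulbasaur.Evolutions.Add("Ivysaur");
        }
    }
}

public static class ScrapePipeline
{
    // buildList stands in for the PokemonList scraper and always runs first.
    public static IList<PokemonSketch> Run(
        Func<IList<PokemonSketch>> buildList,
        IEnumerable<IDetailScraper> otherScrapers)
    {
        var pokemon = buildList();
        foreach (var scraper in otherScrapers)
        {
            scraper.Apply(pokemon);
        }
        return pokemon;
    }
}
```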

ScriptGenerator

This service is responsible for converting the Pokémon list to a Cypher script. It goes through every Pokémon's properties and creates the script's nodes and relationships.
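
As an illustration of this conversion, the sketch below turns a small in-memory list into Cypher statements. The PokemonEntry class, the node labels (Pokemon, Type) and the relationship names (HAS_TYPE, EVOLVES_TO) are assumptions chosen for the example. The graph schema the project actually emits is not described in this README.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical, simplified model used only for this sketch.
public class PokemonEntry
{
    public int Number { get; set; }
    public string Name { get; set; } = "";
    public List<string> Types { get; } = new List<string>();
    public List<string> EvolvesTo { get; } = new List<string>();
}

public static class CypherSketch
{
    // Escapes backslashes and single quotes for a single-quoted Cypher string literal.
    private static string Quote(string value)
    {
        return "'" + value.Replace("\\", "\\\\").Replace("'", "\\'") + "'";
    }

    public static string Generate(IEnumerable<PokemonEntry> pokemon)
    {
        var list = pokemon.ToList();
        var script = new StringBuilder();

        // First pass: one node per Pokémon, so later relationships can match both ends.
        foreach (var p in list)
        {
            script.AppendLine($"MERGE (:Pokemon {{number: {p.Number}, name: {Quote(p.Name)}}});");
        }

        // Second pass: type and evolution relationships.
        foreach (var p in list)
        {
            foreach (var type in p.Types)
            {
                script.AppendLine($"MATCH (p:Pokemon {{name: {Quote(p.Name)}}}) MERGE (t:Type {{name: {Quote(type)}}}) MERGE (p)-[:HAS_TYPE]->(t);");
            }
            foreach (var target in p.EvolvesTo)
            {
                script.AppendLine($"MATCH (a:Pokemon {{name: {Quote(p.Name)}}}), (b:Pokemon {{name: {Quote(target)}}}) MERGE (a)-[:EVOLVES_TO]->(b);");
            }
        }
        return script.ToString();
    }
}
```

MERGE is used instead of CREATE in this sketch so that running the script twice does not duplicate nodes. Whether the project makes the same choice is not stated here.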

Upcoming features