CoIArt / VVG-Gallery-Scrapy

Scraping code for the Vincent van Gogh museum gallery, used to create an ML model dataset
MIT License
datasets-creatiom image-processing machine-learning python3 scrapper vincent-van-gogh

Vincent van Gogh Gallery Scraper

This project builds a dataset to train a Machine Learning model on the Vincent van Gogh collection. The script scrapes the museum's web page and recovers all the available information about Vincent's works, including each painting's description, its search tags, the collection data, the image file, and related work.

The script creates a local gallery from the web requests within a specific path and creates a CSV file compiling all the data recovered from the online gallery.
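As a minimal sketch of the CSV round trip described above, the snippet below writes gallery rows to a file and reads them back. It uses only the standard library `csv` module as a stand-in for the project's own dataframe handling; the column names (`ID`, `TITLE`, `URL`) and the function names are assumptions for illustration, not the project's actual schema or API.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real CSV schema is not documented here.
COLUMNS = ["ID", "TITLE", "URL"]


def save_gallery(rows, csv_path):
    """Write a list of dict rows to a CSV file, creating parent folders."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def load_gallery(csv_path):
    """Read the CSV file back into a list of dict rows (all values as strings)."""
    with Path(csv_path).open("r", newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

Note that values come back as strings after loading, so any numeric or boolean column would need to be converted explicitly.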

Each option in the menu completes the gallery information by scraping a specific column.

  1. Creates the gallery's index, recovering the ID, the title and the target URL used to scrape the rest of the information.
  2. Saves the gallery's information into a CSV file.
  3. Loads the gallery's information from the CSV file.
  4. Checks the current gallery's dataframe description.
  5. Scrapes the basic description data for each gallery object.
  6. Recovers the download link to the image of each gallery object.
  7. Sets a boolean flag indicating whether each image is available in the local directory.
  8. Scrapes the search tags related to each gallery object.
  9. Scrapes the museum's Object-Data related to each gallery object.
  10. Scrapes the related work of each gallery object.
  11. Exports each available image as RGB and B&W images.
  12. Exports all available data from the dataframe to JSON files in the local directory.
  13. Full automatic execution from step 5 to step 12.
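
Two of the steps above (7 and 12) can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the project's implementation: the `HAS_IMAGE` column, the `<ID>.jpg` file naming, and both function names are hypothetical, and plain dictionaries stand in for the dataframe rows.

```python
import json
from pathlib import Path


def flag_local_images(rows, image_dir):
    """Step 7 (sketch): mark each row with whether its image exists locally."""
    image_dir = Path(image_dir)
    for row in rows:
        # Assumed naming convention: one "<ID>.jpg" file per gallery object.
        row["HAS_IMAGE"] = (image_dir / f"{row['ID']}.jpg").is_file()
    return rows


def export_json(rows, out_dir):
    """Step 12 (sketch): write one JSON file per gallery object."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in rows:
        target = out_dir / f"{row['ID']}.json"
        target.write_text(
            json.dumps(row, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(target)
    return written
```

Keeping each step as a small function that takes and returns the rows makes it straightforward to chain them, which is the idea behind the automatic execution in option 13.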

Originally developed as the final project for the MSc degree in Digital Humanities, between 2020 and 2021.

The code was refactored and commented for the official and final presentation for the 2020/2021 project of the Uniandes Digital Humanities graduate program.


Development Environment

TODO add IDE version, pyliter, python version, bs4, Selenium, pandas + links


Project Structure

LICENSE: MIT Project license description.

README: Project general description.

PROJECT STRUCTURE:


Data Structure

The description of the CSV files inside the *\Data folder goes as follows:


Important Notes