Vincent van Gogh Gallery Scraper
To train a Machine Learning model based on the Vincent van Gogh collection data.
The script scrapes the museum's webpage and recovers all the available information
about Vincent's work, including each painting's description, its search tags, the
collection data, the image file, and related works.
This script creates a local gallery from the web requests within a specific path
and creates a CSV file compiling all the data recovered from the online gallery.
Each menu option completes the gallery information by scraping a specific column:
- Creates the gallery's index, recovering the ID, the title, and the target URL used to
scrape the rest of the information (see the sketch after this list).
- Saves the gallery's information into a CSV file.
- Loads the gallery's information from the CSV file.
- Checks the current gallery's dataframe description.
- Scrapes the basic description data for each of the gallery's objects.
- Recovers the download link for each gallery object's image.
- Sets a boolean flag indicating whether each image is available in the local directory.
- Scrapes the search tags related to each of the gallery's objects.
- Scrapes the museum's object data related to each of the gallery's objects.
- Exports each available image as RGB and B&W images.
- Exports all available data from the dataframe to JSON files in the local directory.
- Full automatic execution from step 5 to step 12.
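For illustration, the sketch below shows how the index step (ID, TITLE, and COLLECTION_URL) could be built with bs4 and pandas. The collection URL and the CSS selector are assumptions made for this example; the project itself recovers the index through the Page class described in the Project Structure section.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Assumed entry point of the online collection (illustrative only).
COLLECTION_URL = "https://www.vangoghmuseum.nl/en/collection"

response = requests.get(COLLECTION_URL, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
# Assumed markup: every artwork is rendered as a link into /collection/.
for link in soup.select("a[href*='/collection/']"):
    url = link.get("href", "")
    title = link.get("title") or link.get_text(strip=True)
    # The last URL segment is used as the element ID and local folder name.
    element_id = url.rstrip("/").split("/")[-1]
    rows.append({"ID": element_id, "TITLE": title, "COLLECTION_URL": url})

index_df = pd.DataFrame(rows, columns=["ID", "TITLE", "COLLECTION_URL"])
index_df.to_csv("vanGoghGallery_index.csv", index=False)
```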
Originally developed as the final project for the Digital Humanities
MSc. degree between 2020 and 2021.
The code was refactored and commented for the official and final presentation
of the 2020/2021 project of the Uniandes Digital Humanities graduate program.
Development Environment
TODO add IDE version, pyliter, python version, bs4, Selenium, pandas + links
Project Structure
LICENSE: MIT Project license description.
README: Project general description.
PROJECT STRUCTURE:
- *\App: the main folder with the MVC (Model-View-Controller)
architecture of the script. To run it, execute the view.py file and follow
the console instructions.
- Model.py: module containing the Gallery class, where the pandas
dataframe works with the Page implementation to format the scraped data.
- View.py: Console interface to create, populate and save the gallery's dataframe.
- Controller.py: module connecting Model.py and View.py; it
controls the export process to JSON format and all the data-cleaning functions.
- *\Data: folder containing the CSV files with the gallery's
scraped data.
- _vanGoghGallery_large.csv_: large gallery file with 964 records of Vincent van
Gogh's work.
- _vanGoghGallery_small.csv_: small gallery file with 61 records of Vincent van
Gogh's work, useful for functional tests.
- *\Lib: the main folder containing the modules and classes used for
scraping the gallery's online data.
- *\Recovery: contains the Content.py module with the Page class
for scraping the VVG museum HTML pages.
- *\Utils: contains the Error.py module with the reraise method to
trace back errors during the code's execution.
- *\Tests: folder containing basic experiments and proofs of
concept for the code developed in *\Lib.
- _test_page.py_: basic tests for the Page class and its methods.
- _test_selenium_bs4.py_: proof of concept for using Selenium with bs4 on the
collection index (see the sketch below).
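The proof of concept in _test_selenium_bs4.py_ combines Selenium (to render the JavaScript-driven index) with bs4 (to parse the rendered HTML). A minimal sketch of that pattern follows; the driver setup and the URL are assumptions and depend on the local Selenium installation (see Important Notes).

```python
from selenium import webdriver
from bs4 import BeautifulSoup

# Assumes a Chrome/chromedriver installation available to Selenium.
driver = webdriver.Chrome()
try:
    driver.get("https://www.vangoghmuseum.nl/en/collection")
    # Selenium renders the dynamic page; bs4 parses the resulting source.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    links = [a.get("href") for a in soup.select("a[href*='/collection/']")]
    print(f"Recovered {len(links)} collection links")
finally:
    driver.quit()
```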
Data Structure
The columns of the CSV files inside the *\Data folder are described as follows (a loading sketch follows this list):
- ID: element ID in the gallery and local folder name.
- TITLE: title of the element in the gallery.
- COLLECTION_URL: recovered element (painting) URL.
- DOWNLOAD_URL: direct image URL/link for the image in the gallery.
- HAS_PICTURE: boolean indicating whether there is a picture file in the local folder.
- DESCRIPTION: JSON with the description of the element.
- SEARCH_TAGS: JSON with the collection tags of the element.
- OBJ_DATA: JSON with the museum object data of the element.
- RELATED_WORKS: JSON with the related work text and URLs of the element.
- IMG_DATA: numpy RGB matrix created from the original image.
- IMG_SHAPE: numpy shape information from the original image.
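As an example of how these CSV files can be consumed, the sketch below loads one of the files from *\Data and decodes its JSON columns. The file path and the assumption that the JSON columns are stored as strings (with empty cells as NaN) are illustrative.

```python
import json
import pandas as pd

# Path is illustrative; adjust it to the local repository layout.
gallery_df = pd.read_csv("Data/vanGoghGallery_small.csv")

json_columns = ["DESCRIPTION", "SEARCH_TAGS", "OBJ_DATA", "RELATED_WORKS"]
for column in json_columns:
    # Decode only the cells that actually hold a JSON string.
    gallery_df[column] = gallery_df[column].apply(
        lambda cell: json.loads(cell) if isinstance(cell, str) else cell
    )

print(gallery_df[["ID", "TITLE", "HAS_PICTURE"]].head())
```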
Important Notes
- Config.py files are Python scripts that work around the relative imports of the
project's local dependencies. They are needed in all script folders, such as *\Lib
and *\Recovery (a minimal sketch of the pattern follows these notes).
- Selenium needs a special installation and configuration to execute in the
local repository. For more information, go to the URLs:
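For reference, the snippet below shows the usual form of such an import workaround: it prepends the repository root to sys.path so that the *\Lib modules resolve from any script folder. It is only a sketch of the pattern; the project's actual Config.py files may differ.

```python
import os
import sys

# Walk up from this file to the repository root and register it for imports.
PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)
```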