PyAr / CDPedia

CDPedia is a project to make Wikipedia accessible offline

Add CSS scraper #317

Closed fzuccolo closed 3 years ago

fzuccolo commented 3 years ago

Integrate a CSS scraper into the general scraping procedure. As CSS data may be different for each language, scraped content must be saved into each language's dump.

Mechanism:

  1. While scraping articles, extract raw CSS links from the HTML head and save them to a file (as soon as they are found, to avoid data loss).
  2. After article scraping is done, parse all collected raw links and extract unique CSS module names (Wikipedia uses a modular system for requesting stylesheets).
  3. Download the CSS of each module into its own file.
  4. Parse all CSS modules to find links to external resources (icons, backgrounds, etc.) and download them.
  5. Combine all available CSS modules into a single stylesheet, retargeting all external URLs to existing local files.
  6. In the generation step, copy the single stylesheet and its resources to the assets/static directory of the CDPedia image.
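
Steps 2 and 5 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`expand_modules`, `unique_modules`, `retarget_urls`) and the `local_names` mapping are hypothetical, and it assumes the raw links are `load.php` URLs whose `modules` parameter uses the packed notation of MediaWiki's ResourceLoader (`|` between groups, `prefix.a,b` for modules sharing a prefix).

```python
import re
from urllib.parse import parse_qs, urlsplit


def expand_modules(packed):
    """Expand a packed string such as 'foo.bar,baz|site.styles' (assumed format)."""
    names = []
    for group in packed.split("|"):
        if "," not in group:
            names.append(group)  # a single module
            continue
        prefix, dot, tail = group.rpartition(".")
        if not dot:
            names.extend(group.split(","))  # modules without a shared prefix
        else:
            names.extend(f"{prefix}.{suffix}" for suffix in tail.split(","))
    return names


def unique_modules(raw_links):
    """Step 2: collect unique module names, keeping first-seen order."""
    seen = {}
    for link in raw_links:
        query = parse_qs(urlsplit(link).query)  # also decodes %7C and %2C
        for packed in query.get("modules", []):
            for name in expand_modules(packed):
                seen.setdefault(name, None)
    return list(seen)


# Matches url(...) with optional single or double quotes around the target.
URL_RE = re.compile(r"url\(\s*(['\"]?)(.*?)\1\s*\)")


def retarget_urls(css, local_names):
    """Step 5: point known external URLs at local files, leave the rest as-is."""
    def swap(match):
        local = local_names.get(match.group(2))
        return f"url({local})" if local else match.group(0)
    return URL_RE.sub(swap, css)
```

URLs that are missing from `local_names` (for example inline `data:` URIs) are left untouched, so a failed download does not corrupt the combined stylesheet.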

Many of the ideas were taken from @spiccinini's preprocess_stylesheets proof of concept.

Visually the results are OK; some minor styling issues were fixed manually. These are some screenshots of the main page in all available languages: es, fr, pt, ay

fzuccolo commented 3 years ago

Ready for review!

fzuccolo commented 3 years ago

Closing; we chose to implement this in parts: #321 #324 #334

fzuccolo commented 3 years ago

Closing: implemented in parts (#321, #324 and #334).