PyAr / CDPedia

CDPedia is a project to make Wikipedia accessible offline

CSS Scraper (2/3): Download CSS modules and associated resources #324

Closed fzuccolo closed 3 years ago

fzuccolo commented 3 years ago

Mechanism

  1. Load the raw CSS links that were extracted while scraping articles.
  2. Extract the unique CSS module names from those raw URLs (Wikipedia uses a modular system for requesting stylesheets).
  3. Download the CSS of each module into its own file (this avoids duplicating CSS rules across files).
  4. Parse all the CSS to find links to external resources (icons, backgrounds, etc.) and download them all.
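
Steps 2 and 4 can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function names `expand_module_names`, `unique_modules` and `find_resource_links` are hypothetical, and it assumes the raw links carry a `modules` query parameter in MediaWiki's packed notation (groups separated by `|`, with `prefix.a,b` standing for `prefix.a` and `prefix.b`).

```python
import re
from urllib.parse import parse_qs, urljoin, urlsplit

# Matches url(...) references in CSS, with or without quotes.
URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def expand_module_names(packed):
    """Expand 'foo.bar,baz|qux' into ['foo.bar', 'foo.baz', 'qux'].

    Assumption: the packed notation described in the lead-in.
    """
    names = []
    for group in packed.split("|"):
        if "," not in group:
            names.append(group)  # a single module
            continue
        prefix, dot, suffixes = group.rpartition(".")
        if not dot:
            names.extend(group.split(","))  # modules without a prefix
            continue
        names.extend(f"{prefix}.{suffix}" for suffix in suffixes.split(","))
    return names


def unique_modules(raw_urls):
    """Collect the distinct module names requested by a list of raw CSS URLs."""
    modules = set()
    for url in raw_urls:
        query = parse_qs(urlsplit(url).query)  # also decodes %7C and %2C
        for packed in query.get("modules", []):
            modules.update(expand_module_names(packed))
    return sorted(modules)


def find_resource_links(css_text, base_url):
    """Return absolute URLs of resources referenced by url(...) in the CSS."""
    links = set()
    for _quote, target in URL_RE.findall(css_text):
        if not target or target.startswith("data:"):
            continue  # inline data URIs need no download
        links.add(urljoin(base_url, target))
    return sorted(links)
```

Because each module name ends up in the set only once, a module requested by many articles is downloaded a single time, which is what keeps rules from being duplicated across files.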

Part 2/3 for addressing issue #294.

fzuccolo commented 3 years ago

Corrections done.

Regarding downloading/scraping, another option is to download all the CSS first and look for the links to other resources afterwards (this simplifies the code a bit but adds unnecessary disk reads).
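
To make the trade-off concrete, here is a small sketch of both orderings. It is only an illustration under assumptions: `single_pass`, `two_pass`, `links_in` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever actually downloads a module's CSS.

```python
import re
from pathlib import Path

# Matches url(...) references in CSS, with or without quotes.
URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def links_in(css_text):
    """Return the raw url(...) targets found in the CSS, skipping data URIs."""
    return {
        target
        for _quote, target in URL_RE.findall(css_text)
        if target and not target.startswith("data:")
    }


def single_pass(modules, fetch, out_dir):
    """Save each module and scan it while its text is still in memory."""
    out_dir = Path(out_dir)
    links = set()
    for name in modules:
        css = fetch(name)  # hypothetical downloader
        (out_dir / f"{name}.css").write_text(css, encoding="utf-8")
        links |= links_in(css)
    return links


def two_pass(modules, fetch, out_dir):
    """Save every module first, then re-read the files to find the links."""
    out_dir = Path(out_dir)
    for name in modules:
        (out_dir / f"{name}.css").write_text(fetch(name), encoding="utf-8")
    links = set()
    for path in sorted(out_dir.glob("*.css")):
        # This extra read is the cost mentioned above.
        links |= links_in(path.read_text(encoding="utf-8"))
    return links
```

Both functions yield the same set of links; the second keeps downloading and parsing fully separate at the price of reading every saved file once more.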

fzuccolo commented 3 years ago

Improved the documentation as discussed in the last meeting. Ready for re-review.