Krystyna-Szybalska / JobSearchHelper

MIT License

POC: Scraping theprotocol.it #2

Open NowanIlfideme opened 7 months ago

NowanIlfideme commented 7 months ago

Create a proof-of-concept for scraping the website theprotocol.it from Python.

Here are some possible libraries to use:

I've worked with bs4 but scrapy seems like something to look at too.
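
As a rough starting point, here is a minimal sketch of turning listing HTML into a list of dicts. It uses the standard library's html.parser as a stand-in for bs4 so it runs without extra installs; the same idea should carry over to bs4 or scrapy. The sample markup, the "offer" class, and the names OfferLinkParser and parse_offers are all made up for illustration; the real theprotocol.it pages will almost certainly look different.

```python
from html.parser import HTMLParser

# Made-up markup for illustration only; the real site's HTML will differ.
SAMPLE_HTML = """
<ul>
  <li class="offer"><a href="/offer/1">Python Developer</a></li>
  <li class="offer"><a href="/offer/2">Data Engineer</a></li>
</ul>
"""


class OfferLinkParser(HTMLParser):
    """Collects one dict per <a> tag found inside <li class="offer">."""

    def __init__(self):
        super().__init__()
        self.offers = []
        self._in_offer = False
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "offer" in (attrs.get("class") or "").split():
            self._in_offer = True
        elif tag == "a" and self._in_offer:
            # Start a new record; the title is filled in by handle_data.
            self._current = {"title": "", "url": attrs.get("href")}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.offers.append(self._current)
            self._current = None
        elif tag == "li":
            self._in_offer = False


def parse_offers(html):
    """Return a list of {"title", "url"} dicts parsed from the given HTML."""
    parser = OfferLinkParser()
    parser.feed(html)
    return parser.offers
```

Calling parse_offers(SAMPLE_HTML) gives one dict per offer link. For the real POC, the HTML string would come from an HTTP request instead of a constant.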

You could put the POC code somewhere in the package or alongside it. Options:

  1. Put it in src/josh/scrapers/theprotocol.py
  2. Put it in poc/theprotocol.py and just create a first script that works.
  3. Put it in a Jupyter notebook under nb/scrape_theprotocol.ipynb so you have outputs to share.

NowanIlfideme commented 7 months ago

Later on, we can create a command-line command, josh scrape theprotocol (maybe with params), and then expand to a similar interface for other sites (though the scrapers themselves will be different).
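
One possible shape for that command, sketched with argparse from the standard library. The --limit option, the SCRAPERS registry, and the placeholder scrape_theprotocol function are assumptions for illustration, not an agreed design; adding another site would mean adding one more entry to the registry.

```python
import argparse


def scrape_theprotocol(limit):
    """Placeholder: the real scraper would fetch and parse pages here."""
    return [{"site": "theprotocol", "limit": limit}]


# Hypothetical registry mapping site names to scraper functions.
SCRAPERS = {"theprotocol": scrape_theprotocol}


def build_parser():
    parser = argparse.ArgumentParser(prog="josh")
    subcommands = parser.add_subparsers(dest="command", required=True)
    scrape = subcommands.add_parser("scrape", help="scrape job offers from a site")
    scrape.add_argument("site", choices=sorted(SCRAPERS))
    # --limit is an invented example of a "param".
    scrape.add_argument("--limit", type=int, default=10)
    return parser


def main(argv=None):
    """Parse the arguments and dispatch to the scraper for the chosen site."""
    args = build_parser().parse_args(argv)
    return SCRAPERS[args.site](limit=args.limit)
```

With this layout, main(["scrape", "theprotocol", "--limit", "3"]) corresponds to running josh scrape theprotocol --limit 3 from the shell.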

NowanIlfideme commented 7 months ago

First, get the results as plain dicts; later we can work on more structured metadata. ;)
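
To make that concrete, a small sketch of the "dicts first, structure later" idea: plain dicts are dumped as JSON Lines now, and the same records could later be mapped onto a dataclass. The field names (title, url, salary) and the JobOffer class are hypothetical; the actual fields depend on what the site exposes.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Example records with invented field names.
offers = [
    {"title": "Python Developer", "url": "/offer/1"},
    {"title": "Data Engineer", "url": "/offer/2", "salary": "unknown"},
]


def to_json_lines(records):
    """Serialize each dict as one JSON object per line."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False, sort_keys=True) for record in records
    )


@dataclass
class JobOffer:
    """A possible later, more structured form of the same data."""

    title: str
    url: str
    salary: Optional[str] = None


def to_structured(record):
    """Convert one raw dict into a JobOffer, tolerating a missing salary."""
    return JobOffer(
        title=record["title"], url=record["url"], salary=record.get("salary")
    )
```

Keeping the raw dicts around means the structured model can change later without re-scraping.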