bitcoinsearch / scraper

Update Bitcoin Stack Exchange Scraping Process #87

Open kouloumos opened 1 week ago

kouloumos commented 1 week ago

Problem:

Our scraper tracks content from multiple Bitcoin-related sources, including the Bitcoin Stack Exchange. However, we haven't received new data from the Bitcoin Stack Exchange in over seven months, because the source we were using for periodic data dumps is no longer being updated.

Proposed Solution:

Replace the data dump process with API-based data collection through the Stack Exchange API.

Tasks:

  1. Investigate Stack Exchange API documentation and identify the endpoints required to fetch relevant data.
  2. Modify the scraping workflow to replace the data dump process with API-based data collection.
  3. Set up API rate-limiting and scheduling to align with Stack Exchange's usage policies, ensuring that our nightly cron jobs run smoothly.
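
To make tasks 1 to 3 more concrete, here is a rough sketch of what paginated, backoff-aware collection could look like. It is not the scraper's actual code: `build_params`, `iter_items`, and `fetch_page_http` are hypothetical names, and the choice of the `/2.3/questions` endpoint, `site=bitcoin`, and sorting by creation date are assumptions to be confirmed during the investigation in task 1. The only API behaviour relied on is the documented response wrapper (`items`, `has_more`, `backoff`) and the fact that responses are compressed.

```python
import gzip
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# Assumed endpoint; answers would need a similar call against /answers.
API_URL = "https://api.stackexchange.com/2.3/questions"


def build_params(page: int, fromdate: int, pagesize: int = 100) -> dict:
    """Query parameters for one page of results (fromdate is a Unix timestamp)."""
    return {
        "site": "bitcoin",  # assumption: API site name for bitcoin.stackexchange.com
        "page": page,
        "pagesize": pagesize,
        "fromdate": fromdate,
        "order": "asc",
        "sort": "creation",
    }


def fetch_page_http(params: dict) -> dict:
    """Fetch one page over HTTP and decode the compressed JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        if response.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    return json.loads(raw)


def iter_items(
    fetch_page: Callable[[dict], dict],
    fromdate: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every item created since fromdate, one page at a time."""
    page = 1
    while True:
        data = fetch_page(build_params(page, fromdate))
        yield from data.get("items", [])
        if not data.get("has_more"):
            break
        # The API may ask clients to wait before calling the same method again.
        if data.get("backoff"):
            sleep(data["backoff"])
        page += 1
```

A nightly cron job could call `iter_items(fetch_page_http, fromdate=last_run_timestamp)`, where `last_run_timestamp` is whatever the scraper stored after its previous run. Passing `fetch_page` and `sleep` as arguments keeps the pagination logic testable without network access.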

kouloumos commented 1 week ago

@elraphty another thing to keep in mind: in the current scraper, the thread_url for answers, as assigned here: https://github.com/bitcoinsearch/scraper/blob/ad894da9f891801ace8bfb5d1aaa5d3c30e2bb6f/bitcoin.stackexchange.com/main.py#L90-L91, is incorrect. The trailing "#" + post.attrib.get("Id") shouldn't be part of thread_url. The correct assignment is:

"thread_url": "https://bitcoin.stackexchange.com/questions/" + post.attrib.get("ParentId"), 

Please keep that in mind for the new implementation.
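
For illustration, a minimal sketch of the corrected behaviour as a pair of helpers. The function names are hypothetical and not part of the current scraper. The first takes a dump-style attribute mapping like `post.attrib`. The second assumes the new implementation receives answer objects from the Stack Exchange API, where the parent question is exposed as `question_id`.

```python
QUESTIONS_BASE_URL = "https://bitcoin.stackexchange.com/questions/"


def thread_url_from_dump_answer(attrib: dict) -> str:
    """thread_url for an answer row from the data dump.

    Points at the parent question only; no "#<Id>" fragment is appended.
    """
    return QUESTIONS_BASE_URL + attrib["ParentId"]


def thread_url_from_api_answer(answer: dict) -> str:
    """thread_url for an answer object returned by the API (assumed shape)."""
    return QUESTIONS_BASE_URL + str(answer["question_id"])
```

Both helpers return the same URL for every answer in a thread, which is what lets answers be grouped under their question.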