achorg / DH-Answers-Archive

Archive version of the DH Q&A website acquired via Wayback Machine in early 2020
https://dhanswers.ach.org/

Syntax issues with URLs #13

Closed ZoeLeBlanc closed 4 years ago

ZoeLeBlanc commented 4 years ago

Currently there are a few syntax issues in topic_url that break our matching logic in the scraping script. If a URL ends with either ? or ?replies=1, it doesn't match the div link. We could either reformat these posts to remove this syntax or add the following code to the scraper.

  1. If '?replies=1' is in topic_url:

    # Drop the '?replies=...' suffix and keep only the part before it.
    if '?replies' in topic_url:
        topic_url = topic_url.split('?replies')[0]
  2. If '?' is at the end of topic_url:

    # Strip only a trailing '?', leaving any '?' elsewhere in the URL alone.
    if topic_url.endswith('?'):
        topic_url = topic_url.rstrip('?')
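
As a possible alternative to the two checks above, both cases could be handled in one step by dropping the query string with the standard library. This is only a sketch: the helper name normalize_topic_url and the example URLs are hypothetical, and it assumes the scraper never needs the query string for matching.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_topic_url(topic_url):
    """Return topic_url with its query string (or a bare trailing '?') removed.

    Hypothetical helper, not part of the existing scraper.
    """
    parts = urlsplit(topic_url)
    # Rebuild the URL with an empty query; the path (including any
    # apostrophe) and the fragment are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


# Example URLs are made up for illustration.
print(normalize_topic_url("http://example.org/topic/some-topic?replies=1"))
print(normalize_topic_url("http://example.org/topic/some-topic?"))
```

Both calls should print the same URL without the trailing syntax, so the result can be compared directly against the div link.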

There's also one URL with an apostrophe, but as long as we use Python 3, we avoid any Unicode errors.

rlskoeser commented 4 years ago

I compared all the topic files with question marks in the path against the non-question-mark versions, and they were either the same content or older versions missing content, so I have removed them from the site archive.
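
For anyone who wants to repeat that kind of check, here is a rough sketch of how the comparison could be scripted. It is not the process actually used here. It assumes the archive is a directory tree where each question-mark variant is a plain file whose name contains '?', and that its counterpart is the same path cut off at the first '?'. The function name find_question_mark_variants is made up.

```python
import filecmp
from pathlib import Path


def find_question_mark_variants(archive_root):
    """List (variant, counterpart, identical) for files with '?' in their path.

    identical is True or False when the counterpart file exists,
    and None when there is nothing to compare against.
    """
    root = Path(archive_root)
    results = []
    for variant in sorted(root.rglob("*")):
        if not variant.is_file():
            continue
        relative = variant.relative_to(root).as_posix()
        if "?" not in relative:
            continue
        # Assumed layout: the counterpart is the path before the first '?'.
        counterpart = root / relative.split("?", 1)[0]
        if counterpart.is_file():
            # shallow=False compares file contents, not just os.stat() data.
            identical = filecmp.cmp(variant, counterpart, shallow=False)
        else:
            identical = None
        results.append((variant, counterpart, identical))
    return results
```

Variants reported as not identical would still need a manual look to confirm they are older copies missing content before removing them.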

The URL with the apostrophe seems to be the actual canonical URL for that page (it appears that way in RSS feeds), so I'm going to leave that one, since we have a way to deal with it.