achorg / DH-Answers-Archive

Archive version of the DH Q&A website acquired via Wayback Machine in early 2020
https://dhanswers.ach.org/

Edits to DHQA Scraper and Unicode Errors #12

Closed ZoeLeBlanc closed 4 years ago

ZoeLeBlanc commented 4 years ago

This PR was primarily to add code for extracting the text from posts so that we could do some analysis on them.

I renamed the content field to html_content, and then added code that strips the paragraph tags from that HTML and stores the resulting text in the content field.
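A minimal sketch of that step, not the PR's actual code: the helper name `strip_paragraph_tags` and the sample post are made up for illustration, and it uses only the standard library rather than whatever parser the scraper relies on.

```python
import html
import re


def strip_paragraph_tags(html_content):
    """Remove <p ...> and </p> tags, keeping one line per paragraph."""
    # Turn each closing tag into a newline so paragraphs stay separated.
    text = re.sub(r"</p\s*>", "\n", html_content, flags=re.IGNORECASE)
    # Drop opening tags; \b keeps tags such as <pre> untouched.
    text = re.sub(r"<p\b[^>]*>", "", text, flags=re.IGNORECASE)
    # Decode entities such as &eacute; into real characters.
    return html.unescape(text).strip()


# Hypothetical post record with the two fields described above.
post = {"html_content": "<p>TEI caf&eacute;</p><p>Second paragraph</p>"}
post["content"] = strip_paragraph_tags(post["html_content"])
```

Keeping the raw markup in html_content means the plain-text version can be regenerated later if the stripping rules change.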

However, I was never able to run the script successfully because I kept getting encoding errors. It seems that the topic tei-in-oxygen-author-problem-with-tables’-labels-in-xsl-fo-1 has an apostrophe in it that prevents it from being handled by string formatting in Python 3. I tried removing it with an if statement (it's still in the script, commented out), but even then I couldn't write to file and instead got this error message:

'ascii' codec can't encode character u'\xe9' in position 27

@rlskoeser I would definitely appreciate your help fixing this. When I rewrote the script to work with pandas I used encoding='utf-8-sig' to get rid of the error, but now I'm wondering why the script isn't working on my computer. Thoughts?
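For reference, a small sketch of writing the output with an explicit encoding, so the result does not depend on the interpreter's default codec. It uses the standard csv module instead of pandas; the file name, column names, and sample row are assumptions for illustration, not taken from the script.

```python
import csv
import os
import tempfile

# Made-up sample row containing a non-ASCII character (e with acute accent).
rows = [{"topic": "caf\u00e9-tables", "content": "TEI caf\u00e9"}]


def write_posts(path, rows):
    # An explicit encoding avoids falling back to a locale default such as ASCII.
    # "utf-8-sig" writes a byte order mark before the UTF-8 data.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["topic", "content"])
        writer.writeheader()
        writer.writerows(rows)


def read_posts(path):
    # Reading with "utf-8-sig" strips the byte order mark again.
    with open(path, encoding="utf-8-sig", newline="") as f:
        return list(csv.DictReader(f))


out_path = os.path.join(tempfile.mkdtemp(), "posts.csv")
write_posts(out_path, rows)
round_trip = read_posts(out_path)
```

Plain "utf-8" would also encode the accented character; the "-sig" variant only adds the byte order mark at the start of the file.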

rlskoeser commented 4 years ago

@ZoeLeBlanc Are you using Python 2 or 3? Asking because I'm not seeing the same error you are reporting, although I did get an error for hitting the Wayback Machine API too many times in a row. 😆 (I'm using python3.6)

I think we should clean up the archive and rename the files with those weird characters. I've corrected a couple of variants with question marks where there was duplicate content but I think there are still a few more.

ZoeLeBlanc commented 4 years ago

You were totally right @rlskoeser about my python version (Shakes fist at pipenv) and now everything is working. The only unicode error I got before was that topic with the apostrophe, so I can change that one in this PR (or make a new one). Thanks for helping me 👏

rlskoeser commented 4 years ago

@ZoeLeBlanc would you mind making an issue on this repo with a list of the topics with punctuation in the url? I hope to clean them up at some point. I know we have the one table and I think a couple with question marks, have you run into any others?
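One possible way to build that list, as a rough sketch: it assumes a "clean" topic slug contains only ASCII letters, digits, hyphens, and underscores, and the function name and the second and third sample slugs are hypothetical.

```python
import re

# Assumption: anything outside these characters counts as punctuation to clean up.
SAFE_SLUG = re.compile(r"[A-Za-z0-9_-]+")


def slugs_with_punctuation(slugs):
    """Return the slugs containing characters outside the safe set."""
    return [slug for slug in slugs if not SAFE_SLUG.fullmatch(slug)]


sample_slugs = [
    "tei-in-oxygen-author-problem-with-tables\u2019-labels-in-xsl-fo-1",  # from this thread
    "what-is-dh?",  # hypothetical slug with a question mark
    "plain-topic-slug",  # hypothetical clean slug
]
flagged = slugs_with_punctuation(sample_slugs)
```

The slugs could come from the archive's file or directory names; how they are collected is left open here.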

ZoeLeBlanc commented 4 years ago

nope! merging in now 👍