achorg / DH-Answers-Archive

Archive version of the DH Q&A website acquired via Wayback Machine in early 2020
https://dhanswers.ach.org/

Create Q&A dataset from static files #1

Closed amandavisconti closed 4 years ago

amandavisconti commented 4 years ago

Some researcher(s) may be interested in scraping and cleaning up the questions and answers from this site, e.g. to create a dataset illustrating some things DH community members were thinking during the site's active years.

rlskoeser commented 4 years ago

From working with the files in the archive, here are my thoughts on what a dataset might look like.

I think the main things people are likely to care about are questions, answers, and (maybe) people.

The questions and answers are all in the files under topic/; profiles are under profile/, but aren't very detailed (also, there might be a distinction between members and non-members that I'm not clear on).

The RSS feeds included in the archive might be a good starting point, since they're already more structured (and include actual dates), but I don't know how complete they are.

Here's a proposed/preliminary list of fields I think we might want for each of these.

questions

answers

people

Some member profiles are fairly detailed - e.g. see elotroalex with website, location, occupation and interests. Not sure yet how common this is.

But it also looks like people could comment without being a member - e.g. see https://dhanswers.ach.org/topic/most-effective-software-for-building-searchable-digital-bibliographic-database/#post-1858

rlskoeser commented 4 years ago

Some use cases for this dataset that I've thought of so far:

ZoeLeBlanc commented 4 years ago

Happy to offer a hand with this @rlskoeser if that would help at all! No worries though if you're already far enough along. Regardless this looks really awesome and excited to eventually check out this dataset 😊

rlskoeser commented 4 years ago

Thank you @ZoeLeBlanc ! Would love to have a collaborator — even reviewing test output would be super helpful. I have a simple start on this in the form of a python script that loads titles and tags from the topic html files. I couldn't figure out a good place to put the script, was thinking of making a gist when I get farther along - any thoughts?

ZoeLeBlanc commented 4 years ago

@rlskoeser gist sounds great 👍 ! Let me know if I can help talk through code choices at all and definitely happy to take a look at test output. Happy to put together a little jupyter notebook once you've generated the dataset to do some initial eda if that would help too.

rlskoeser commented 4 years ago

@ZoeLeBlanc : preliminary script and some data questions.csv.txt.

I'm not sure CSV is great for the multiline html content, though.
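On the CSV concern: a small round-trip check, sketched here with Python's standard `csv` and `json` modules, suggests that multiline HTML survives CSV as long as the file is both written and read with a real CSV parser (the field is quoted). The record and field names below are made up for illustration and are not the dataset's actual columns.

```python
import csv
import io
import json

# Hypothetical record; "url" and "content" are illustrative field names.
rows = [{
    "url": "https://dhanswers.ach.org/topic/timeline-tools/",
    "content": '<p>First line</p>\n<p>Second, with a "quote" and a comma</p>',
}]

# The csv module quotes any field containing newlines, commas, or quotes.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["url", "content"])
writer.writeheader()
writer.writerows(rows)

# Reading back with csv.DictReader recovers the multiline field intact.
buffer.seek(0)
csv_rows = list(csv.DictReader(buffer))

# JSON round-trip for comparison.
json_rows = json.loads(json.dumps(rows))
```

Both formats preserve the content here. The practical difference is that tools which split CSV naively on newlines will break on these rows, while JSON has no such ambiguity.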

(.csv.txt because GitHub doesn't allow uploading CSV for some reason ⁉️)

ZoeLeBlanc commented 4 years ago

Just putting the link to the google colab notebook I'm working in for doing some EDA https://colab.research.google.com/drive/1CSdLUMz3fOzUWXxMWUiQaQoOmDddi5oJ for those interested in following along.

@rlskoeser or @amandavisconti let me know if you want access to the actual notebook, and happy to add you as collaborators (but no pressure either!). I'm just using pandas and altair, which will be easy to either transform into a less code heavy notebook or we can just download the visualizations as pngs or embed them in an html page. Thanks again Rebecca for making this dataset available 👏

rlskoeser commented 4 years ago

@ZoeLeBlanc wow, thank you for exploring this data.

Does CSV work ok for post content or would JSON be better?

So far I've only extracted the questions; it seems like adding the posts would tell us a lot more about activity and people. I was imagining putting questions and responses in different files because they have somewhat different fields, but I could imagine combining them into a single file (they do also have a fair bit of overlap). What do you think would be easier to work with?

Am I reading correctly that there are some records missing post content? That should be fixable. I'm curious about the missing tags — I think there were some posts without tags, but I should probably check that I'm not missing tags. I'm hopeful I'll be able to get the missing dates, too; maybe from the tag-based feeds. (If nothing else we should be able to infer year from the "X years ago" in the html).

ZoeLeBlanc commented 4 years ago

@rlskoeser I actually ended up editing your script to get post content because I was curious! You can see the changes in my gist

Feel free to disregard my edits or use them (whatever's easiest!) but I did end up joining both the author and post content into one file (though that does lead to duplicate entries). For initial posts, I just created that as a boolean field, but could see other ways of doing it too. Personally I prefer one file since it's not that much data, but could also see separating them and just joining them in the notebook.
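A rough sketch of the single-file layout described above: one row per post, with the question-level fields repeated on every row and a boolean marking the initial post. The input structure and all key names are hypothetical, and it assumes the first post in each topic is the question itself.

```python
def flatten_topics(topics):
    """Return one row per post, repeating the question-level fields."""
    rows = []
    for topic in topics:
        for index, post in enumerate(topic["posts"]):
            rows.append({
                "question": topic["title"],
                "url": topic["url"],
                "author": post["author"],
                "content": post["content"],
                # Assumption: the first post in a topic is the question.
                "is_initial_post": index == 0,
            })
    return rows


# Hypothetical input with one topic and two posts.
example = [{
    "title": "Example question",
    "url": "https://example.org/topic/example-question/",
    "posts": [
        {"author": "asker", "content": "<p>How do I ...?</p>"},
        {"author": "responder", "content": "<p>Try ...</p>"},
    ],
}]
flat = flatten_topics(example)
```

Filtering on `is_initial_post` then gives the questions-only view without needing a second file, at the cost of the repeated question fields mentioned above.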

From what I've seen there aren't that many posts without content (just two I think?), but there are a few duplicate questions because the urls ended in question marks (at least that's my hypothesis?). I like the idea of figuring out the missing dates so let me know if I can help do it (sounds like manual data entry).
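If the question-mark hypothesis holds, one way to collapse those duplicates is to normalize each URL before comparing. This is only a sketch: the `url` key is a hypothetical field name, and it assumes the stray `?` appears as a literal empty query string rather than being percent-encoded in the archived file names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_topic_url(url):
    """Drop any query string and fragment; keep a single trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def dedupe_by_url(records):
    """Keep the first record seen for each normalized URL."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_topic_url(record["url"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"url": "https://dhanswers.ach.org/topic/timeline-tools/"},
    {"url": "https://dhanswers.ach.org/topic/timeline-tools/?"},
]
unique_records = dedupe_by_url(records)
```

Keeping the first record is an arbitrary choice; if the two copies differ in content, they would need to be compared rather than dropped.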

ZoeLeBlanc commented 4 years ago

@amandavisconti or @rlskoeser do either of you know what the first pubdate in the RSS feed represents? Looking at this example https://dhanswers.ach.org/rss/topic/timeline-tools/index.xml I'm really confused, because the first pubdate is way later than all the actual post dates. Does this correspond to when the RSS was generated? Thoughts?

Also updates to the web scraping gist means we now have all post dates 🎉 . Gonna post my new notebook in a bit but feel free to make changes!

ZoeLeBlanc commented 4 years ago

I spoke too soon, there are still bugs 😭

Take a look at this post http://digitalhumanities.org/answers/topic/tools-for-making-flow-maps and then look at the RSS feed. If you scroll to the bottom of the RSS, you'll notice it's missing the first post by Miriam. Any ideas as to why? Previous examples I've seen where a post isn't in the RSS is because it's a duplicate but this seems to be a completely unique post 🤔

My latest calculation is that we only have 11 forum questions missing RSS feeds. But it seems like for forum questions with RSS feeds, we still have 74 reply posts missing dates... a mystery 👀 (or maybe I just don't know enough about how RSS works 😂)

rlskoeser commented 4 years ago

@ZoeLeBlanc I started looking at the script based on your revisions and working on some revisions of my own - we need to revise the RSS handling to find items by permalink so we can get the correct dates.
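A minimal sketch of the permalink lookup, using only the standard library. It assumes each feed `item` carries a `link` that matches the post permalink in the HTML and a `pubDate` in the usual RSS 2.0 (RFC 822) format. The sample feed is invented, and this is not the script in the gist.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def pubdates_by_link(rss_text):
    """Map each item's <link> to its parsed <pubDate>."""
    root = ET.fromstring(rss_text)
    dates = {}
    # Only <item> elements are read, so the channel-level pubDate is ignored.
    for item in root.iter("item"):
        link = item.findtext("link")
        pubdate = item.findtext("pubDate")
        if link and pubdate:
            dates[link.strip()] = parsedate_to_datetime(pubdate.strip())
    return dates


# Invented sample feed for illustration.
SAMPLE = """<rss version="2.0"><channel>
<title>Example</title>
<pubDate>Wed, 01 Jan 2020 00:00:00 +0000</pubDate>
<item><link>https://example.org/topic/a/#post-2</link>
<pubDate>Tue, 05 Mar 2013 14:30:00 +0000</pubDate></item>
<item><link>https://example.org/topic/a/#post-1</link>
<pubDate>Mon, 04 Mar 2013 09:00:00 +0000</pubDate></item>
</channel></rss>"""

dates = pubdates_by_link(SAMPLE)
```

Looking posts up by permalink instead of by position means a feed that omits some posts still yields correct dates for the ones it does include.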

I bet the feed for the post you mentioned doesn't include the original because there were so many responses. RSS typically only includes the most recent posts, and that one must have gotten enough entries that the original is no longer included. It's possible one of the tag-based feeds will have some of the pubdates we're missing, but I don't know.

And yes, I think the first pub date in the RSS feed is the date the feed was published.

🤔 I wonder if older wayback machine archives would have older RSS feeds ...

rlskoeser commented 4 years ago

@ZoeLeBlanc I got inspired by your work on this and adapted some of your script into mine. Updated my gist and generated a new version with these changes:

I'm including relative date and wayback machine timestamp to see if we can calculate at least the year of the post when we don't have an RSS feed entry; I thought we could check the logic against the dates we do have. From glancing at a few, it doesn't seem to be accurate in all cases, unfortunately.
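The year calculation could look roughly like the sketch below. It assumes the relative date is text such as "8 years ago" and that the capture time is a 14-digit Wayback Machine timestamp (`YYYYMMDDhhmmss`); the function names are hypothetical. Because the relative text is rounded, the result can be off by a year near a year boundary, which may explain some of the mismatches.

```python
import re
from datetime import datetime, timedelta

# Approximate lengths; months and years are deliberately rough.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}


def parse_wayback_timestamp(timestamp):
    """Parse a 14-digit Wayback Machine timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def estimate_year(relative, capture_timestamp):
    """Estimate a post's year from relative text and the capture time.

    Returns None when the relative text is not recognized.
    """
    match = re.search(r"(\d+)\s+(day|week|month|year)s?\s+ago", relative)
    if match is None:
        return None
    count, unit = int(match.group(1)), match.group(2)
    captured = parse_wayback_timestamp(capture_timestamp)
    return (captured - timedelta(days=count * UNIT_DAYS[unit])).year


estimated = estimate_year("8 years ago", "20200215120000")
```

Running the same function over posts that do have an RSS date, and counting how often the estimated year matches, would give a direct measure of how far the estimate can be trusted.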

Thanks for pointing out the content with question marks. I've cleaned up at least one of those, but it looks like there are a few more.

dhqa_data.csv.txt

ZoeLeBlanc commented 4 years ago

@rlskoeser thanks for making these edits, they look awesome! Just going through your script now and gonna test out the new dataset in the colab notebook. I hadn't even thought of most of these issues, so really appreciate this 👍

rlskoeser commented 4 years ago

@ZoeLeBlanc added the script to this repo, as we discussed. Also revised field names to use underscores instead of spaces.

ZoeLeBlanc commented 4 years ago

Wondering if it would be helpful for me to start writing up a dataset bio for the web scraped data? Figure we could add it to the README, and it would just outline what the data is in each field of the csv, and maybe a bit about the rationale of how everything is organized. I guess I'm worried in two weeks I'm going to forget what is what exactly 😅 🙈

Also this would help with the EDA too so that I don't assume fields represent something they don't. Could make a separate issue for discussing this too if you both think it's a good idea

rlskoeser commented 4 years ago

@ZoeLeBlanc Starting a list of the fields before we forget would be smart!

Would it make sense for the dataset to belong in a separate repo, so we can deposit it in Zenodo separately? What do you think about starting that repo now and we can put preliminary data and readme there.

ZoeLeBlanc commented 4 years ago

I love that idea @rlskoeser! Think that would make a lot of sense for long term storage and help us keep versions of the dataset as well. I've never worked with Zenodo, but happy to help set up the repo and README.

rlskoeser commented 4 years ago

Closing since @ZoeLeBlanc and I now have a preliminary dataset with all of the post fields we care about, and have decided to keep the data with this archive.