achorg / DH-Answers-Archive

Archive version of the DH Q&A website acquired via Wayback Machine in early 2020
https://dhanswers.ach.org/

Missing Dates in Web Scraped Dataset #14

Open ZoeLeBlanc opened 4 years ago

ZoeLeBlanc commented 4 years ago

Currently, because of issues with the wayback archive and how the RSS feeds were generated, we're missing quite a few dates from our dataset.

If we fix the issues in #13 we will still have 157 unique posts without any dates (representing 27 unique topic urls). Sometimes it's just one post that's missing, but often none of the posts will have dates.

One option is to manually enter these dates using the relative date (though it's unclear to me what exactly it's relative to).

The other option is we programmatically backfill dates, checking if there are posts within a topic with dates or getting the following topic and using that as an initial proxy. I wrote up some code to try doing this backfill logic in the colab notebook that you can see here.
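For anyone reading without access to the notebook, here is a minimal sketch of the backfill idea described above. It is not the notebook code: the field names (`topic`, `position`, `date`) are hypothetical, it uses plain Python rather than whatever the notebook uses, and it assumes the posts are already sorted so that the "following topic" is simply the next topic in the list.

```python
from datetime import date


def backfill_dates(posts):
    """Return copies of posts with 'date_estimate' and 'date_source' added.

    posts: list of dicts with hypothetical keys 'topic', 'position', 'date',
    assumed to be sorted by topic order and then by position in the topic.
    """
    topics = []   # topic ids in order of first appearance
    known = {}    # topic id -> list of known dates in that topic
    for post in posts:
        if post["topic"] not in known:
            known[post["topic"]] = []
            topics.append(post["topic"])
        if post["date"] is not None:
            known[post["topic"]].append(post["date"])

    def proxy_for(topic):
        # First choice: the earliest known date within the same topic.
        if known[topic]:
            return min(known[topic]), "same_topic"
        # Otherwise walk forward to the next topic that has any date.
        for later in topics[topics.index(topic) + 1:]:
            if known[later]:
                return min(known[later]), "following_topic"
        return None, "none"

    filled = []
    for post in posts:
        row = dict(post)
        if post["date"] is not None:
            row["date_estimate"], row["date_source"] = post["date"], "original"
        else:
            row["date_estimate"], row["date_source"] = proxy_for(post["topic"])
        filled.append(row)
    return filled


# Made-up example rows, not taken from the dataset.
example = [
    {"topic": "a", "position": 1, "date": date(2011, 5, 2)},
    {"topic": "a", "position": 2, "date": None},
    {"topic": "b", "position": 1, "date": None},
    {"topic": "c", "position": 1, "date": date(2011, 6, 10)},
]
result = backfill_dates(example)
```

Keeping a `date_source` value next to each estimate makes it easy to filter proxied dates back out later if better dates turn up.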

Given that I can use the backfill code, this isn't a huge rush, but would be something that would be good to get sorted eventually.

rlskoeser commented 4 years ago

@ZoeLeBlanc I have an idea that we might be able to use the relative date and the snapshot date to get years, at least. The latest version of the CSV includes a relative date (the "2 years ago" from the html pages) and a snapshot date (wayback machine timestamp for the last capture of that URL). I included those values for all posts, thinking we could use it to calculate a year relative to snapshot and see how accurate it is against known dates. How hard would this be to do in your colab notebook?
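A rough sketch of that calculation, under a few assumptions: the relative date looks like "2 years ago", the snapshot date is a 14-digit wayback timestamp (`YYYYMMDDhhmmss`), and months and years can be approximated as 30 and 365 days. The function name `estimate_year` is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Approximate unit lengths in days; good enough for a year-level estimate.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}


def estimate_year(relative, snapshot):
    """Estimate a post's year from a relative date and a snapshot timestamp.

    relative: text such as "2 years ago" (assumed format)
    snapshot: wayback-style timestamp, e.g. "20150312083045" (assumed format)
    Returns an int year, or None if the relative date cannot be parsed.
    """
    match = re.match(r"(\d+)\s+(day|week|month|year)s?\s+ago", relative.strip())
    if match is None:
        return None
    count, unit = int(match.group(1)), match.group(2)
    captured = datetime.strptime(snapshot, "%Y%m%d%H%M%S")
    return (captured - timedelta(days=count * UNIT_DAYS[unit])).year


year_a = estimate_year("2 years ago", "20150312083045")
year_b = estimate_year("3 months ago", "20150201000000")
year_c = estimate_year("yesterday", "20150201000000")
```

Since "2 years ago" presumably covers anything from two up to almost three years, the estimate could easily be off by one year even if the relative date really is relative to the snapshot.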

I had another idea for getting dates for the topics where we don't have any date: now that we have access to the DH Q&A twitter account it occurred to me we should use the account export function. It was configured to tweet new questions, so if the topics with missing questions were auto-tweeted we can get dates (or near dates) for them.

rlskoeser commented 4 years ago

I created a data/ directory and added a text file with the output from the wayback cdx server API for all DH Answers topic urls. It has dates for every capture of every url — gives us another possible angle on getting approximate dates for the pages we're missing RSS feeds for.
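As a sketch of how that file could be used: the snippet below assumes the default plain-text CDX output, where each line is space-separated and the second and third fields are the capture timestamp and the original URL. If the file in data/ was requested with different fields, the indexes would need adjusting. The sample lines use a placeholder URL, not real captures.

```python
from datetime import datetime


def earliest_captures(cdx_lines):
    """Map each original URL to the datetime of its earliest capture.

    Assumes default CDX field order:
    urlkey timestamp original mimetype statuscode digest length
    """
    earliest = {}
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        timestamp, url = fields[1], fields[2]
        captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
        if url not in earliest or captured < earliest[url]:
            earliest[url] = captured
    return earliest


# Placeholder lines for illustration only.
sample = [
    "org,example)/topic/a 20120105101500 http://example.org/topic/a text/html 200 AAAA 1024",
    "org,example)/topic/a 20110320080000 http://example.org/topic/a text/html 200 BBBB 1030",
    "org,example)/topic/b 20130701120000 http://example.org/topic/b text/html 200 CCCC 998",
    "",
]
first_seen = earliest_captures(sample)
```

The earliest capture only tells us a topic existed by that date, so it works as an upper bound on the first post's date rather than the date itself.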

ZoeLeBlanc commented 4 years ago

Thanks for working through this @rlskoeser! These all seem like awesome directions to try and get at the original dates. One thing I was confused about is what exactly the relative date is relative to. My understanding from your answer is that it references the wayback snapshot. Is that right? If so, then it will be very easy to calculate the time delta.

Also really fascinated by the wayback captures... might quickly spin up a visualization to see how frequent those captures were and if it was evenly distributed across postings.

Happy to give you access to the colab notebook too if that would be helpful 😊

rlskoeser commented 4 years ago

@ZoeLeBlanc it is my guess that the relative date is relative to the wayback capture, but when I was glancing at the preliminary data it didn't seem to be totally accurate. I was hoping a systematic look at the relative dates + wayback snapshots would make it obvious if that guess is correct.
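One way to do that systematic check, sketched under the assumption that for posts with known dates we can pair the known year with a year estimated from relative date + snapshot: tally how far off each estimate is. The function name and the example pairs are invented for illustration.

```python
from collections import Counter


def year_error_counts(rows):
    """Count estimate errors (estimated year minus known year).

    rows: iterable of (known_year, estimated_year) pairs; pairs with a
    missing value on either side are skipped.
    """
    errors = Counter()
    for known_year, estimated_year in rows:
        if known_year is None or estimated_year is None:
            continue
        errors[estimated_year - known_year] += 1
    return errors


# Invented pairs, not real data from the CSV.
pairs = [(2012, 2012), (2012, 2013), (2011, 2011), (None, 2012)]
counts = year_error_counts(pairs)
```

If nearly everything lands on 0 (or consistently on one offset), that would support the guess that the relative date is relative to the capture; a wide spread would suggest it is relative to something else.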

It's been fun doing some sleuthing to try to figure out how we can determine these missing dates! I agree the wayback captures are fascinating, and I would be very interested to see a visualization of how evenly and frequently the site was captured. (Reminds me of research years ago visualizing how search engines indexed sites in passes, going deeper on successive indexes...)

I'm not very familiar with colab notebooks — but if you're up for it, maybe we could find a time to do some pair programming on this and you can help me get more familiar?

ZoeLeBlanc commented 4 years ago

Just sent you an invite to the colab notebook as an editor. I loaded in the wayback data but have no idea what it represents, so it would be great to have your help (you can see it here). Also would be happy to pair program, though full disclosure: I only started using colab for this project, but it's pretty similar to jupyter notebooks so far. Happy to chat more via slack if that's easier for scheduling 😊