ArthurHeitmann / arctic_shift

Making Reddit data accessible to researchers, moderators and everyone else. Interact with the data through large dumps, an API or web interface.
https://arctic-shift.photon-reddit.com

Clarification Regarding Data Anomalies #24

Open TehilaC opened 3 weeks ago

TehilaC commented 3 weeks ago

Hello,

I am currently working on my thesis and have been using your Reddit dataset for my research. I want to express my gratitude for the hard work that has gone into compiling this valuable resource.

While analyzing the data, I encountered an issue that I couldn't resolve through the available documentation. Specifically, I noticed that some entries have negative "ups" values. To the best of my understanding, this field is supposed to represent upvotes, so I am unsure how it can contain negative values. Additionally, I observed that the "ups" value is always identical to the "score" value, while the "downs" field is either empty or zero.

Could you please clarify if this is expected behavior? Is it correct to assume that the "ups" and "score" values are currently identical and that separate "downs" data is not available?

I would greatly appreciate any insights you can provide.

Thanks again for your valuable work. Please note that I will, of course, properly cite your dataset in my final paper :)

ArthurHeitmann commented 3 weeks ago

In the early days of Reddit, you could see exactly how many up and down votes something had. That's where the ups, downs and score (ups minus downs) fields come from. Then at some point Reddit decided to only show the total score. But for backwards compatibility reasons, the original fields remain. downs is now always 0, and ups and score are the same. For posts you can estimate the ups and downs using the upvote_ratio.
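As a rough illustration of that estimate, here is a minimal Python sketch. It assumes score = ups - downs and upvote_ratio = ups / (ups + downs); Reddit does not document the exact formula, and the values are fuzzed, so the result is only an approximation. The function name estimate_votes is made up for this example.

```python
def estimate_votes(score, upvote_ratio):
    """Estimate (ups, downs) for a post from its score and upvote_ratio.

    Assumptions (not confirmed by Reddit):
      score        = ups - downs
      upvote_ratio = ups / (ups + downs)
    Under these, score = (2 * upvote_ratio - 1) * (ups + downs).
    Returns None when the inputs cannot be solved consistently.
    """
    denom = 2 * upvote_ratio - 1
    if abs(denom) < 1e-9:
        # A ratio of 0.5 means ups == downs; the total cannot be recovered.
        return None
    total = score / denom
    if total < 0:
        # Score and ratio disagree in sign (e.g. positive score, ratio < 0.5).
        return None
    ups = round(upvote_ratio * total)
    downs = round(total) - ups
    return ups, downs


if __name__ == "__main__":
    print(estimate_votes(60, 0.8))
    print(estimate_votes(-20, 0.4))
```

Note that the estimate becomes very unstable when upvote_ratio is close to 0.5, because a small rounding error in the ratio changes the computed total a lot.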

In addition, to make vote manipulation more difficult, Reddit fuzzes all vote-related numbers. So all votes are only accurate to within +/- a couple of percent.

But when looking at scores from this dataset, you have to be careful about which values are useful at all, because most content will only have been retrieved once or twice, and the scores have not been updated since.

TehilaC commented 3 weeks ago

Thank you for your response.

I have two clarification questions:

  1. In the last paragraph, do you mean that the data is only updated up to the extraction date for each row?

  2. Is the ratio between the data still accurate after this manipulation?

ArthurHeitmann commented 3 weeks ago

Most objects have a retrieved_on field and some a retrieved_2nd_on field, indicating when the data was retrieved. created_utc says when a thing was posted by the author. If something was only retrieved once and within only a few minutes of its creation, the scores are unreliable. If at least a couple of hours or even days have passed, the score is a lot more accurate.

Regarding upvote ratios, without knowing exactly how Reddit calculates the scores (their algorithms can change over time), it's hard to say for certain.
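A small Python sketch of how such a check could look. It takes one object from the dataset as a dict and compares the latest retrieval time with created_utc. The helper names and the 24-hour threshold are assumptions chosen for illustration, not something the dataset prescribes; pick a threshold that suits your analysis.

```python
def score_age_seconds(obj):
    """Seconds between creation and the latest retrieval, or None if unknown.

    Prefers retrieved_2nd_on when present, otherwise falls back to retrieved_on.
    Timestamps are converted via float() because they may be stored as
    int, float or numeric string.
    """
    retrieved = obj.get("retrieved_2nd_on") or obj.get("retrieved_on")
    created = obj.get("created_utc")
    if retrieved is None or created is None:
        return None
    return int(float(retrieved)) - int(float(created))


def is_score_reliable(obj, min_age_seconds=24 * 3600):
    """True if the score was captured at least min_age_seconds after posting.

    The default of 24 hours is a hypothetical threshold.
    """
    age = score_age_seconds(obj)
    return age is not None and age >= min_age_seconds


if __name__ == "__main__":
    fresh = {"created_utc": 1700000000, "retrieved_on": 1700000060}
    older = {"created_utc": 1700000000, "retrieved_on": 1700000060,
             "retrieved_2nd_on": 1700000000 + 3 * 24 * 3600}
    print(is_score_reliable(fresh), is_score_reliable(older))
```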

If the dataset you're looking at isn't too large, you can also collect all ids and request them again from reddit yourself. Then you'll have the most up to date values.
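One possible way to prepare such a re-request is sketched below in Python. It only builds the request URLs and does not send anything. It assumes Reddit's /api/info endpoint, which accepts a comma-separated list of fullnames (t3_ prefix for posts, t1_ for comments) with up to 100 per request; access rules, authentication and rate limits may differ from what is assumed here, so check Reddit's current API terms before running anything against it. The function name build_info_urls is made up for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint and batch limit; verify against Reddit's current API docs.
INFO_URL = "https://www.reddit.com/api/info.json"
BATCH_SIZE = 100


def build_info_urls(ids, kind="t3"):
    """Build one info URL per batch of up to BATCH_SIZE ids.

    ids  : base36 ids ("abc123") or fullnames ("t3_abc123")
    kind : prefix added to bare ids, "t3" for posts or "t1" for comments
    """
    fullnames = [i if "_" in i else f"{kind}_{i}" for i in ids]
    urls = []
    for start in range(0, len(fullnames), BATCH_SIZE):
        batch = fullnames[start:start + BATCH_SIZE]
        query = urlencode({"id": ",".join(batch)})
        urls.append(f"{INFO_URL}?{query}")
    return urls


if __name__ == "__main__":
    for url in build_info_urls(["abc123", "def456"]):
        print(url)
```

Each URL can then be fetched with an HTTP client of your choice, and the score fields in the response compared against the values in the dump.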