dmarx / psaw

Python Pushshift.io API Wrapper (for comment/submission search)
BSD 2-Clause "Simplified" License

Search functions replace text found in Pushshift with "[removed]" if the content is removed #103

Open dequeued0 opened 2 years ago

dequeued0 commented 2 years ago

If Pushshift has the text but the content has been removed on Reddit, PSAW replaces the text with [removed], which isn't what I expected. If the submission or comment is removed or deleted on Reddit, it would be much better to use the text from Pushshift instead.

If that's not the desired default behavior when using PRAW to initialize PSAW, perhaps a new option could be added to allow "only use Reddit text if not removed/deleted on Reddit" or "always use Pushshift text" behaviors in addition to the current behavior?

Here is an example using a shadow banned account:

>>> import praw
>>> import psaw
>>> import requests
>>> r = praw.Reddit(stuff goes in here)
>>> p = psaw.PushshiftAPI(r, shards_down_behavior=None)
>>> for comment in p.search_comments(author="goldbergdfc", limit=10):
...    print(comment.id, comment.body[:40])
hvhoymn [removed]
hvhog5a [removed]
htcus5a [removed]
htcs5si [removed]
htcg78z [removed]
htc77rw [removed]
ht7vkt2 [removed]
ht7tx1v [removed]
ht7cra8 [removed]
>>> search = { "author": "goldbergdfc", "limit": 10 }
>>> query = requests.get("https://api.pushshift.io/reddit/comment/search", params=search)
>>> query.raise_for_status()
>>> results = query.json()
>>> for result in results["data"]:
...    print(result.get("id"), str(result.get("body"))[:40])
hvhoymn  [https://onlyfans.com/liatorres](https:
hvhog5a [https://onlyfans.com/liatorres](https:/
htcus5a Fun time.............
htcs5si Yeah, no way the girl in front can't fee
htcg78z His playful energy and that expression a
htc77rw That smile is amazing!
ht7vkt2 That's some good mother & son bondin
ht7tx1v Practice makes man perfect.
ht7cra8 Smile and pose until they notice you. Th

Thanks!

dmarx commented 2 years ago

Not a bad idea. The reason it works the way it does right now (haven't touched the code in a while, so my memory is fuzzy) is that PSAW actually only requests the IDs from Pushshift and gets all the content from Reddit, so it doesn't even know there's a delta between the old and new state. It'd be a nasty performance hit to do that diffing in PSAW, but I can also see how the current behavior could be undesirable. In the spirit of the principle of least surprise, we can add some kind of option for merging old data with new.
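
For illustration only, here is a minimal sketch of what such a merge option might look like, assuming the caller already has both the Reddit body and the Pushshift body for the same item. Nothing here is PSAW code: `merge_body` and the three policy names are hypothetical, chosen to mirror the three behaviors proposed in the issue.

```python
from typing import Optional

# Placeholder strings Reddit returns for removed or deleted content.
PLACEHOLDERS = {"[removed]", "[deleted]"}

# Hypothetical policy names (assumptions, not real PSAW options).
REDDIT_ONLY = "reddit_only"        # current behavior: always use Reddit text
PREFER_REDDIT = "prefer_reddit"    # use Reddit text unless removed/deleted
PUSHSHIFT_ONLY = "pushshift_only"  # always use Pushshift text


def merge_body(
    reddit_body: Optional[str],
    pushshift_body: Optional[str],
    policy: str = PREFER_REDDIT,
) -> Optional[str]:
    """Pick which text to keep for one comment or submission."""
    if policy == REDDIT_ONLY:
        return reddit_body
    if policy == PUSHSHIFT_ONLY:
        # Fall back to Reddit only when Pushshift has nothing at all.
        return pushshift_body if pushshift_body is not None else reddit_body
    if policy == PREFER_REDDIT:
        removed = reddit_body is None or reddit_body.strip() in PLACEHOLDERS
        if removed and pushshift_body is not None:
            return pushshift_body
        return reddit_body
    raise ValueError(f"unknown policy: {policy!r}")


# Usage with one of the comments from the example above.
print(merge_body("[removed]", "That smile is amazing!"))
```

Note that this only covers the selection step. As described above, PSAW requests only IDs from Pushshift, so a real implementation would also have to fetch the Pushshift text, which is where the performance cost comes from.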

You know what? It actually looks like PMAW might even do what you're asking for out of the box, not to mention it's also much more actively maintained than PSAW and presumably faster to boot. Maybe I'll also take this opportunity to put PSAW in maintenance mode and add a warning in the deployed code directing people to PMAW. PSAW's had a good run, might be time to pass the torch.