Open dequeued0 opened 2 years ago
Not a bad idea. The reason it works the way it does right now (I haven't touched the code in a while, so my memory is fuzzy) is that PSAW only requests the IDs from Pushshift and gets all the content from Reddit, so it doesn't even know there's a delta between the old and new state. It'd be a nasty performance hit to do that diffing in PSAW, but I can also see how the current behavior could be undesirable. In the spirit of the principle of least surprise, we can add some kind of option for merging old data with new.
You know what? It actually looks like PMAW might even do what you're asking for out of the box, not to mention it's also much more actively maintained than PSAW and presumably faster to boot. Maybe I'll also take this opportunity to put PSAW in maintenance mode and add a warning in the deployed code directing people to PMAW. PSAW's had a good run, might be time to pass the torch.
If Pushshift has text, but the content is removed on Reddit, PSAW replaces the text with
[removed]
which isn't what I expected to happen. If the submission or comment is removed or deleted on Reddit, it would be a lot better to just use the text from Pushshift instead. If that's not the desired default behavior when using PRAW to initialize PSAW, perhaps a new option could be added to allow "only use Reddit text if not removed/deleted on Reddit" or "always use Pushshift text" behaviors in addition to the current behavior?
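To make the three behaviors concrete, here is a rough sketch of the selection logic. It is not PSAW code: `TextSource`, `REMOVED_MARKERS` and `choose_text` are hypothetical names, and it assumes both the Reddit text and the Pushshift text are already in hand, which would mean fetching more than just the IDs from Pushshift.

```python
from enum import Enum

# Placeholder strings Reddit shows for removed or deleted content.
REMOVED_MARKERS = {"[removed]", "[deleted]"}


class TextSource(Enum):
    """Hypothetical option values, one per behavior discussed above."""
    REDDIT = "reddit"                # current behavior: always use Reddit text
    PREFER_REDDIT = "prefer_reddit"  # Reddit text unless removed/deleted there
    PUSHSHIFT = "pushshift"          # always use Pushshift text


def choose_text(reddit_text, pushshift_text, mode=TextSource.REDDIT):
    """Pick which copy of a body/selftext to keep for one item."""
    if mode is TextSource.PUSHSHIFT:
        # Fall back to Reddit only if Pushshift has nothing at all.
        return pushshift_text if pushshift_text is not None else reddit_text
    if mode is TextSource.PREFER_REDDIT:
        removed = reddit_text is None or reddit_text.strip() in REMOVED_MARKERS
        if removed and pushshift_text:
            return pushshift_text
    # Default: keep whatever Reddit returned.
    return reddit_text
```

For example, `choose_text("[removed]", "original body", TextSource.PREFER_REDDIT)` would keep the Pushshift copy, while the default mode would keep `[removed]` as it does today.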
Here is an example using a shadow banned account:
Thanks!