Open dmarx opened 6 years ago
I'm still working on fine-tuning the configuration to get it stable. There appears to be a long timeout that breaks something downstream after awhile. Currently using nginx -> gunicorn -> flask w/ gevent
Trying to isolate what's causing this at the moment. But I intend to support the stream long-term once it is resolved.
Is there any particular reason you chose to implement the stream as a subdomain instead of an endpoint? The way you have it set up at present just seems inconsistent to me. For querying, you've got:
api.[...]/comments/search/?...
and for streaming you've got:
stream.[...]/?type=comments...
The negative consequences to this approach aren't huge, but mainly it means you can't utilize the beta
subdomain to test streams. You could do something like beta.stream.pushshift
, which is still inconsistent unless you also change the api's beta to beta.api.pushshift
. Additionally, there's no circumstance in which the same query string could be passed to either search or stream. Insofar as a query (without an "aggs" parameter) is functionally just a filter on the data, I feel like this should be expected behavior.
I think a better approach would be:
api.[...]/comments/stream/?...
Or more even more generally:
https://<release>.pushshift.io/<dataset>/<logical partition>/<operation>/?<[item_field|filter|aggs|token|data|...]>=...
<operation>
here wouldn't have to be just for consumers: you could add endpoints like/insert/
, /update/
and /delete/
to generalize your internal DML operations as services. This should facilitate distributing/upscaling your data scrapers/monitors (assuming you aren't already doing something like this).
Anyway, this is all just to say:
I hope this didn't come off as condescending, just trying to help you future-proof your project a bit.
Not condescending at all! I agree with you. I've put up a github for the sse_ stream here: https://github.com/pushshift/reddit_sse_stream
To answer your questions:
stream.pushshift.io is for legacy purposes for when there was a stream there. I'd like it to work on that subdomain. That said, I think it makes sense to alias it to the regular API where we could also have beta streams, etc.
/reddit/comment/search -> /reddit/comment/stream (for just a comment stream) /reddit/submission/search -> /reddit/submission/stream (for just a submission stream) /reddit/stream (for a full stream)
Does that make sense to you or do you think there is a better organizational method?
Yeah I think that makes sense
https://www.reddit.com/r/pushshift/comments/8ks4ft/sse_streams_are_back_theres_been_enough_interest/
https://pypi.org/project/sseclient-py/