dmarx / psaw

Python Pushshift.io API Wrapper (for comment/submission search)
BSD 2-Clause "Simplified" License
359 stars 52 forks source link

Add SSE stream support #19

Open dmarx opened 6 years ago

dmarx commented 6 years ago

https://www.reddit.com/r/pushshift/comments/8ks4ft/sse_streams_are_back_theres_been_enough_interest/

https://pypi.org/project/sseclient-py/

pushshift commented 6 years ago

I'm still working on fine-tuning the configuration to get it stable. There appears to be a long timeout that breaks something downstream after awhile. Currently using nginx -> gunicorn -> flask w/ gevent

Trying to isolate what's causing this at the moment. But I intend to support the stream long-term once it is resolved.

dmarx commented 6 years ago

Is there any particular reason you chose to implement the stream as a subdomain instead of an endpoint? The way you have it set up at present just seems inconsistent to me. For querying, you've got:

api.[...]/comments/search/?...

and for streaming you've got:

stream.[...]/?type=comments...

The negative consequences to this approach aren't huge, but mainly it means you can't utilize the beta subdomain to test streams. You could do something like beta.stream.pushshift, which is still inconsistent unless you also change the api's beta to beta.api.pushshift. Additionally, there's no circumstance in which the same query string could be passed to either search or stream. Insofar as a query (without an "aggs" parameter) is functionally just a filter on the data, I feel like this should be expected behavior.

I think a better approach would be:

api.[...]/comments/stream/?...

Or more even more generally:

https://<release>.pushshift.io/<dataset>/<logical partition>/<operation>/?<[item_field|filter|aggs|token|data|...]>=...

<operation> here wouldn't have to be just for consumers: you could add endpoints like/insert/, /update/ and /delete/ to generalize your internal DML operations as services. This should facilitate distributing/upscaling your data scrapers/monitors (assuming you aren't already doing something like this).

Anyway, this is all just to say:

  1. If the stream functionality is still under development, you should consider moving it to an endpoint rather than implementing it as a subdomain. If you already considered this, I'd be interested why you decided a subdomain was preferable.
  2. As you extend your service, you should try to keep the interface as general and intuitive as possible. This will help keep you from needing to reinvent the wheel every time you add a new dataset or feature. The "formula" I suggested isn't necessarily the best approach, but if you haven't already gone through an exercise like this, I'd strongly recommend trying to define criteria for how endpoints are constructed to help maintain consistency across the platform.

I hope this didn't come off as condescending, just trying to help you future-proof your project a bit.

pushshift commented 6 years ago

Not condescending at all! I agree with you. I've put up a github for the sse_ stream here: https://github.com/pushshift/reddit_sse_stream

To answer your questions:

stream.pushshift.io is for legacy purposes for when there was a stream there. I'd like it to work on that subdomain. That said, I think it makes sense to alias it to the regular API where we could also have beta streams, etc.

/reddit/comment/search -> /reddit/comment/stream (for just a comment stream) /reddit/submission/search -> /reddit/submission/stream (for just a submission stream) /reddit/stream (for a full stream)

Does that make sense to you or do you think there is a better organizational method?

dmarx commented 6 years ago

Yeah I think that makes sense