Open CWrecker opened 5 days ago
.take-issue
Label p2 cannot be managed because it does not exist in the repo. Please check your spelling.
Label cannot be managed because it does not exist in the repo. Please check your spelling.
.set-labels P2,python,io,'new feature'
cc: @damondouglas
What would you like to happen?
Apache Beam lacks a native Python-based IO connector that can ingest data directly from a socket. This feature would enable users to easily integrate streaming data sources, such as those emitting messages over TCP/IP sockets, into their Apache Beam pipelines.
Many real-time data sources, such as custom data generators, IoT devices, and legacy systems, send data over sockets. A socket-based IO connector written in Python would let Beam pipelines process this data directly, without requiring users to implement custom socket-reading logic outside the Beam ecosystem.
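For illustration, the kind of custom socket-reading logic users write today might look like the sketch below. It uses only the Python standard library and assumes newline-delimited text messages over TCP; the function name `read_lines` and its parameters are hypothetical and are not part of any Beam API.

```python
import socket
from typing import Iterator, Optional


def read_lines(
    host: str,
    port: int,
    timeout: Optional[float] = None,  # None blocks indefinitely (assumed default)
    encoding: str = "utf-8",
) -> Iterator[str]:
    """Yield newline-delimited messages from a TCP socket until the peer closes.

    Hypothetical helper for illustration only; message framing by "\n" is an
    assumption about the data source.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        buffer = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # an empty read means the peer closed the connection
                break
            buffer += chunk
            # Emit every complete line; keep the trailing partial line buffered.
            *lines, buffer = buffer.split(b"\n")
            for line in lines:
                yield line.decode(encoding)
        # Flush a final message that was not terminated by a newline.
        if buffer:
            yield buffer.decode(encoding)
```

A generator like this could be driven from inside a `DoFn`, but one caveat is worth checking: if a single `process` call never returns, the runner may never finish the bundle or advance the watermark, so downstream windowed aggregations may not fire.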
Primary question: any advice on implementing an unbounded source would be appreciated. I have only recently started digging into Apache Beam.
Additional Context
Existing IO connectors in Beam are mostly geared towards standard services such as Kafka and Pub/Sub. Adding support for sockets would cater to users dealing with more specialized or ad-hoc data sources.
Current approach to read from socket
Pipeline Example
The current pipeline stalls when combined with a window and aggregation.
Issue Priority
Priority: 3 (nice-to-have improvement)
Issue Components