Azure / azure-event-hubs-spark

Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Apache License 2.0

Pyspark documentation error #595

Open laurencewells opened 3 years ago

laurencewells commented 3 years ago

Hi, I'm logging this issue for anyone else who runs into the same problem.

For the Pyspark documentation here: azure-event-hubs-spark/docs/PySpark/structured-streaming-pyspark.md

It states that the default starting position is the start of the stream:

[screenshot of the documentation stating the default starting position is the start of the stream]

The behaviour we saw is that this is not the case: the starting position has to be set explicitly to stream from the start. If nothing is specified, the default is the end of the stream.
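For reference, this is roughly the explicit configuration we ended up with, following the JSON EventPosition format in that same doc (a minimal sketch: the connection string is a placeholder, and it assumes `spark` and `sc` are already available in the session):

```python
import json

# Placeholder connection string - replace with your own namespace, key, and event hub
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>;EntityPath=<event-hub>"

eh_conf = {
    # The connector expects the connection string to be encrypted
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
}

# Offset "-1" means start of stream. If "eventhubs.startingPosition" is omitted,
# the stream starts from the end of the stream instead of the beginning.
starting_position = {
    "offset": "-1",
    "seqNo": -1,           # not used when offset is set
    "enqueuedTime": None,  # not used when offset is set
    "isInclusive": True,
}
eh_conf["eventhubs.startingPosition"] = json.dumps(starting_position)

df = (
    spark.readStream
         .format("eventhubs")
         .options(**eh_conf)
         .load()
)
```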

nyaghma commented 3 years ago

Thanks for pointing out the error in the documentation. I'll update the pyspark doc to fix this.

stefanprisca commented 3 years ago

Hi,

We ran into the same surprise yesterday and did not understand why the job kept starting at the end of the stream. This led to wasted time debugging and building workarounds. Could this be prioritized? It is really confusing for PySpark users.

Thanks! Best Regards, Stefan Prisca.

laurencewells commented 3 years ago

@stefanprisca Same here, we sank a good hour or two into debugging what was happening.