Question on distributing the read to multiple workers

allisonwang-db / pyspark-data-sources

Custom PySpark Data Sources

https://allisonwang-db.github.io/pyspark-data-sources/

Apache License 2.0


Question on distributing the read to multiple workers #2

Open LauJohansson opened 8 months ago

LauJohansson commented 8 months ago

In your GitHub data source example, the reader reads from the GitHub API. Does this read use Spark to distribute the work across multiple worker nodes?

datanikkthegreek commented 2 months ago

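@LauJohansson I wondered the same thing when I first tested it. Have a look at the partitions() method of the DataSourceReader class; it answers the question. Spark calls read() once for each partition that partitions() returns, so the read is only spread across workers when the reader defines more than one partition.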

https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource.py

I also did a sample implementation here: https://github.com/allisonwang-db/pyspark-data-sources/pull/4
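
To make the partitions() point concrete, here is a minimal sketch of a reader that splits a paged API into page ranges. It is not the code from the GitHub example or from the PR above: the names PagedApiDataSource, PagedApiReader, PageRangePartition and the options total_pages and num_partitions are all hypothetical, and read() yields placeholder rows where a real reader would make HTTP calls. The fallback classes exist only so the sketch can run without PySpark 4.0+ installed.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
except ImportError:
    # Stand-ins (assumption: used only when PySpark 4.0+ is not available).
    class DataSource:
        def __init__(self, options=None):
            self.options = options or {}

    class DataSourceReader:
        pass

    class InputPartition:
        def __init__(self, value):
            self.value = value


class PageRangePartition(InputPartition):
    """Hypothetical partition: an inclusive range of API pages."""

    def __init__(self, first_page, last_page):
        self.first_page = first_page
        self.last_page = last_page


class PagedApiReader(DataSourceReader):
    def __init__(self, options):
        # Hypothetical options; values arrive as strings from .option(...).
        self.total_pages = int(options.get("total_pages", 8))
        self.num_partitions = int(options.get("num_partitions", 4))

    def partitions(self):
        # Called once on the driver. Each returned partition becomes one task.
        step = max(1, -(-self.total_pages // self.num_partitions))  # ceiling division
        return [
            PageRangePartition(start, min(start + step - 1, self.total_pages))
            for start in range(1, self.total_pages + 1, step)
        ]

    def read(self, partition):
        # Called once per partition on an executor. A real reader would
        # fetch each page over HTTP here; this sketch yields placeholder rows.
        for page in range(partition.first_page, partition.last_page + 1):
            yield (page, f"payload-{page}")


class PagedApiDataSource(DataSource):
    @classmethod
    def name(cls):
        return "paged_api"

    def schema(self):
        return "page int, payload string"

    def reader(self, schema):
        return PagedApiReader(self.options)
```

With Spark 4.0 or later, this would be registered with spark.dataSource.register(PagedApiDataSource) and read with spark.read.format("paged_api").option("num_partitions", "4").load(). If partitions() is not overridden, the reader gets a single default partition, so the whole read runs as one task.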