LauJohansson opened 8 months ago
@LauJohansson I thought the same when testing it for the first time. Have a look at `def partitions()` in the `DataSourceReader` class. That will answer it.
https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource.py
I also did a sample implementation here: https://github.com/allisonwang-db/pyspark-data-sources/pull/4
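To illustrate the point about `partitions()`: a minimal, pure-Python sketch of the `DataSourceReader` contract is below. It deliberately does not import `pyspark`; the class and method names only mirror `pyspark.sql.datasource`, and the paginated-API slicing is a hypothetical example. The idea is that Spark calls `partitions()` once on the driver, then schedules one `read(partition)` task per returned partition on the executors, so a single-reader source is distributed exactly to the degree that `partitions()` splits it up.

```python
# Sketch of the DataSourceReader contract (names mirror
# pyspark.sql.datasource; pyspark itself is NOT imported here).
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class InputPartition:
    # Each partition describes one independent slice of the source,
    # e.g. one page of a paginated API (hypothetical example).
    page: int


class GithubLikeReader:
    def partitions(self) -> List[InputPartition]:
        # Called once on the driver. Returning N partitions lets Spark
        # schedule N parallel read tasks across the worker nodes.
        # Returning a single partition means the read is NOT distributed.
        return [InputPartition(page=p) for p in range(4)]

    def read(self, partition: InputPartition) -> Iterator[Tuple[int, str]]:
        # Called once per partition, on an executor. A real reader would
        # fetch that page from the API here.
        yield (partition.page, f"rows-for-page-{partition.page}")


reader = GithubLikeReader()
# Simulate what Spark does: fan read() out over all partitions.
rows = [row for part in reader.partitions() for row in reader.read(part)]
print(rows)
```

So whether the GitHub read is parallel comes down to how many `InputPartition`s the reader returns, not to the API call itself.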
In your GitHub source example, the reader reads from the GitHub API. Does this read utilize Spark to distribute the work across multiple worker nodes?