Find a more efficient method to pull data from Minio. Data is currently pulled with the urllib Python library (https://docs.python.org/3/library/urllib.html), which sends a request to a URL and downloads the entire CSV data file. This is not the most efficient way of doing it.
Alternatives to look into:
Use the S3 protocol with the SparkContext method textFile(). Due to the way S3 is handled by Spark, this may prove troublesome to set up.
Use the Python library for Minio interaction: https://github.com/minio/minio-py . This library is made specifically for interacting with Minio, so it may prove more efficient.
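
For reference, a rough sketch of what the S3 route might involve. It assumes the Hadoop s3a connector (the hadoop-aws jar and its AWS SDK dependency) is on the Spark classpath, which is probably where most of the setup trouble would come from. The helper name `minio_s3a_conf`, the endpoint, and the credentials are all made up for illustration; only the `fs.s3a.*` property names come from Hadoop.

```python
def minio_s3a_conf(endpoint, access_key, secret_key, use_ssl=False):
    """Build Spark properties that point the s3a connector at a Minio server.

    Hypothetical helper: Spark copies every "spark.hadoop.*" property into
    the Hadoop configuration, which is where the s3a connector reads it.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Minio is usually addressed as host/bucket, not bucket.host.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


# Possible usage (needs pyspark and the s3a jars; all values are placeholders):
#
#   from pyspark import SparkConf, SparkContext
#   conf = SparkConf().setAll(
#       minio_s3a_conf("http://minio:9000", "ACCESS", "SECRET").items())
#   sc = SparkContext(conf=conf)
#   lines = sc.textFile("s3a://some-bucket/some-file.csv")
```

With this in place Spark would read the object directly and split it across executors, instead of the driver downloading the whole file first.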
Update: we decided to use the Minio API (see https://github.com/benchflow/data-transformers/issues/56).
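
A minimal sketch of how the Minio API could be used to read a CSV object in chunks rather than downloading it in one piece. It is not necessarily how the linked issue implements it. It assumes a minio-py client whose `get_object()` returns a response that supports `stream()`, `close()` and `release_conn()`; the function name `iter_csv_rows`, the bucket and object names, and the credentials are placeholders.

```python
import codecs
import csv


def iter_csv_rows(client, bucket, key, chunk_size=32 * 1024, encoding="utf-8"):
    """Yield parsed CSV rows from a Minio object, reading it chunk by chunk."""
    response = client.get_object(bucket, key)
    try:
        # Incremental decoder copes with multi-byte characters split across chunks.
        decoder = codecs.getincrementaldecoder(encoding)()

        def lines():
            pending = ""
            for chunk in response.stream(chunk_size):
                pending += decoder.decode(chunk)
                parts = pending.split("\n")
                pending = parts.pop()  # last piece may be an incomplete line
                for line in parts:
                    yield line + "\n"
            pending += decoder.decode(b"", final=True)
            if pending:
                yield pending

        for row in csv.reader(lines()):
            yield row
    finally:
        # Return the connection to the pool once the caller is done.
        response.close()
        response.release_conn()


# Possible usage (needs the minio package; all values are placeholders):
#
#   from minio import Minio
#   client = Minio("minio:9000", access_key="ACCESS", secret_key="SECRET",
#                  secure=False)
#   for row in iter_csv_rows(client, "some-bucket", "some-file.csv"):
#       ...
```

Because the function is a generator, only one chunk and one partial line are held in memory at a time, so large files do not need to fit in memory before processing starts.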