CharlesNaylor opened this issue 4 years ago
What error is raised when you use SmvCSVInputFile?
It fails to find the connection I've defined. I'm not sure what the underlying problem is; the same connection definition works fine with SmvMultiCSVInputFiles.
```
  File "/opt/conda/default/lib/python3.6/site-packages/smv/iomod/base.py", line 57, in get_connection
    conn = self.smvApp.get_connection_by_name(name)
  File "/opt/conda/default/lib/python3.6/site-packages/smv/smvapp.py", line 303, in get_connection_by_name
    return ConnClass(name, props)
TypeError: 'NoneType' object is not callable
```
Full log below: stdout.txt
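The traceback suggests that the lookup for the connection's class returned `None`, which was then called as a constructor. A minimal sketch of that failure mode (the `registry` and `make_connection` names below are hypothetical illustrations, not SMV's actual internals):

```python
class JdbcConnection:
    """Stand-in for a registered connection class."""
    def __init__(self, name, props):
        self.name, self.props = name, props

# Hypothetical registry mapping connection type -> class.
registry = {"jdbc": JdbcConnection}

def make_connection(conn_type, name, props):
    ConnClass = registry.get(conn_type)  # returns None for unknown types
    return ConnClass(name, props)        # calling None raises TypeError

try:
    make_connection("hdfs", "my_conn", {})
except TypeError as e:
    print(e)  # 'NoneType' object is not callable
```

If SMV resolves connection classes this way, a connection whose type (or name) isn't registered would produce exactly this error rather than a clearer "connection not found" message.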
Hi,
I have a large CSV file exported from Google BigQuery onto Google Cloud Storage. It gets broken into thousands of subfiles within the bucket.
If I attempt to read the file in pyspark with SmvCSVInputFile, I get an FQN error, presumably because it cannot parse the CSV path ('gs://...My_CSV.csv', a bucket prefix containing some 8,000 individual files).
If I read it using SmvMultiCSVInputFiles, it discovers the underlying files and then reads them in a loop.
This uses less than 1% of my cluster for 11 hours, since it reads each file sequentially. Reading the same data directly with a SparkSession takes minutes. I don't seem to be able to control BigQuery's export partitioning to reduce the number of files.
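The gap between the two approaches can be illustrated outside Spark with a stdlib sketch (file names and helpers below are illustrative, not SMV's code): a per-file loop touches one part file at a time, while a glob-style read dispatches many files concurrently.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def count_rows(path):
    """Count rows in one CSV part file."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))

def count_sequential(paths):
    # What a per-file loop does: one file at a time.
    return sum(count_rows(p) for p in paths)

def count_parallel(paths, workers=8):
    # What a glob-based read does conceptually: many files at once.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(count_rows, paths))

# A handful of small part files standing in for the ~8,000 exports.
tmp = Path(tempfile.mkdtemp())
for i in range(20):
    with open(tmp / f"part-{i:05d}.csv", "w", newline="") as f:
        csv.writer(f).writerows([[i, j] for j in range(10)])

paths = sorted(tmp.glob("part-*.csv"))
assert count_sequential(paths) == count_parallel(paths) == 200
```

With 8,000 files, the sequential variant is bounded by per-file latency, which matches the under-utilized cluster described above; a wildcard read lets all executors pull files in parallel.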
Is there any way to parallelize this operation?