ChuckConnell closed this issue 2 years ago
Seems like you cannot read directly from a local file path without uploading your local data into the Databricks workspace.
(I think "local file path" here doesn't mean a path on your computer; rather, it means a path on the local file system in Databricks.)
Could you refer to the similar discussion at https://forums.databricks.com/questions/22008/how-to-read-local-files-into-dataframe-or-temp-tab.html ?
I'm not sure, but it's worth checking out S3 or Blob storage if you want to automatically upload your "computer data" into the Databricks workspace (also known as DBFS).
This is a big problem. Many pandas programs start with read_csv() of a local file on that computer, not in cloud storage somewhere.
The lack of this feature in Koalas is a significant barrier to its adoption by pandas users.
Yes, if there are only a few files, you can manually add them as Databricks tables using the point-and-click GUI method. And yes, you can put the files into S3 or Azure Blob Storage and programmatically access them from Databricks. But those are awkward, non-intuitive solutions for programmers who are accustomed to opening local files with read_csv().
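To make the contrast concrete, here is the everyday pandas idiom, plus one possible workaround: load the file locally with pandas, then convert to koalas rather than uploading the raw file. The file name and contents below are made up for illustration, and the koalas step is an untested sketch that assumes a running Spark cluster.

```python
import pandas as pd

# Create a small CSV to stand in for a file on this computer
# (hypothetical name and contents, purely for illustration).
with open("sales.csv", "w") as f:
    f.write("region,amount\neast,100\nwest,250\n")

# The everyday pandas idiom: read a local file straight into a DataFrame.
pdf = pd.read_csv("sales.csv")

# Possible workaround (sketch, not run here): convert the locally read
# pandas DataFrame to a koalas DataFrame instead of uploading the file.
# import databricks.koalas as ks
# kdf = ks.from_pandas(pdf)
```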
Yeah, I agree that it's a pretty annoying process if you have to upload your data into the Databricks file system one by one.
But I think you can upload a bunch of files at once via the Databricks GUI with the following steps:
Or you can drag as many files as you want and just drop them on the browser; that works as well.
Ah! Thank you. I will try this. So one point-and-click would be required, to get all the files into DBFS. Then you could programmatically create DataFrames from the DBFS files. That is definitely an improvement.
It does not help the case where there are new files every day (or hour). But it is a nice feature.
Will test it. Thank you.
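For reference, once a file has landed in DBFS via the GUI, reading it back programmatically is a one-liner. The sketch below uses a temporary directory as a stand-in for /FileStore/tables/ (the file name is hypothetical); on a Databricks cluster, DBFS is also mounted at /dbfs/, so plain pandas can read uploaded files through that path.

```python
import os
import tempfile
import pandas as pd

# Stand-in for /dbfs/FileStore/tables/ -- on a Databricks cluster the
# DBFS root is mounted at /dbfs/, so ordinary Python I/O works there.
tables_dir = tempfile.mkdtemp()
path = os.path.join(tables_dir, "mydata.csv")  # hypothetical uploaded file
with open(path, "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# Plain pandas can read the uploaded file through the mount point.
pdf = pd.read_csv(path)

# On a real cluster, koalas reads the DBFS path directly (sketch):
# import databricks.koalas as ks
# kdf = ks.read_csv("/FileStore/tables/mydata.csv")
```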
@itholic What version of Databricks are you using on what platform? I just tested on the Community edition and do not see the DBFS option you showed.
@ChuckConnell Oh, yeah. It seems the Community edition doesn't support the DBFS upload option, same as mine.
But you can still upload a local file with one point-and-click in the Community edition.
@itholic Thank you again. I will test it.
Testing now... I noticed that you can specify a new folder within FileStore/tables/ so that your uploaded files never conflict with any existing names.
This seems to be a pretty good solution for processing static files. It requires human intervention, but is fairly easy and can handle many files at once. I presume the Python code can even loop over FileStore/tables/my_dir/ to find the file names after they are uploaded.
This approach does not help for scenarios where there are new files every day/hour/minute, which is common in data pipelines, but in this case the files are probably in a cloud drive (not local) anyway.
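The looping idea above can be sketched with ordinary Python, since DBFS is mounted at /dbfs/ on the cluster. The directory and file names below are made up for illustration, with a temporary directory standing in for /dbfs/FileStore/tables/my_dir/.

```python
import glob
import os
import tempfile
import pandas as pd

# Stand-in for /dbfs/FileStore/tables/my_dir/ on a real cluster
# (directory and file names are hypothetical).
upload_dir = tempfile.mkdtemp()
for name in ("jan.csv", "feb.csv"):
    with open(os.path.join(upload_dir, name), "w") as f:
        f.write("x\n1\n")

# Discover whatever was uploaded, then read each file into a frame.
paths = sorted(glob.glob(os.path.join(upload_dir, "*.csv")))
frames = [pd.read_csv(p) for p in paths]
combined = pd.concat(frames, ignore_index=True)

# On Databricks, the equivalent listing could use dbutils (sketch):
# files = dbutils.fs.ls("dbfs:/FileStore/tables/my_dir/")
```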
Thanks for the report, @ChuckConnell !
Closing this for now, since it's actually more a limitation of Databricks itself than of Koalas.
Discussed in https://github.com/databricks/koalas/discussions/2177