databricks / koalas

Koalas: pandas API on Apache Spark

Using read_csv within Databricks to open a local file #2178

Closed · ChuckConnell closed this issue 2 years ago

ChuckConnell commented 2 years ago

Discussed in https://github.com/databricks/koalas/discussions/2177

Originally posted by **ChuckConnell** June 29, 2021

I have imported some code from pandas to Databricks/Koalas. My read_csv statement does not work because the file is local to my computer, not within the Databricks file system (DBFS). I feel like I am missing something obvious. I want my pandas code to work on Databricks/Koalas with minor changes. I know I can use the Databricks GUI point-and-click to create a DBFS table, then make a DataFrame from the table, but that is not programmatic and is a poor solution if I have hundreds of local files.

```python
import databricks.koalas as ks
myDF = ks.read_csv("C:/Users/chuck/Desktop/County.csv")
```

```
java.io.IOException: No FileSystem for scheme: C
```
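
To clarify the error above (a minimal sketch, not part of the original report; the DBFS path is a hypothetical example): Spark parses everything before the first `:` as a URL scheme, so `C:/Users/...` is read as the unknown scheme `C`, and in any case the cluster cannot see the laptop's disk. Once the file has been uploaded to DBFS, the same call works against a DBFS path.

```python
import databricks.koalas as ks

# Fails on a Databricks cluster: "C:" is parsed as a filesystem scheme,
# and the Spark driver has no access to the local laptop anyway.
# myDF = ks.read_csv("C:/Users/chuck/Desktop/County.csv")

# Works once the file is in DBFS (hypothetical example path):
myDF = ks.read_csv("dbfs:/FileStore/tables/County.csv")
```
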
itholic commented 2 years ago

Seems like you cannot read directly from a local file path without first uploading your local data into the Databricks workspace.

(I think "local file path" here doesn't mean a path on your own computer; it means a path on the local file system of the Databricks cluster.)

Could you take a look at the similar discussion at https://forums.databricks.com/questions/22008/how-to-read-local-files-into-dataframe-or-temp-tab.html ?

I'm not sure, but it's worth checking out S3 or Blob storage if you want to automatically upload your "computer data" into the Databricks workspace (also known as DBFS).
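
A minimal sketch of that suggestion, assuming the boto3 library, a hypothetical bucket name, and a cluster that already has credentials configured for the bucket (none of these details come from the thread):

```python
import boto3
import databricks.koalas as ks

# Run on the local computer: push the file to S3 (bucket and key are hypothetical).
s3 = boto3.client("s3")
s3.upload_file("C:/Users/chuck/Desktop/County.csv", "my-bucket", "raw/County.csv")

# Run in a Databricks notebook: read it back with Koalas via the s3a:// scheme,
# assuming the cluster is configured with access to the bucket.
myDF = ks.read_csv("s3a://my-bucket/raw/County.csv")
```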

ChuckConnell commented 2 years ago

This is a big problem. Many pandas programs start with read_csv() of a local file on that computer, not in cloud storage somewhere.

The lack of this feature in Koalas is a significant barrier to its adoption by pandas users.

Yes, if there are only a few files, you can manually add them as Databricks tables using the point-and-click GUI method. And yes, you can put the files into S3 or Azure blob storage and programmatically access them from Databricks. But those are awkward, non-intuitive solutions for programmers who are accustomed to opening local files with read_csv().

itholic commented 2 years ago

Yeah, I agree that it's a pretty annoying process if you have to upload your data into the Databricks file system one file at a time.

But I think you can upload a bunch of files at once via the Databricks GUI with the following steps:

  1. Click "Data" in the Databricks GUI.


  2. Click "DBFS" (not "Database Tables"), then click "Upload".


  3. Drag & drop the folder that contains the files, rather than clicking to browse. You don't need to upload every single file one by one.


  4. Then you can see the resulting path for each file. You can copy these paths into your code (see the sketch after these steps).


Or you can drag as many individual files as you want and drop them onto the browser; that works as well.
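
A minimal sketch of using the uploaded paths from a notebook (the folder and file names below are hypothetical examples, not actual paths from the thread):

```python
import databricks.koalas as ks

# Read a single uploaded file by its DBFS path (hypothetical names).
county = ks.read_csv("dbfs:/FileStore/tables/my_folder/County.csv")

# Spark can also read every CSV in the uploaded folder at once,
# assuming the files share the same schema.
all_files = ks.read_csv("dbfs:/FileStore/tables/my_folder/")
```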

ChuckConnell commented 2 years ago

Ah! Thank you. I will try this. So one point-and-click operation would be required to get all the files into DBFS. Then you could programmatically create DataFrames from the DBFS files. That is definitely an improvement.

It does not help the case where there are new files every day (or hour). But it is a nice feature.

Will test it. Thank you.

ChuckConnell commented 2 years ago

@itholic What version of Databricks are you using on what platform? I just tested on the Community edition and do not see the DBFS option you showed.

itholic commented 2 years ago

@ChuckConnell Oh, yeah. Seems like the Community edition doesn't support the DBFS upload option that mine does.

But you can still upload local files with one point-and-click in the Community edition:

  1. Click "Data" and "Create Table"


  2. Just drag & drop multiple files onto the browser.


  3. Check the file paths shown for the uploaded files.


  4. Use the given paths in the notebook to load the data with read_csv (see the sketch below).

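A minimal sketch of that last step, assuming the "Create Table" upload placed the file under /FileStore/tables/ (the file name is a hypothetical example of a path copied from the upload page):

```python
import databricks.koalas as ks

# Path copied from the upload page; on Databricks, a bare /FileStore/... path
# resolves against DBFS.
myDF = ks.read_csv("/FileStore/tables/County.csv")
myDF.head()
```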

ChuckConnell commented 2 years ago

@itholic Thank you again. I will test it.

Testing now... I noticed that you can specify a new folder within FileStore/tables/ so that your uploaded files never conflict with any existing names.

This seems to be a pretty good solution for processing static files. It requires human intervention, but is fairly easy and can handle many files at once. I presume the Python code can even loop over FileStore/tables/my_dir/ to find the file names after they are uploaded.
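
A minimal sketch of that loop, assuming it runs in a Databricks notebook where `dbutils` is predefined (the folder name my_dir is illustrative, and the final concat assumes the files share the same columns):

```python
import databricks.koalas as ks

# List the uploaded files in DBFS and load each CSV into a Koalas DataFrame.
frames = []
for info in dbutils.fs.ls("dbfs:/FileStore/tables/my_dir/"):
    if info.path.endswith(".csv"):
        frames.append(ks.read_csv(info.path))

# Combine them, assuming a shared schema.
all_data = ks.concat(frames)
```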

This approach does not help for scenarios where there are new files every day/hour/minute, which is common in data pipelines, but in this case the files are probably in a cloud drive (not local) anyway.

itholic commented 2 years ago

Thanks for the report, @ChuckConnell !

Closing this for now, since it's really a limitation of the Databricks platform itself rather than of Koalas.