CODAIT / stocator

Stocator is a high-performing connector to object storage for Apache Spark, achieving performance by leveraging object storage semantics.
Apache License 2.0

Stocator should create folder names with a trailing '/' in IBM COS #210

Open mariobriggs opened 5 years ago

mariobriggs commented 5 years ago

I am using Stocator via Spark to write a DataFrame to IBM COS:

df.write.parquet("cos://mybucket.service/tpcds/call_center")

In the above call, Stocator creates the folder 'call_center' in IBM COS. However, Stocator does not create the folder name with a trailing '/', and as a result this breaks reading of these IBM COS folders with other tools such as Alluxio, CyberDuck, etc.

Below is an example of the CyberDuck UI. Notice that the folder 'call_center' is also listed as a 0-byte file.

[image: CyberDuck UI screenshot showing 'call_center' listed as a 0-byte file]

Browsing through the Stocator code, I see that the code to create the folder name with a trailing '/' is commented out; using a build where it is uncommented solved the issue.

Looking forward to a fix.

gilv commented 5 years ago

@mariobriggs I will handle this. Thanks

kozchris commented 5 years ago

This issue is also breaking our Apache Spark reads of part files. The Apache Spark writes of the part files create a 0-byte directory file with no trailing slash. When we add the trailing slash to the directory file that gets created, the reads work again.

kozchris commented 5 years ago

@gilv how is the progress coming on a fix?

rpatel17 commented 5 years ago

I am also seeing this as a problem in our project. Thanks @gilv for looking into it.

robin-sun commented 3 years ago

Is this issue fixed now after 16 months? I am still seeing an empty file being created.

gilv commented 3 years ago

@robin-sun why is an empty file a problem? If you write a "foo" file with Stocator via Spark, the result will be:

foo
foo/_SUCCESS
foo/part-1-xx
foo/part-2-xx
etc.

You can now use Spark to read "foo" again and it all works. If you list object storage via the CLI, you will see the empty file "foo" and "foo/_SUCCESS". Why is this a problem?

robin-sun commented 3 years ago

Hi Gil, this is causing errors when downloading the whole parent folder to Windows, as Windows doesn't support a file and a folder with the same name. I will have to download the output folders one by one.

But I guess the real question is: why do we need an empty file if it is not used or useful at all?
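
To make the name clash concrete, here is a minimal sketch (not Stocator code) that finds keys which would have to be both a file and a folder once downloaded to a local filesystem. The function name `file_folder_collisions` and the sample keys are illustrative assumptions; the keys mirror the "foo" layout described earlier in the thread.

```python
def file_folder_collisions(keys):
    """Return keys that exist as objects and are also a parent path of other keys.

    Such a key needs to be a file and a directory at the same time when
    mirrored to a local filesystem, which is not possible.
    """
    key_set = set(keys)
    collisions = set()
    for key in keys:
        parts = key.split("/")
        # Check every proper parent path of this key, e.g. "a" and "a/b" for "a/b/c".
        for i in range(1, len(parts)):
            parent = "/".join(parts[:i])
            if parent in key_set:
                collisions.add(parent)
    return sorted(collisions)


# Layout without a trailing slash on the marker object: "foo" clashes.
print(file_folder_collisions(["foo", "foo/_SUCCESS", "foo/part-1-xx"]))  # ['foo']

# Hypothetical layout where the marker is named "foo/": no clash.
print(file_folder_collisions(["foo/", "foo/_SUCCESS", "foo/part-1-xx"]))  # []
```

In the second call the marker key ends in '/', so it never matches the parent path "foo" of the other keys, which is why a trailing slash avoids the clash.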

mariobriggs commented 3 years ago

I think the real problem is this: if you write to COS using Stocator, then all your reader clients are forced to use Stocator as well. The latter is not under your control and is therefore problematic.

Thanks, Mario

robin-sun commented 3 years ago

Hi Mario/Gil

Could you help me understand why we need an empty file there?

gilv commented 3 years ago

@mariobriggs @robin-sun using an empty object to simulate a folder in object storage was not invented by Stocator; it is used in other Big Data systems as well. This is the easiest way for the Hadoop ecosystem to mark a "folder". So compatibility with Windows does indeed suffer with this approach. We need the empty object because it carries Stocator-specific metadata. If you just need to download all the data created by Stocator to Windows, just write a script that ignores empty objects.
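
As a rough illustration of such a script, the sketch below filters a listing down to the non-empty objects. It assumes you already have the listing as (key, size in bytes) pairs from whatever tool or SDK you use; the function name `non_empty_keys` and the sample sizes are hypothetical, and the actual download step is left out.

```python
def non_empty_keys(listing):
    """Return the keys of all objects whose size is greater than zero.

    `listing` is an iterable of (key, size_in_bytes) pairs. Zero-byte
    objects, including the folder marker, are skipped.
    """
    return [key for key, size in listing if size > 0]


# Sample listing modeled on the "foo" layout; sizes are made up.
listing = [
    ("foo", 0),
    ("foo/_SUCCESS", 0),
    ("foo/part-1-xx", 1024),
    ("foo/part-2-xx", 2048),
]

for key in non_empty_keys(listing):
    print(key)
```

Note that this also skips "foo/_SUCCESS", since it is empty too. If a downstream reader expects that marker, it would need to be kept explicitly.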