Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
6.83k stars, 2.93k forks

S3 '_$folder$' files and Presto/Hive on read from Alluxio #8602

Closed (safqwf closed this issue 5 years ago)

safqwf commented 5 years ago

Is your feature request related to a problem? Please describe.

Assume we have a directory of Parquet files in S3 called mydir. Alluxio is set up in an EMR cluster with this directory as the UFS, with read-only permission. The data in this directory is generated by S3 Hadoop-related components that create $folder$ files in the directory. These $folder$ files should not be deleted.

Presto and Hive in the EMR cluster query a table with LOCATION 'alluxio://master_hostname:port/mydir'. When I try to query the data with Presto or Hive, I get this error:

Query 20190321_132537_00026_4enx4 failed: Error opening Hive split alluxio://master_hostname:port/year=2019/month=01_$folder$ (offset=0, length=0): alluxio://master_hostname:port/year=2019/month=01_$folder$ is not a valid Parquet File

Neither Hive nor Alluxio has an option to ignore specific files based on a regex, and these files shouldn't be deleted.

Describe the solution you'd like

Add a configuration option to ignore files based on a regex, or add one to ignore _$folder$ files.

Describe alternatives you've considered

I couldn't find any.

aaudiber commented 5 years ago

@roman-io Alluxio has a similar concept of using empty placeholder objects to represent directories. Instead of _$folder$ we use / by default. The suffix is controlled by the alluxio.underfs.s3a.directory.suffix property. Can you try setting the property to _$folder$? Then Alluxio will understand that month=01_$folder$ is a folder, not a file.
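As a rough illustration of the placeholder idea described above, the following Python sketch separates directory marker objects from data files in a key listing. It only demonstrates the concept and is not Alluxio's implementation; the function name split_listing and the object name part-00000.parquet are hypothetical.

```python
# Hypothetical helper: illustrates the marker-suffix idea only,
# not how Alluxio actually lists an S3 under storage.
DIRECTORY_SUFFIX = "_$folder$"  # the suffix value suggested in this comment


def split_listing(keys, suffix=DIRECTORY_SUFFIX):
    """Separate directory placeholder objects from data files."""
    directories, files = [], []
    for key in keys:
        if key.endswith(suffix):
            # Strip the marker so "month=01_$folder$" is reported as "month=01".
            directories.append(key[: -len(suffix)])
        else:
            files.append(key)
    return directories, files


# Sample keys; the Parquet file name is made up for the example.
keys = [
    "mydir/year=2019_$folder$",
    "mydir/year=2019/month=01_$folder$",
    "mydir/year=2019/month=01/part-00000.parquet",
]
directories, files = split_listing(keys)
print(directories)
print(files)
```

With the suffix recognised, only the Parquet object is left in the file list, so a query engine would never be handed a marker object as a split.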

To update the property, add the following line to alluxio-site.properties on all servers:

alluxio.underfs.s3a.directory.suffix=_$folder$

Then restart the cluster.
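If the change has to be rolled out to several servers, a small script can apply it without duplicating the entry. The sketch below is one possible approach, not an official Alluxio tool: the helper set_property is hypothetical, and the demo writes to a temporary file with sample content rather than a real alluxio-site.properties, whose location depends on the installation.

```python
import tempfile
from pathlib import Path


def set_property(path, key, value):
    """Set key=value in a properties file, replacing any existing entry for key."""
    path = Path(path)
    lines = path.read_text().splitlines() if path.exists() else []
    # Drop earlier definitions of the same key; other lines are kept as they are.
    kept = [line for line in lines if line.split("=", 1)[0].strip() != key]
    kept.append(f"{key}={value}")
    path.write_text("\n".join(kept) + "\n")


# Demo against a throwaway file so nothing on the real system is touched.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "alluxio-site.properties"
    site.write_text("alluxio.master.hostname=master_hostname\n")  # sample content
    set_property(site, "alluxio.underfs.s3a.directory.suffix", "_$folder$")
    # Running it a second time must not add a duplicate line.
    set_property(site, "alluxio.underfs.s3a.directory.suffix", "_$folder$")
    result = site.read_text()

print(result)
```

The cluster restart still has to be done separately, as described above.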

apc999 commented 5 years ago

This feature request can already be achieved with existing Alluxio properties. We will close this issue in a few days if you have no further requests.

safqwf commented 5 years ago

It worked. Thanks!