aws-solutions / aws-data-lake-solution

A deployable reference implementation that addresses common pain points in designing data lake architectures. It automatically configures the core AWS services needed to tag, search, share, and govern specific subsets of data across a business or with other external businesses.
https://aws.amazon.com/solutions/implementations/data-lake-solution/
Apache License 2.0

How do I store Parquet files? #10

Closed: shry15harsh closed this issue 6 years ago

shry15harsh commented 7 years ago

There is no way to upload a folder or to link an existing S3 folder as Data Lake Package content. My data is in Parquet format. How do I handle this kind of partitioned format if I want to use the Data Lake solution?

aureq commented 6 years ago

@shry15harsh You can't specify a "folder", since S3 doesn't really have folders. However, if you need to reference multiple files in a given package, you need to create your own manifest file and submit it. There's documentation about this: http://docs.awssolutionsbuilder.com/data-lake/user-guide/working-with-packages/
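For illustration only, a manifest that enumerates multiple S3 objects might be assembled like the sketch below. Every field name here (`fileLocations`, `bucket`, `key`) is an assumption made for the example; the actual schema is defined in the linked user guide.

```python
# Hedged sketch: the field names below are placeholders, NOT the solution's
# confirmed manifest schema. Consult the linked user guide for the real format.
# The point is that a single manifest enumerates every S3 object in a package.
import json

manifest = {
    "fileLocations": [  # hypothetical key name
        {"bucket": "my-bucket", "key": "parquet_folder/part-00000.snappy.parquet"},
        {"bucket": "my-bucket", "key": "parquet_folder/part-00001.snappy.parquet"},
    ]
}

with open("package-manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```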

As for the Parquet format, the data lake solution doesn't care about the file format you use.

shry15harsh commented 6 years ago

Thanks. But a Parquet dataset is itself a folder containing multiple files, because Parquet is a splittable format. How do I store this type of data in a Data Lake Package? I want to use this data in my big data processing frameworks, so I need a splittable data format. The sketch below shows what produces that folder-of-files layout.
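A minimal PySpark sketch (the bucket and path names are made up) of why a Parquet write produces a directory rather than a single file: each task writes its own part file, so the output stays splittable for parallel reads.

```python
# Minimal PySpark sketch; "my-bucket" and the path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The write below creates a *directory*, not a single file, e.g.:
#   s3://my-bucket/parquet_folder/_SUCCESS
#   s3://my-bucket/parquet_folder/part-00000-<uuid>.snappy.parquet
#   s3://my-bucket/parquet_folder/part-00001-<uuid>.snappy.parquet
# (assumes the cluster is configured with S3 credentials and an S3 connector)
df.write.mode("overwrite").parquet("s3://my-bucket/parquet_folder/")
```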

hvital commented 6 years ago

Version 2.0 has been published, and you can now link existing folder content.

When importing data, you must specify a single include path. For example, if you have bucket-name/folder-name/file-name.ext and want to include all objects in that bucket, specify just bucket-name as the include path.
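Because S3 has no real folders, an include path is just a key prefix. Here is a small boto3 sketch (the bucket and prefix names are illustrative) showing which objects such a path would match:

```python
# Small boto3 sketch; "bucket-name" and "parquet_folder/" are illustrative.
# An include path like bucket-name/parquet_folder/ simply matches every
# object whose key starts with that prefix.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucket-name", Prefix="parquet_folder/")
for obj in resp.get("Contents", []):
    print(obj["Key"])  # e.g. parquet_folder/part-00000-<uuid>.snappy.parquet
```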

Does adding bucket-name/parquet_folder/ help in your case?