aws-solutions / aws-data-lake-solution

A deployable reference implementation intended to address pain points around conceptualizing data lake architectures that automatically configures the core AWS services necessary to easily tag, search, share, and govern specific subsets of data across a business or with other external businesses.
https://aws.amazon.com/solutions/implementations/data-lake-solution/
Apache License 2.0

Blank Manifest files generated from Cart #42

Open skirk-mpr opened 4 years ago

skirk-mpr commented 4 years ago

We have several data packages within our Data Lake -- a mix of data packages created with manifest files that point to individual files in S3:

{
    "dataStore": [
        {
            "includePath": "s3://my-bucket/test/test.csv"
        },
        {
            "includePath": "s3://my-bucket/test/test2.csv"
        },
        {
            "includePath": "s3://my-bucket/test/test3.csv"
        }
    ]
}

as well as manifests whose include paths point only to a "subfolder" that contains files:

{
    "dataStore": [
        {
            "includePath": "s3://my-bucket/test/"
        }
    ]
}

In both cases the Glue crawlers run successfully. For data packages created with manifests that list each file individually, the individual files appear as tables in the 'Integrations' tab for the data package. For data packages created with manifests that point only to a "subfolder" in the bucket containing multiple files, a single table appears in the Integrations tab. Exploring this table via the Glue link or the Athena query view suggests that it consolidates the records from the three files into a single table, even when the files share only some fields in their schemas and are not identical. Is this expected?

However, our real question is about what happens when we add these two data packages to our cart and generate S3 signed URL manifests: the manifests we get are essentially blank, containing only the following:

{"entries":[]}

beomseoklee commented 4 years ago

@skirk-mpr For your first question, that is expected. The Data Lake solution uses AWS Glue, and since you provided a folder, it crawls every file in that folder.

For your second question, that looks like a bug: the data-lake-datasets table contains s3_key values starting with /. I haven't figured out what causes this yet, but it is not normal.

I'm sorry for the inconvenience; we plan to fix this issue in the next release.

skirk-mpr commented 4 years ago

Thank you for your response and for the clarification -- @beomseoklee!

Just to confirm: when a folder within the S3 bucket is provided as part of the data package manifest, the Glue crawlers will always produce a single table, even if the schemas of the individual files are inconsistent (e.g. they do not share any common header)?

Thank you for confirming that this is in fact a bug! Just out of curiosity, is there a regular release schedule planned for this product? It has been great to experiment with!

beomseoklee commented 4 years ago

@skirk-mpr You are right about AWS Glue.

The root cause of this one is this part: https://github.com/awslabs/aws-data-lake-solution/blob/e1adf3064644db007154fe717ea7716f8163c3a7/source/api/services/manifest/lib/manifest.js#L496-L501

The URL parse includes the leading / in the path, so the data-lake-datasets DynamoDB table stores s3_key with a leading /.
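To illustrate the leading-slash behaviour, here is a minimal sketch using Node's built-in WHATWG `URL` class. The helper name `parseIncludePath` and its return shape are hypothetical and are not taken from the solution's source; whether manifest.js uses this exact parser is an assumption. The point is only that the parsed path of an `s3://` URL begins with `/`, which has to be stripped before it can match a real S3 object key.

```javascript
// Hypothetical helper (not part of the solution): splits an includePath such as
// "s3://my-bucket/test/test.csv" into a bucket name and an object key.
function parseIncludePath(includePath) {
    const parsed = new URL(includePath);
    return {
        bucket: parsed.hostname,
        // The parsed path keeps its leading "/".
        rawKey: parsed.pathname,
        // Same path with any leading "/" removed.
        key: parsed.pathname.replace(/^\/+/, '')
    };
}

const result = parseIncludePath('s3://my-bucket/test/test.csv');
console.log(result.bucket); // my-bucket
console.log(result.rawKey); // /test/test.csv
console.log(result.key);    // test/test.csv
```

S3 treats `/test/test.csv` and `test/test.csv` as two different keys, so an existence check against the raw path will not find the object.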

So I think the simplest workaround is either to remove the leading / in the processManifestEntry function, or to change the code below: https://github.com/awslabs/aws-data-lake-solution/blob/e1adf3064644db007154fe717ea7716f8163c3a7/source/api/services/manifest/lib/manifest.js#L269-L288

to something like

if (items[index].type === 'dataset' || items[index].content_type === 'include-path') {
    let s3Key = items[index].s3_key;
    if (s3Key.startsWith('/')) {
        s3Key = s3Key.slice(1);
    }
    checkObjectExists(items[index].s3_bucket, s3Key, function(err, data) {
        if (data) {
            if (format === 'signed-url') {
                let params = {
                    Bucket: items[index].s3_bucket,
                    Key: s3Key,
                    Expires: expiration
                };
                var _url = s3.getSignedUrl('getObject', params);
                _content.entries.push({
                    url: _url
                });
            } else if (format === 'bucket-key') {
                _content.entries.push({
                    bucket: items[index].s3_bucket,
                    key: s3Key
                });
            }
        }
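For anyone who wants to see why the leading slash produces `{"entries":[]}`, the following is a self-contained sketch that reproduces the effect without AWS. Everything in it is hypothetical: `makeObjectChecker` stands in for the S3 existence check using an in-memory set, and `buildBucketKeyEntries` only mimics the 'bucket-key' branch of the logic above. It is not the solution's actual code.

```javascript
// Hypothetical in-memory stand-in for the S3 existence check.
function makeObjectChecker(existingObjects) {
    const known = new Set(existingObjects);
    return (bucket, key) => known.has(`${bucket}/${key}`);
}

// Strips any leading "/" from a stored s3_key.
function normalizeKey(s3Key) {
    return s3Key.replace(/^\/+/, '');
}

// Mimics the 'bucket-key' manifest format: an entry is added only
// when the object is found under the key that is looked up.
function buildBucketKeyEntries(items, objectExists, normalize) {
    const entries = [];
    for (const item of items) {
        const key = normalize ? normalizeKey(item.s3_key) : item.s3_key;
        if (objectExists(item.s3_bucket, key)) {
            entries.push({ bucket: item.s3_bucket, key: key });
        }
    }
    return { entries: entries };
}

// The object exists in the bucket as "test/test.csv" ...
const objectExists = makeObjectChecker(['my-bucket/test/test.csv']);
// ... but the stored record carries a leading "/".
const items = [{ s3_bucket: 'my-bucket', s3_key: '/test/test.csv' }];

console.log(JSON.stringify(buildBucketKeyEntries(items, objectExists, false)));
console.log(JSON.stringify(buildBucketKeyEntries(items, objectExists, true)));
```

Without normalization the lookup misses and the manifest is empty; with the leading slash stripped, the entry appears.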

This bug is currently in our backlog, but we haven't scheduled the next release of the solution yet.

plorent commented 1 year ago

Actually, it looks like pointing to an existing location on S3 only works when you point to a file. When merely pointing to a folder on S3 that contains multiple files, you ultimately end up with an empty array in the generated manifest file. Update: just tested with a manifest.json that points to individual files on S3 and now I get download links for those files when I generate a manifest in my cart.