capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0
1.42k stars 158 forks

add_s3_connection_remote_loading_s3uri_feature #1054

Closed mhmotamedi closed 10 months ago

mhmotamedi commented 10 months ago

Pull Request Summary:

This PR introduces the new S3 connection feature to the DataProfiler repository. It enables DataProfiler to read data directly from remote S3 paths (s3_uri), enhancing its flexibility and data source compatibility.
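As a rough illustration of what handling an S3 path involves, the sketch below detects an `s3://` URI and splits it into bucket and key. The function names `is_s3_uri` and `split_s3_uri` are hypothetical and are not taken from this PR; the actual implementation may differ.

```python
def is_s3_uri(path: str) -> bool:
    """Return True if the path looks like an S3 URI (s3://bucket/key)."""
    return isinstance(path, str) and path.lower().startswith("s3://")


def split_s3_uri(s3_uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    Hypothetical helper for illustration only; not the PR's code.
    """
    prefix = "s3://"
    if not is_s3_uri(s3_uri):
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # Everything up to the first '/' is the bucket, the rest is the object key.
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket name: {s3_uri!r}")
    return bucket, key
```

For example, `split_s3_uri("s3://my-bucket/data/file.csv")` yields the bucket `my-bucket` and the key `data/file.csv`, which is the pair an S3 client needs to fetch the object.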

Changes Made:

  1. Added S3Helper class to facilitate S3 connectivity for DataProfiler.

  2. The class supports two ways of supplying AWS credentials:

    • Passing the AWS access key, secret key, session token, and region name as input parameters.
    • Reading the AWS credentials from environment variables.
  3. Added a new unit test module, test_s3_helper.py, to verify the new S3 connection feature. Also extended the existing unit tests in test_data.py and test_data_utils.py.
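The two credential scenarios in item 2 can be sketched as follows. This is a minimal sketch, not the PR's S3Helper: the function name `resolve_aws_settings` and its `environ` parameter are hypothetical, and the precedence (explicit arguments first, then environment variables) is an assumption. The environment variable names are the standard AWS ones.

```python
import os
from typing import Mapping, Optional


def resolve_aws_settings(
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
    aws_session_token: Optional[str] = None,
    region_name: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Collect AWS settings, preferring explicit arguments over the environment.

    Hypothetical helper for illustration only; not the PR's S3Helper.
    """
    env = os.environ if environ is None else environ
    candidates = {
        "aws_access_key_id": aws_access_key_id or env.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": aws_secret_access_key
        or env.get("AWS_SECRET_ACCESS_KEY"),
        "aws_session_token": aws_session_token or env.get("AWS_SESSION_TOKEN"),
        "region_name": region_name or env.get("AWS_DEFAULT_REGION"),
    }
    # Drop unset values so a client library can fall back to its own defaults.
    return {name: value for name, value in candidates.items() if value}
```

Assuming boto3 is installed, the resulting dictionary uses the same keyword names that `boto3.client("s3", **settings)` accepts, so it could be passed through directly. Taking `environ` as a parameter keeps the function easy to unit test without touching the real environment.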

Details:

Unit Test Added (TestS3Helper):

This PR enhances the S3 connectivity of DataProfiler, making it more versatile in handling different AWS credential scenarios. The functionality was validated in two ways: by the unit tests (test_s3_connection.py, test_data.py and test_data_utils.py), and by installing the library in editable mode and running a number of data load operations for various data types (CSV, Parquet, TXT and JSON).

Please review and test the changes, and let us know your feedback.

CLAassistant commented 10 months ago

All committers have signed the CLA.