This repository contains the DataSharingClient class, which allows you to interact with data stored in S3 and perform queries using DuckDB. This guide will help you set up your environment, configure your credentials, and use the various functionalities provided by the DataSharingClient.
LLMs can be a helpful partner when working with this repository. Copy the contents of LLMPartner.txt and paste it into a chat assistant such as ChatGPT, Claude, Gemini, or any other provider you prefer. Your LLM partner can help with SQL query syntax, provide guidance on using DuckDB within the DataSharingClient, and answer general questions about the code and your analysis.
Setting Up a Virtual Environment with the venv Module:
On Windows, open Command Prompt and navigate to your project directory:
cd path\to\your\project
Create a virtual environment with a custom name (e.g., myenv):
python -m venv myenv
Activate the virtual environment:
myenv\Scripts\activate
Install the required dependencies:
pip install -r requirements.txt
On macOS or Linux, open Terminal and navigate to your project directory:
cd path/to/your/project
Create a virtual environment with a custom name (e.g., myenv):
python3 -m venv myenv
Activate the virtual environment:
source myenv/bin/activate
Install the required dependencies:
pip install -r requirements.txt
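To confirm that your shell or notebook kernel is really using the environment you just created, a quick check along these lines can help. It is a minimal sketch that relies only on the standard library; the function name in_virtual_environment is illustrative and not part of this repository.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```

If the second line reports False, activate the environment (or pick it as the notebook kernel) before installing dependencies.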
Copy the example environment file to create a new .env file:
cp .env.example .env
Open the .env file and input your credentials:
OCEAN_USERNAME=your_username
OCEAN_PASSWORD=your_password
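Before initializing the client, it can be useful to check that both credentials are filled in and are no longer the placeholder values. The sketch below is a standalone sanity check using only the standard library; it does not reproduce how DataSharingClient itself loads credentials, and the helper names read_env_file and missing_credentials are hypothetical.

```python
from pathlib import Path

REQUIRED_KEYS = ("OCEAN_USERNAME", "OCEAN_PASSWORD")


def read_env_file(path=".env"):
    """Parse simple KEY=value lines; return {} if the file is missing."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without a KEY=value shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_credentials(values):
    """Return required keys that are absent, empty, or still placeholders."""
    placeholders = {"", "your_username", "your_password"}
    return [key for key in REQUIRED_KEYS if values.get(key, "") in placeholders]


missing = missing_credentials(read_env_file())
print("Missing or placeholder credentials:", missing or "none")
```

Run it from the project directory so that the relative path .env resolves to the file you just edited.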
Run the first code block to set all imports and initialize the client:
# Initialize the client using credentials from .env file
client = DataSharingClient()
For VSCode users: You can work directly in the .ipynb file without using the command line. Click Select Kernel in the top right corner and choose your virtual environment.
Default Initialization:
client = DataSharingClient()
Custom Initialization with DuckDB Parameters:
duckdb_path = "path/to/file/nameofyourduckdbfile.duckdb"
client = DataSharingClient(duckdb_region="us-east-1", duckdb_path=duckdb_path)
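When pointing the client at a custom database file, note that DuckDB creates the file on first connection but generally expects the containing directory to exist already. The following is a small sketch for preparing that directory; prepare_duckdb_path is a hypothetical helper, and the example uses a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def prepare_duckdb_path(path_str):
    """Create the parent directory for a DuckDB file and return the path as a string."""
    db_path = Path(path_str).expanduser()
    # Only the directory is created here; the database file itself is left
    # for DuckDB to create on first connection.
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return str(db_path)


with tempfile.TemporaryDirectory() as tmp:
    duckdb_path = prepare_duckdb_path(Path(tmp) / "analysis" / "mydata.duckdb")
    print("Parent directory ready:", Path(duckdb_path).parent.is_dir())
    # The returned string could then be passed on, for example:
    # client = DataSharingClient(duckdb_region="us-east-1", duckdb_path=duckdb_path)
```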
Creating a View from S3 URI:
# Example: Creating a view from a Parquet file in S3
s3_uri = "s3://your-bucket-name/path/to/yourfile.parquet"
view_name = "your_view_name"
client.create_view(s3_uri, view_name)
Creating a View from Local Path:
# Example: Creating a view from a Parquet file in local storage
local_path = "path/to/local/file/yourfile.parquet"
view_name = "your_view_name"
client.create_view(local_path, view_name)
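For intuition, a view over a Parquet file in DuckDB is usually just a stored SELECT over read_parquet, so no data is copied when the view is created. The sketch below builds such a statement as a string. It is an assumption about what create_view might do internally, not the client's actual implementation, and build_parquet_view_sql is a hypothetical name.

```python
import re

# Plain SQL identifiers only: letters, digits, and underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_parquet_view_sql(source, view_name):
    """Build a DuckDB statement that exposes a Parquet file as a view."""
    if not _IDENTIFIER.fullmatch(view_name):
        raise ValueError(f"Unsafe view name: {view_name!r}")
    # Double any single quotes so the path is a valid SQL string literal.
    escaped = source.replace("'", "''")
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{escaped}')"
    )


print(build_parquet_view_sql(
    "s3://your-bucket-name/path/to/yourfile.parquet", "your_view_name"
))
```

The same statement shape works for an S3 URI and a local path; reading from S3 in plain DuckDB additionally requires the httpfs extension and credentials, which the client presumably configures for you.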
Querying the View to Count the Records:
# Example: Querying the view to count the records
query = "SELECT COUNT(*) FROM your_view_name;"
result_df = client.query(query)
print(result_df)
Creating a New Table from a Query:
# Example: Creating a new table from a query
query = "SELECT * FROM your_view_name WHERE your_column > some_value;"
new_table_name = "new_table_name"
client.query(query, new_table_name)
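Unlike a view, a table created from a query stores a copy of the result, so later reads do not go back to the source file. In DuckDB terms this corresponds to a CREATE TABLE ... AS statement. The sketch below shows one way such a statement could be assembled; it is an illustration under that assumption rather than the client's actual code, and build_create_table_sql is a hypothetical name.

```python
def build_create_table_sql(query, table_name):
    """Wrap a SELECT in CREATE OR REPLACE TABLE ... AS for DuckDB."""
    # Loose safety check: accept only names that look like plain identifiers.
    if not table_name.isidentifier():
        raise ValueError(f"Unsafe table name: {table_name!r}")
    # Trim whitespace and a trailing semicolon so the result is one clean statement.
    select_sql = query.strip().rstrip(";").strip()
    return f"CREATE OR REPLACE TABLE {table_name} AS {select_sql}"


print(build_create_table_sql(
    "SELECT * FROM your_view_name WHERE your_column > 10;", "new_table_name"
))
```

The threshold 10 stands in for some_value and is only a placeholder.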
Listing All Tables and Views:
# Example: Listing all tables and views
tables = client.list_tables()
print(tables)