DataSharing Platform

This repository contains the DataSharingClient class, which allows you to interact with data stored in S3 and perform queries using DuckDB. This guide will help you set up your environment, configure your credentials, and use the various functionalities provided by the DataSharingClient.

LLM Partner

LLMs can be a helpful partner when working with this repository. Copy the contents of LLMPartner.txt and paste them into a chat assistant such as ChatGPT, Claude, Gemini, or any other provider you prefer. Your LLM partner can help with SQL query syntax, provide guidance on using DuckDB within the DataSharingClient, and answer general questions about the code and your analysis.

Setup Instructions

Prerequisites

Python 3.7 or higher
pip (Python package installer)
Virtual environment (venv) module

Setting Up the Virtual Environment

Windows

Open Command Prompt and navigate to your project directory:
```
cd path\to\your\project
```
Create a virtual environment with a custom name (e.g., myenv):
```
python -m venv myenv
```
Activate the virtual environment:
```
myenv\Scripts\activate
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Linux

Open Terminal and navigate to your project directory:
```
cd path/to/your/project
```
Create a virtual environment with a custom name (e.g., myenv):
```
python3 -m venv myenv
```
Activate the virtual environment:
```
source myenv/bin/activate
```
Install the required dependencies:
```
pip install -r requirements.txt
```
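Before moving on, it can help to confirm that the interpreter you are running meets the prerequisite version and that the virtual environment is active. The following is a minimal sketch using only the standard library. The helper names `in_virtualenv` and `version_ok` are illustrative and not part of this repository.

```python
import sys

MIN_VERSION = (3, 7)  # minimum version listed under Prerequisites


def in_virtualenv() -> bool:
    """Return True when running inside an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


def version_ok(version=None) -> bool:
    """Return True when the given (or current) version meets MIN_VERSION."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_VERSION


print("Python version OK:", version_ok())
print("Virtual environment active:", in_virtualenv())
```

If the second line prints `False`, the activate step was probably skipped or run in a different shell.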

Configuring Your Environment

Copy the example environment file and create a new .env file:
```
cp .env.example .env
```

Open the .env file and input your credentials:

```
OCEAN_USERNAME=your_username
OCEAN_PASSWORD=your_password
```
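A quick way to catch a missing or empty credential before initializing the client is to check the file yourself. The sketch below assumes the simple `KEY=value` layout shown above and uses a hand-written parser. It does not reflect how DataSharingClient actually reads the file, and the helper names `parse_env` and `missing_keys` are hypothetical.

```python
from pathlib import Path

# Keys taken from the .env example above.
REQUIRED_KEYS = ("OCEAN_USERNAME", "OCEAN_PASSWORD")


def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> list:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


env_file = Path(".env")
if env_file.exists():
    print("Missing keys:", missing_keys(env_file.read_text()))
else:
    print(".env not found in the current directory")
```

An empty list means both keys are present. The check does not validate the credentials themselves.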

Example Usage

Setting Up the Environment

Run the first code cell in the notebook to load the imports and initialize the client:

```
# Initialize the client using credentials from .env file
client = DataSharingClient()
```

For VSCode users: You can work directly in the .ipynb file without using the command line. Click Select Kernel in the top right corner and choose your virtual environment.

Initialization with Different Config Options

Default Initialization:
```
client = DataSharingClient()
```

Custom Initialization with DuckDB Parameters:

```
duckdb_path = "path/to/file/nameofyourduckdbfile.duckdb"
client = DataSharingClient(duckdb_region="us-east-1", duckdb_path=duckdb_path)
```

Creating a View

Creating a View from S3 URI:

```
# Example: Creating a view from a Parquet file in S3
s3_uri = "s3://your-bucket-name/path/to/yourfile.parquet"
view_name = "your_view_name"
client.create_view(s3_uri, view_name)
```

Creating a View from Local Path:

```
# Example: Creating a view from a Parquet file in local storage
local_path = "path/to/local/file/yourfile.parquet"
view_name = "your_view_name"
client.create_view(local_path, view_name)
```
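Because `create_view` accepts both S3 URIs and local paths, a malformed URI (for example, one with a bucket but no object key) is easy to pass by accident. The sketch below shows one way to sanity-check a source string before handing it to the client. `describe_source` is a hypothetical helper, not part of DataSharingClient, and it makes no claim about how the client itself distinguishes the two cases.

```python
from urllib.parse import urlparse


def describe_source(source: str) -> dict:
    """Classify a source string as an S3 URI or a local path."""
    parsed = urlparse(source)
    if parsed.scheme == "s3":
        # An S3 URI has the form s3://<bucket>/<key>.
        if not parsed.netloc or not parsed.path.strip("/"):
            raise ValueError(f"S3 URI needs a bucket and a key: {source!r}")
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "key": parsed.path.lstrip("/"),
        }
    # Anything without the s3 scheme is treated as a local path.
    return {"kind": "local", "path": source}


print(describe_source("s3://your-bucket-name/path/to/yourfile.parquet"))
print(describe_source("path/to/local/file/yourfile.parquet"))
```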

Querying the View

Querying the View to Count the Records:

```
# Example: Querying the view to count the records
query = "SELECT COUNT(*) FROM your_view_name;"
result_df = client.query(query)
print(result_df)
```

Creating a New Table from a Query:

```
# Example: Creating a new table from a query
query = "SELECT * FROM your_view_name WHERE your_column > some_value;"
new_table_name = "new_table_name"
client.query(query, new_table_name)
```
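Since `client.query` takes a plain SQL string, the placeholders `your_view_name`, `your_column`, and `some_value` have to be filled in before the call. One cautious way to do that is to validate the identifiers and the numeric threshold before formatting them into the string. The sketch below illustrates this. `build_filter_query` is a hypothetical helper and not part of this repository.

```python
import re

# Plain SQL identifiers only: letters, digits, and underscores,
# not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_filter_query(view_name: str, column: str, threshold) -> str:
    """Build a 'SELECT * ... WHERE column > threshold' query string."""
    for name in (view_name, column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Not a plain SQL identifier: {name!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise TypeError("threshold must be an int or a float")
    return f"SELECT * FROM {view_name} WHERE {column} > {threshold};"


query = build_filter_query("your_view_name", "your_column", 100)
print(query)
```

The resulting string can then be passed to the client in the same way as above, for example `client.query(query, "new_table_name")`.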

Listing All Tables


```
# Example: Listing all tables and views
tables = client.list_tables()
print(tables)
```
