airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

New Source: AWS S3 #3965

Closed · dmdmishra closed this 3 years ago

dmdmishra commented 3 years ago

Hello team,

I believe it would be great if Airbyte had a built-in AWS S3 connector. It would open a lot of paths for people who want to build pipelines on top of AWS S3, perform data ingestion, and run data science and ML directly on S3.

Thanks

marcosmarxm commented 3 years ago

@dmdmishra Airbyte already has an AWS S3 destination connector. It supports the CSV format, and https://github.com/airbytehq/airbyte/pull/3908 is adding Parquet. Do you have another use case, or do you need S3 as a source?

dmdmishra commented 3 years ago

Hi,

Yes, we want to load data into AWS S3. We are extracting data from Oracle.

Does Airbyte currently support extracting data from Oracle or SQL Server in Avro format and loading it into AWS S3?

Also, I wanted to understand whether Airbyte supports file movement into AWS S3 from on-premise systems.

Thanks, Deepak

marcosmarxm commented 3 years ago

@dmdmishra Yes, Airbyte has Oracle and SQL Server source connectors, and you can add AWS S3 as the destination. The AWS S3 connector currently only supports the CSV format, but Parquet and other formats will be supported in the future.

I recommend you read the quick start guide.

tuliren commented 3 years ago

> Does Airbyte currently support extracting data from Oracle or SQL Server in Avro format and loading it into AWS S3?

@dmdmishra, we are working on the Avro format for S3 at this moment. We should have some updates either late this week or early next week.

dmdmishra commented 3 years ago

That's great news; I will hold on until I hear more from the Airbyte team.

Can I also check with you whether, after extracting the data, we can perform data validation before we ingest it into S3?

tuliren commented 3 years ago

@dmdmishra, sorry about the delayed reply.

> after extracting the data, can we perform data validation

We will create an Avro schema based on the JSON schema of the Airbyte stream from the source. So if the source connector provides a meaningful JSON schema, it will be transformed into a relatively good Avro schema, and each record will be validated against it. The Avro schema is only "relatively" good because not every JSON schema can be mapped to an Avro schema, and the initial version probably won't support keywords like allOf or oneOf.

You can find the documentation about the schema conversion here:

https://github.com/airbytehq/airbyte/blob/b5f5ca3939deac882a69e17353384dd088180534/docs/integrations/destinations/s3.md#data-schema

(It is the s3.md file in PR #3908.)

However, I am not sure if this answers your question. I think the validation will always pass, as long as the source connector actually generates data that conforms to its JSON schema.

Also, not all sources provide a meaningful JSON schema. For example, data in MongoDB is schemaless and can be any JSON object. In those cases, no Avro schema can be generated, and the data cannot be validated. We are still thinking about how to support sources like that.
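
For illustration, here is a minimal sketch of the idea in Python. This is not the actual converter from PR #3908; the simplistic type mapping and the use of the fastavro library are assumptions made for this example:

```python
# Minimal sketch: derive an Avro schema from a flat JSON schema and
# validate a record against it. Not the actual #3908 converter; the
# type mapping and use of fastavro are assumptions for illustration.
from fastavro import parse_schema
from fastavro.validation import validate

# Simplistic JSON-schema-type -> Avro-type mapping; real JSON schemas
# with nested objects or keywords like allOf/oneOf need much more care.
JSON_TO_AVRO = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}

def json_schema_to_avro(name: str, json_schema: dict) -> dict:
    fields = []
    for prop, spec in json_schema["properties"].items():
        avro_type = JSON_TO_AVRO.get(spec.get("type"), "string")
        # Make every field nullable, since JSON schema properties are
        # optional unless listed in "required".
        fields.append({"name": prop, "type": ["null", avro_type], "default": None})
    return {"type": "record", "name": name, "fields": fields}

json_schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
}
avro_schema = parse_schema(json_schema_to_avro("users", json_schema))

# Validation passes as long as the record matches the derived schema.
assert validate({"id": 42, "email": "a@b.c"}, avro_schema)
```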

blotouta2 commented 3 years ago

I want to sync an S3 bucket. My use case: many analytics providers (like Segment) offer a facility to sync data to S3, and we want to fetch that data using an Airbyte S3 source for further computation.

tuliren commented 3 years ago

> I want to sync an S3 bucket. My use case: many analytics providers (like Segment) offer a facility to sync data to S3, and we want to fetch that data using an Airbyte S3 source for further computation.

Hey @blotouta2, we already have a File source connector that can read from S3. Documentation here: https://docs.airbyte.io/integrations/sources/file

That should work for your use case.
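
As a point of reference, what the File source does for a single object can be reproduced in a few lines of Python. This is an illustrative sketch using pandas with s3fs installed, not the connector's actual implementation:

```python
# Illustrative only: reading one public S3 object the way the File
# source does, using pandas + s3fs (not the connector's actual code).
import pandas as pd

# The GDELT sample file from the File source docs; despite the .csv
# extension, GDELT export files are tab-separated and have no header.
url = "s3://gdelt-open-data/events/20190914.export.csv"
df = pd.read_csv(url, sep="\t", header=None, storage_options={"anon": True})
print(df.shape)
```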

blotouta2 commented 3 years ago

@tuliren The File source reads only a single file from any source, whereas I want to sync a whole S3 bucket. The File source can read a path like s3://gdelt-open-data/events/20190914.export.csv; my requirement is to sync a path like s3://gdelt-open-data/events.

tuliren commented 3 years ago

I see. That makes sense. We will either extend the File source connector to read multiple files or create a dedicated S3 source.
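
Until then, a dedicated S3 source would essentially have to enumerate objects under a prefix and read each one. Here is a rough sketch with boto3, reusing the public gdelt-open-data bucket and events/ prefix from the example above (illustrative only, not the planned connector code):

```python
# Rough sketch of "sync a whole prefix": list every object under
# s3://gdelt-open-data/events and process each file in turn.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client, since gdelt-open-data is a public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

# Walk every object under the prefix; a real connector would read and
# parse each key here and emit Airbyte records.
for page in paginator.paginate(Bucket="gdelt-open-data", Prefix="events/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```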

schlattk commented 3 years ago

Yes, I agree with the above. I would like to sync a whole bucket so we can easily pick up a new file once it is added.

Phlair commented 3 years ago

Closed by https://github.com/airbytehq/airbyte/pull/4990. Separate issues exist for adding support for more file formats.