@dmdmishra Airbyte already has an AWS S3 destination connector. It supports CSV files, and https://github.com/airbytehq/airbyte/pull/3908 is adding the Parquet format. Do you have another use case, or do you need it as a source?
Hi,
Yes, we want to load data into AWS S3. We are extracting data from Oracle.
Does Airbyte currently support extracting data from Oracle or SQL Server in Avro format and loading it into AWS S3?
Also, I wanted to understand whether Airbyte supports file movement into AWS S3 from on-premise systems.
Thanks, Deepak
@dmdmishra Yes, Airbyte has Oracle and SQL Server source connectors, and you can add AWS S3 as the destination. The AWS S3 connector currently supports only the CSV format, but Parquet and other formats will be supported in the future.
I recommend reading the quick start guide.
> Does Airbyte currently support extracting data from Oracle or SQL Server in Avro format and loading it into AWS S3?
@dmdmishra, we are working on the Avro format on S3 at this moment. Should have some updates there either late this week or early next week.
That's great news; I will hold on till I hear more from the Airbyte team.
Can I also check with you whether, after extracting the data, we can perform data validation before we ingest it into S3?
@dmdmishra, sorry about the delayed reply.
> after extracting the data, can we perform data validation
We will create an Avro schema based on the JSON schema of the Airbyte stream from the source. So if the source connector can provide a meaningful JSON schema, it will be transformed into a relatively good Avro schema, and each record will be validated against it. The Avro schema is only "relatively" good because not every JSON schema can be mapped to an Avro schema, and the initial version probably won't support keywords like allOf or oneOf.
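To make the conversion concrete, here is a minimal sketch of the idea, assuming a flat JSON schema with only primitive property types; Airbyte's real conversion (documented in PR #3908) covers far more cases, and the helper name and type mapping here are illustrative only:

```python
# Illustrative only: a toy mapping from a flat JSON schema to an Avro record
# schema. Airbyte's actual conversion handles nested objects, arrays, unions, etc.
JSON_TO_AVRO_TYPE = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}

def json_schema_to_avro(name: str, json_schema: dict) -> dict:
    """Map each primitive property to a nullable Avro field."""
    fields = []
    for prop, spec in json_schema.get("properties", {}).items():
        avro_type = JSON_TO_AVRO_TYPE.get(spec.get("type"), "string")
        # Fields are made nullable because JSON schema properties are optional by default.
        fields.append({"name": prop, "type": ["null", avro_type], "default": None})
    return {"type": "record", "name": name, "fields": fields}

# A stream schema like one an Oracle source might emit:
stream_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "balance": {"type": "number"},
    },
}
print(json_schema_to_avro("my_stream", stream_schema))
```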
You can find the documentation about the schema conversion in the s3.md file in PR #3908.
However, I am not sure if this answers your question. I think the validation will always pass as long as the source connector generates its data according to its JSON schema (see the validation sketch below).
Also, not all sources will provide a meaningful JSON schema. For example, data in MongoDB is schemaless and can be any JSON object. In those cases, no Avro schema can be generated, and the data cannot be validated. We are still thinking about how to support sources like that.
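To illustrate what validation against a generated schema looks like, here is a hedged sketch using the fastavro library (an assumption; the connector may use a different Avro implementation internally):

```python
# Sketch: validating records against an Avro schema with fastavro.
from fastavro import parse_schema
from fastavro.validation import validate

avro_schema = parse_schema({
    "type": "record",
    "name": "my_stream",
    "fields": [
        {"name": "id", "type": ["null", "long"], "default": None},
        {"name": "name", "type": ["null", "string"], "default": None},
    ],
})

# A record that matches the source's JSON schema validates cleanly...
validate({"id": 1, "name": "deepak"}, avro_schema)

# ...while a record with a mismatched type raises a ValidationError.
try:
    validate({"id": "not-a-number", "name": "deepak"}, avro_schema)
except Exception as err:
    print(f"validation failed: {err}")
```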
I want to sync an S3 bucket. My use case: many analytics providers (like Segment) offer a facility to sync data to S3, and we could fetch that data using an Airbyte source-s3 for further computation.
> I want to sync an S3 bucket. My use case: many analytics providers (like Segment) offer a facility to sync data to S3, and we could fetch that data using an Airbyte source-s3 for further computation.
Hey @blotouta2, we already have a File source connector that can read from S3. Documentation here: https://docs.airbyte.io/integrations/sources/file
That should work for your use case.
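For reference, a File source configuration pointed at a single S3 object looks roughly like the sketch below. The field names follow the File source documentation linked above, but treat them as assumptions and check the connector's spec; the bucket path is a placeholder:

```python
# Hypothetical File source configuration for one CSV file on S3.
# Field names are taken from the File source docs; verify against the spec.
file_source_config = {
    "dataset_name": "my_dataset",           # name of the output stream
    "format": "csv",                        # file format to parse
    "url": "s3://my-bucket/path/file.csv",  # placeholder: a single S3 object
    "provider": {
        "storage": "S3",
        # Credentials can be left empty for public buckets.
        "aws_access_key_id": "",
        "aws_secret_access_key": "",
    },
}
```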
@tuliren The File source reads only a single file from any source, whereas I want to sync a whole S3 bucket. The File source can read a path like s3://gdelt-open-data/events/20190914.export.csv, but my requirement is to sync a path like s3://gdelt-open-data/events.
I see. That makes sense. We will either make the File source connector able to read multiple files or create a dedicated S3 source.
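As a sketch of what a multi-file sync implies (not Airbyte code, just the underlying S3 operation a dedicated source would build on), enumerating every object under a prefix with boto3 looks like this:

```python
# Sketch: list all objects under s3://gdelt-open-data/events/ with boto3.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The UNSIGNED config allows anonymous access to this public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="gdelt-open-data", Prefix="events/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```

A dedicated S3 source would then read each listed key, which is also how newly added files would get picked up on subsequent syncs.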
Yes, I agree with the above; we would like to sync the whole bucket so we can easily get new files once they are added.
Closed by https://github.com/airbytehq/airbyte/pull/4990. Issues exist for building in more file format support.
Hello team,
I believe it would be best if we could have an AWS S3 connector built into Airbyte. It would open a lot of paths for people who want to build pipelines on top of AWS S3, performing data ingestion and running data science and ML on top of S3 directly.
Thanks