crate / crate-clients-tools

Clients, tools, and integrations for CrateDB.
https://crate.io/docs/clients/
Apache License 2.0

ingestr: CLI tool to copy data between any databases #86

Open michabbb opened 8 months ago

michabbb commented 8 months ago

Problem Statement

Switching from any database to crate is not always easy; tools such as https://github.com/bruin-data/ingestr could help a lot. Also, syncing data to crate continuously is not that easy either. It would be nice to see a tool that is able to sync data from MySQL to crate, because at the moment I need tons of bash scripts to make this work (mysqldump -> transform to json -> crash.....)

Possible Solutions

support this project: https://github.com/bruin-data/ingestr 😏

Considered Alternatives

No response

matriv commented 8 months ago

@michabbb Thx for opening this issue; it fits better in this repo, as it concerns an integration tool. Please also take a look at https://github.com/crate/cratedb-flink-jobs, where you can use the power of Flink to migrate data between many different data sources and CrateDB.

amotl commented 8 months ago

Hi Michael,

thank you for writing in. ingestr indeed looks like an excellent tool, thanks for sharing. Can you elaborate on what it would take to make it support CrateDB? We see that ingestr might be based on SQLAlchemy, so if all plays well, it might already work out of the box, at least from a connectivity perspective?

With kind regards, Andreas.

michabbb commented 8 months ago

@amotl I don't know the tool; I just saw it today for the first time, and my very first thought was: where is crate? 😏 So I just wanted to let you know, that's all.

amotl commented 8 months ago

All right, thanks. We will look into it!

michabbb commented 8 months ago

@matriv, thanks for the note, but this would force someone to learn Flink and Kafka first, and to build the whole infrastructure for it; that's a monster for someone who sees it for the first time, sorry 🙈. So having a tool that can sync with only one command line would be a blessing; that's why I wanted to mention ingestr here 😏

matriv commented 8 months ago

You don't need Kafka if you migrate from, let's say, MySQL: just two JDBC connections, one to MySQL as the source and one to CrateDB as the sink. Of course, a tool like the one you suggest would ease things; I just wanted to point to a current alternative. Thx again for bringing this up.

amotl commented 8 months ago

One variant of our dream tool for conducting effortless data I/O would have a convenient and fluent Python API and command-line interface, like ingestr provides, with the power of Apache Flink under the hood.

In this spirit, we are occasionally exploring that topic over at those projects, both being descendants or spin-offs of Kotori in one way or another.

As we can see, @bruin-data is doing an excellent job over there (thanks, @karakanb!), so we will surely adjust our efforts, and will certainly start building upon, and eventually contributing to, ingestr on corresponding "universal I/O" matters, instead of rolling our own adapter/dialect broker.

karakanb commented 8 months ago

Thanks for tagging me folks, and thanks a lot for your kind words on ingestr! ❤️

I didn't know about crate before, but based on what I understand from the repository description, if it is PostgreSQL-compatible, a regular Postgres ingestion through ingestr may work out of the box with crate as well. I'd encourage you to try it out.

If that doesn't work: ingestr utilizes dlt destinations for ease of use, and we can always develop custom destinations. The better way forward would be, if you'd be willing, to contribute directly to dlt, which means that both ingestr and other users of dlt can benefit from it directly. If, for any reason, contributing to dlt is not an option, I'd happily accept a PR on the ingestr side.
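For reference, trying that out could look roughly like the sketch below. This is a hypothetical invocation, not a verified one: the hosts, credentials, and table names are made up, and it assumes CrateDB's PostgreSQL wire protocol on its default port 5432 is compatible enough for ingestr's Postgres destination.

```shell
# Hypothetical sketch; hosts, credentials, and table names are assumptions.
ingestr ingest \
    --source-uri 'mysql://user:password@mysql-host:3306/mydb' \
    --source-table 'mydb.my_table' \
    --dest-uri 'postgresql://crate:@cratedb-host:5432/doc' \
    --dest-table 'doc.my_table'
```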

amotl commented 8 months ago

Dear Burak,

thanks for your quick reply, and also for telling us about dlt. We had it on our radar, but now that we have learned from you that it is the home of the corresponding pipeline elements you use in ingestr, we will focus on improving its compatibility with CrateDB, either by contributing, or by adding corresponding documentation in one way or another.

With kind regards, Andreas.

amotl commented 1 week ago

Hi again,

To approach the integration with dlt/ingestr, we just tried to run the canonical dlt example ^1 with the PostgreSQL connection details from ^2 against CrateDB.

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination=dlt.destinations.postgres("postgresql://crate:@localhost/"),
    dataset_name='doc',
)

Unfortunately, it trips right away.

psycopg2.errors.InternalError_: line 26:19: extraneous input 'UNIQUE' expecting {',', ')'}

That probably means we should contribute an adapter for CrateDB to dlt?

With kind regards, Andreas.

amotl commented 1 week ago

When trying to use the SQLAlchemy destination like:

destination=dlt.destinations.sqlalchemy("crate://crate:@localhost/")

the example bails out with:

dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage sync with exception:

<class 'dlt.destinations.exceptions.DatabaseTransientException'>
(crate.client.exceptions.ProgrammingError) RelationUnknown[Relation 'doc._dlt_pipeline_state' unknown]

It looks like the process does not attempt to create the _dlt_pipeline_state table beforehand.

amotl commented 1 week ago

When running the example on PostgreSQL, we have been able to discover a few DDL statements [^1]. Looking at the particular one for creating _dlt_pipeline_state:

CREATE TABLE IF NOT EXISTS "doc"."_dlt_pipeline_state" (
  "version" bigint  NOT NULL,
  "engine_version" bigint  NOT NULL,
  "pipeline_name" varchar  NOT NULL,
  "state" varchar  NOT NULL,
  "created_at" timestamp with time zone  NOT NULL,
  "version_hash" varchar  ,
  "_dlt_load_id" varchar  NOT NULL,
  "_dlt_id" varchar UNIQUE NOT NULL);

and running it on CrateDB after removing the UNIQUE constraint, we observe a classic limitation: CrateDB does not support column names starting with an underscore, because they conflict with its system-column pattern.

InvalidColumnNameException["_dlt_load_id" conflicts with system column pattern]
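Until a proper adapter exists, both blockers can be patched out of the generated DDL mechanically. Below is a minimal sketch of that idea; `adapt_ddl_for_cratedb` is a hypothetical helper, not part of dlt, and it assumes DDL shaped like the statement quoted above (column definitions each on their own line).

```python
import re

# DDL as emitted by dlt for PostgreSQL (abbreviated from the statement above).
DDL = '''CREATE TABLE IF NOT EXISTS "doc"."_dlt_pipeline_state" (
  "version" bigint NOT NULL,
  "_dlt_load_id" varchar NOT NULL,
  "_dlt_id" varchar UNIQUE NOT NULL)'''

def adapt_ddl_for_cratedb(ddl: str) -> str:
    """Hypothetical helper: best-effort rewrite of dlt's PostgreSQL DDL
    so CrateDB accepts it. A sketch, not a real adapter."""
    # CrateDB's parser rejects column-level UNIQUE constraints.
    ddl = re.sub(r"\s+UNIQUE\b", "", ddl)
    # Column names starting with "_" conflict with CrateDB's system-column
    # pattern; strip leading underscores from line-leading column names.
    ddl = re.sub(r'(?m)^(\s*)"_+', r'\1"', ddl)
    return ddl

print(adapt_ddl_for_cratedb(DDL))
```

Of course, renaming `_dlt_id` and friends would in turn confuse dlt's own bookkeeping, which is another argument for contributing a genuine CrateDB destination/adapter upstream instead.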

[^1]: We don't know why the DDL statements are not automatically issued when using the SQLAlchemy destination with the CrateDB dialect. Most probably, we are doing something wrong.