ClickHouse / clickhouse-cpp

C++ client library for ClickHouse
Apache License 2.0

Version to use for new project / blob data ingestion #382

Closed gparlamas closed 2 months ago

gparlamas commented 3 months ago

Hey folks,

I am planning to use your library to ingest some time series data into ClickHouse. I have a couple of questions:

Thanks in advance, George

gparlamas commented 3 months ago

So far I have been using the code below. Is this the recommended / most efficient way?

 struct Sample
 {
     int64_t data1;
     int64_t data2;
     int64_t data3;
 };

 std::ostream& operator << (std::ostream& str, const Sample& s)
 { 
       str << "Sample:" << s.data1 << ' ' << s.data2 << ' '<< s.data3;
       return str;
 }

Client client(ClientOptions().SetHost("localhost"));

client.Execute("CREATE TABLE IF NOT EXISTS default.numbers (id UInt16, tm DateTime64(6), msg Array(UInt8)) ENGINE = Memory");

{
    Block block;
    auto buffer = std::make_shared<ColumnUInt8>();

    auto id = std::make_shared<ColumnUInt16>();
    auto tm = std::make_shared<ColumnDateTime64>(6);
    auto blobArray = std::make_shared<ColumnArray>(std::make_shared<ColumnUInt8>());
    auto blob = std::make_shared<ColumnUInt8>();
    uint16_t counter{0};
    auto AppendColumns = [&](Sample s)
    {
      id->Append(++counter);
      auto now = std::chrono::system_clock::now().time_since_epoch();
      auto micros = std::chrono::duration_cast<std::chrono::microseconds>(now).count();
      tm->Append(micros);

      uint8_t buffer[1024]{};
      memcpy(buffer, &s, sizeof(Sample)); 
      ArrayInput input(buffer, sizeof(Sample));
      blob->LoadBody(&input, input.Avail());
      blobArray->AppendAsColumn(blob);
    };

    AppendColumns(Sample{122, 111, 133});
    AppendColumns(Sample{22, 11, 33});
    AppendColumns(Sample{2, 1, 3});

    block.AppendColumn("id"  , id);
    block.AppendColumn("tm", tm);
    block.AppendColumn("msg", blobArray);

    client.Insert("default.numbers", block);
}

client.Select("SELECT id, tm, msg FROM default.numbers", [] (const Block& block)
    {
        for (size_t i = 0; i < block.GetRowCount(); ++i) {
            auto id = block[0]->As<ColumnUInt16>()->At(i);
            auto tm = block[1]->As<ColumnDateTime64>()->At(i);
            auto blob = block[2]->As<ColumnArray>()->GetAsColumnTyped<ColumnUInt8>(i);
            std::cout << id << ' ' << tm << ' ';
            std::cout << *reinterpret_cast<Sample*>(blob->GetWritableData().data()) << std::endl;
        }
    }
);


Enmk commented 2 months ago

Hi @gparlamas, sorry for the late reply.

First of all, 3.0.0 is not out yet, and there are no concrete plans for a release date. So please choose the most recent release (as of now, v2.5.1), or just use master's head.

Second, your snippet looks about right, except for the Array creation. The easiest (and most performant) way is to use the type-aware ColumnArrayT wrapper:

auto blobArray = std::make_shared<clickhouse::ColumnArrayT<ColumnUInt64>>();
// Append() accepts a vector, a C array, or anything iterable with std::begin() and std::end().
blobArray->Append(buffer);
block.AppendColumn("msg", blobArray);

Also, you may want to use String instead of Array if your data is some sort of binary blob. That way it can be handled more efficiently on the sender/receiver side and may be more convenient to work with in SQL (though that depends heavily on what kind of data you have).

gparlamas commented 2 months ago

Hi @Enmk,

I'd rather use the latest master. It's been a while since your last official release, so I would prefer a version with the latest improvements / fixes, unless you think some of the recent changes are not battle-tested / ready for prime time. Any particular reason you are not planning the release of 3.0.0 yet?

Thanks for suggesting ColumnArrayT, it's exactly what I need!

Regarding using String instead of Array, it crossed my mind to convert the binary msgs into hex or base64 and store them as a String, but I'd rather not go down that path. The binary msgs won't be used interactively with SQL; they will be processed by another application that will load and cast them back to their original binary form.

Enmk commented 2 months ago

You can store binary data in a String perfectly well, without encoding it as hex (or anything else).

As for 3.0.0, the main reason is basically not enough resources to push a couple of important features/fixes.

gparlamas commented 2 months ago

Do you have an example of how to do this? ColumnString::Append(std::string_view str) or ColumnString::LoadBody()? I would like to avoid nasty casts if possible. In terms of performance, is there a big difference between ColumnArrayT and String?

Enmk commented 2 months ago

There is an Append family of methods that take a string, which can hold arbitrary binary data, including null bytes (for every overload except the const char* one, obviously). Those methods copy the data into the column itself, which should be acceptable in most cases.
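A minimal sketch of the "binary in a string, no nasty casts" idea, using only the standard library. `PackBytes` and `UnpackBytes` are hypothetical helper names, not part of clickhouse-cpp; the assumption is that the resulting std::string (or a std::string_view over it) is what gets handed to the ColumnString::Append overload mentioned above. It also shows why the const char* path is unsuitable: anything that relies on null termination stops at the first zero byte.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <type_traits>

// Same layout as the Sample struct from the question.
struct Sample {
    std::int64_t data1;
    std::int64_t data2;
    std::int64_t data3;
};

// Hypothetical helper: copy the object representation of a trivially
// copyable value into a std::string. The string keeps every byte,
// including embedded zero bytes, because it tracks its own length.
template <typename T>
std::string PackBytes(const T& value) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "only trivially copyable types can be copied bytewise");
    std::string out(sizeof(T), '\0');
    std::memcpy(out.data(), &value, sizeof(T));
    return out;
}

// Hypothetical helper: the inverse operation. memcpy into a real T object
// avoids reinterpret_cast and the alignment/aliasing problems it can cause.
// Returns false if the byte count does not match sizeof(T).
template <typename T>
bool UnpackBytes(std::string_view bytes, T& out) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "only trivially copyable types can be copied bytewise");
    if (bytes.size() != sizeof(T)) {
        return false;
    }
    std::memcpy(&out, bytes.data(), sizeof(T));
    return true;
}
```

Note that the bytes are stored in the host's native layout (endianness, padding), so this only round-trips safely between a writer and reader that agree on that layout.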

However, if you have some data whose lifetime you can guarantee will exceed that of the ColumnString instance using it, you may use ColumnString::AppendNoManagedLifetime, which just references the value inside the column, without copying any memory on the client.

Regarding performance: from the server's standpoint, String and Array are organized somewhat similarly, so there should be no big difference. However, if you can avoid excessive copying (ColumnString::AppendNoManagedLifetime), the difference on the client side might be considerable, depending on your use case.
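To illustrate the trade-off between the copying and the non-owning append, here is a toy model built only on the standard library. `ToyStringColumn`, `AppendCopy`, and `AppendView` are hypothetical names invented for this sketch; this is not how clickhouse-cpp implements ColumnString, only the general owning-versus-referencing idea.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a string column; NOT the clickhouse-cpp class.
class ToyStringColumn {
public:
    ToyStringColumn() = default;
    // Copying would leave views pointing into the source object's storage.
    ToyStringColumn(const ToyStringColumn&) = delete;
    ToyStringColumn& operator=(const ToyStringColumn&) = delete;

    // Copying append: the column owns the bytes, so the caller's buffer
    // may be modified or destroyed right after the call.
    void AppendCopy(std::string_view value) {
        // std::deque keeps references to existing elements valid on
        // emplace_back, so earlier views into owned_ stay usable.
        owned_.emplace_back(value);
        items_.emplace_back(owned_.back());
    }

    // Non-owning append: only pointer and length are stored. The caller
    // must keep the underlying buffer alive and unchanged for as long as
    // the column is in use, otherwise At() returns a dangling view.
    void AppendView(std::string_view value) {
        items_.push_back(value);
    }

    std::string_view At(std::size_t i) const { return items_[i]; }
    std::size_t Size() const { return items_.size(); }

private:
    std::deque<std::string> owned_;        // storage for copied values
    std::vector<std::string_view> items_;  // one view per row
};
```

With `std::string backing("ab\0cd", 5);`, calling `AppendCopy(backing)` yields a row whose data pointer differs from `backing.data()`, while `AppendView(backing)` yields a row that points straight at `backing`, which is where the saved copy (and the lifetime obligation) comes from.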

And, by the way, LoadBody, SaveBody, and any other load-and-save methods on any of the columns are not expected to be directly used by library clients.

gparlamas commented 2 months ago

Yeah, I got the impression LoadBody et al. are not really part of the public interface - I couldn't resist their zero-copy semantics... ;-)

Thanks for helping @Enmk