apache / cassandra-gocql-driver

GoCQL Driver for Apache Cassandra®
https://cassandra.apache.org/
Apache License 2.0
2.58k stars 622 forks source link

suggestion for Unmarshal version 2 #1720

Open illia-li opened 1 year ago

illia-li commented 1 year ago

I was doing another project and i had to look into gocql Unmarshal . And i saw that can do gocql Unmarshal more fast and clear, here description: Since I considered this from the user's side, I can say that users can be divided into two types:

  1. Those who need fast clear work, possibly with small predictable variations,
  2. Those who need fast clear work, with full or partial customization.

Currently gocql Unmarshal realizes it through the same path in the code. That's why, there are a lot of "if", "switch", "reflect" in the code. Too much uncertainty in what has come and what needs to be given, it`s causes fast clear work to suffer.

Solution - separate paths in code for different user groups needs and shifting the choice point of the Unmarshals path to the stage of define column types or earlier stage.

Approximate solution description:

  1. Each type has its own "default Unmarshal interface",
  2. For each type Implemented several variations of "Unmarshal interface",
  3. Each cluster has its own set of "default Unmarshal interfaces",
  4. The user is given the opportunity to choose the "default Unmarshal interface" from the set (described in paragraph 2) or replace it with custom "Unmarshal interface", within the entire cluster and/or a specific query. Or not choose any for processing by defaults.

The amount of changes will not be small, so, i can take to make gocql Unmarshal version 2, if you agree with this described current Unmarshal situation and with solutions concept.

martin-sucha commented 1 year ago

Hello!

Indeed the reflection in gocql.Unmarshal is slow. At kiwi.com, we use https://github.com/scylladb/gocqlx and https://github.com/kiwicom/easycql in production for faster unmarshaling.

A better interface directly in gocql would be nice indeed.

It is not clear to me how the interface you are proposing looks like. Please provide code samples.

Some of the points from my experience: Currently it is not possible to customize the unmarshal behavior without implementing gocql.Unmarshaler on a particular type and customize the mapping between types and unmarshaling function.

So the new interface should not have a global gocql.Unmarshal(), but instead pass a function to unmarshal sub-fields as an argument. Something like:

type Unmarshaler2 interface {
    Unmarshal2(info TypeInfo, data []byte, next Unmarshaler2)
}

And we might need to also pass the target value to unmarshal to if the unmarshaling function so that the unmarshaling implementation and type are not coupled.

illia-li commented 1 year ago

Hello! To provide a sample of code changes with a description (which will be enough to see how it will work) will take me about a week. Obviously, some difficulties will need to be solved with complex types, this will be included in the sample of code changes.

illia-li commented 1 year ago

After much investigation and remakes i makes this one #1730. This is not what I promised, but due work with unmarshlling i was understand that without predictable interface for row/rows unmarshalling productive unmarshalling process impossible. I was make much benchmark tests (in #1730 is only a small part of it`s) and saw necessity of row/rows writers because unmarshaling values by slice of interfaces the most unproductive way. The most productive rows unmarshaling way - unmarshal rows to structs and I chose this one as the first thing to do. I would really like to know your opinion about PR conception, to continue work on it. How i think, next steps - make functional tests and make decode functions to help users make their own unmarshallers.

illia-li commented 1 year ago

In https://github.com/gocql/gocql/issues/1720#issuecomment-1709651467 you suggested:

type Unmarshaler2 interface {
    Unmarshal2(info TypeInfo, data []byte, next Unmarshaler2)
}

I have tested various techniques for transferring unmarshal functions. The slowest techniques are transferring functions into the functions. It`s seems that it is very difficult for the compiler to optimize such code.

illia-li commented 1 year ago

To optimize the code, I would suggest considering the process of transferring and preparing data and removing unnecessary actions from there. In my estimation, the first mistake is the assumption that the user does not know what he will receive as a result of the request. However, in most cases, the user knows what he will receive as a result of the request. This means that he know column types, and data unmarshal way. Therefore, i propose to give users the opportunity to send functions for unmarshal rows, together with the request or putting this functions in Iter. Also it`s can help to use less buffer. If users can send unmarshal functions together with the request - gocql will be able to read response data from io.Reader row by row.

And to do it so that they can choose how to make row unmarshal. There are many questions in the unmarshal process that only the user himself can answer. For example: 1) Does the row transform on output into one structure or several, or does it migrate in the original representation? 2) Recognize null separately from zero or not? 3) What go type is needed at the output? If the same code is designed to handle all these possible options, it means that there will be a lot of conditions in the code, and this will lead to a decrease in speed. To avoid this, there is only one way - to give the user the opportunity to choose the unmarshaling way.

illia-li commented 1 year ago

Another point that slows down the current implementation very much - a lot of unnecessary length checks. Possible causes of unmarshal errors: 1) Broken data. 2) Insufficient data length. 3) The unmarshal function is not working correctly.

Cause 3 can be determined via functional testing. Cause 2 can be determined via built-in panic. Cause 1 can be determined via triggered panic. This means that there is no need to make new functions of checking the length (via ifs conditions). Instead, it will be faster to use the built-in length checking functions and recover the code from panic. In #1730 have a benchmark pkg/writers/row/bench_anti_panic_test.go. By this benchmark you can see speed degradation when using a lot of 'ifs'. With increasing depth of functions with checks of length - the speed decreases exponentially.

illia-li commented 1 year ago

For the above reasons, I propose make an interface for rows writers, like io.Writer with specific addons for rows unmarshal.

type Rows interface {
    Prepare(rows int32)
    WriteRow(data []byte) (w int32, err error)
    WriteRows(data []byte) (w int32, err error)
    ColumnsInRow() int
}
illia-li commented 1 year ago

About cql types unmarshal... After reading value length we have a choose; some value, zero value and null value. It`s means: 1) If unmarshaller read value length them-self, then interface must be like:

type Unmarshaler2 interface {
    Unmarshal2( data []byte) int32
}

2) If unmarshaller do not read value length, then interface must be like:

type Unmarshaler2 interface {
    Unmarshal(data []byte)
        Zero()
        Null()
}