avro-kotlin / avro4k

Avro format support for Kotlin
Apache License 2.0

How to read previously written Binary files from v1 #245

Closed rutkowskij closed 4 weeks ago

rutkowskij commented 2 months ago

In the previous version there was an AvroEncodeFormat.Binary. Using that, you could store data without a schema and provide an external schema on read. I tried to read previously saved data using AvroSingleObject and the provided schema, but the magic numbers do not match, so it won't work. Have I missed something? I can't find a way to continue supporting previously saved data with the new version of avro4k. Thank you for your great work!

avro4k 1.10.1:

    /**
     * Encodes the record in a binary format without schema information, the most compact format.
     *
     * See https://avro.apache.org/docs/current/spec.html#binary_encoding
     */
    object Binary : AvroEncodeFormat() {
        override fun <T> createOutputStream(
            output: OutputStream,
            schema: Schema,
            converter: (T) -> GenericRecord
        ) = AvroBinaryOutputStream(output, converter, schema)

    }
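
The magic-number mismatch is expected. According to the Avro specification, single-object encoding prefixes the payload with the two-byte marker C3 01 and the 8-byte little-endian CRC-64-AVRO fingerprint of the schema, whereas the raw binary encoding has no header at all. Below is a minimal sketch, using only the Kotlin standard library, of how such a header could be detected. The helper names are hypothetical and not part of avro4k, and a raw payload could in principle begin with the same two bytes by coincidence, so treat the check as a hint rather than proof.

```kotlin
// Hypothetical helpers for illustration; they are not part of avro4k.
// Layout per the Avro specification: 2-byte marker (0xC3 0x01),
// then the 8-byte little-endian CRC-64-AVRO schema fingerprint, then the payload.
val SINGLE_OBJECT_MARKER = byteArrayOf(0xC3.toByte(), 0x01)
const val SINGLE_OBJECT_HEADER_SIZE = 10

/** True if the data is long enough for a header and starts with the marker. */
fun hasSingleObjectHeader(data: ByteArray): Boolean =
    data.size >= SINGLE_OBJECT_HEADER_SIZE &&
        data[0] == SINGLE_OBJECT_MARKER[0] &&
        data[1] == SINGLE_OBJECT_MARKER[1]

/** Returns the schema fingerprint stored in the header, or null if there is no header. */
fun singleObjectFingerprint(data: ByteArray): Long? {
    if (!hasSingleObjectHeader(data)) return null
    var fingerprint = 0L
    for (i in 0 until 8) {
        // Little-endian: byte i contributes bits 8*i .. 8*i+7.
        fingerprint = fingerprint or ((data[2 + i].toLong() and 0xFFL) shl (8 * i))
    }
    return fingerprint
}

fun main() {
    // A raw binary payload: no header, just the encoded fields.
    val raw = byteArrayOf(0x0A, 0x41, 0x6C, 0x69, 0x63, 0x65, 0x3C)
    println(hasSingleObjectHeader(raw)) // false

    // The same payload framed with a marker and a dummy fingerprint.
    val framed = SINGLE_OBJECT_MARKER + ByteArray(8) { (it + 1).toByte() } + raw
    println(hasSingleObjectHeader(framed)) // true
}
```

Data written with the old Binary format falls into the first case, which is why it has to be decoded as plain binary with an externally supplied writer schema.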
rutkowskij commented 2 months ago

Oh, never mind, it works with:

    public fun <T> Avro.decodeFromGenericData(
        writerSchema: Schema,
        deserializer: DeserializationStrategy<T>,
        value: Any?,
    ): T

Chuckame commented 2 months ago

Yes, please read the migration guide, you have it here: https://github.com/avro-kotlin/avro4k/blob/main/Migrating-from-v1.md#generic-data-serialization

If you think something is missing, please open a PR to add this migration example :rocket:

rutkowskij commented 2 months ago

I had to do some digging in both versions to find a solution. Let me know if there is a better way than what I suggested in https://github.com/avro-kotlin/avro4k/pull/247

Chuckame commented 2 months ago

Oh, I just got it. You want to deserialize generic data from bytes, which is not possible at the moment. What is the purpose of using avro4k then? Are you using avro4k with Kotlin classes declared with @Serializable? Or are you only encoding/decoding generic data?

rutkowskij commented 2 months ago

I am using a previous version of avro4k with @Serializable. I would like to update avro4k to version 2.0.0, but I have a lot of data saved as AvroEncodeFormat.Binary and need to maintain backward compatibility.

Chuckame commented 2 months ago

This is just raw, pure Avro serialization. Generic data and specific classes are serialized exactly the same way.

Chuckame commented 2 months ago

Hello, I haven't heard from you for 2 weeks. Can you explain your needs a bit?

need to maintain backward compatibility

As said, whatever the runtime type of your data (generic record or data class), if the content and the schema are the same, the binary representation will be exactly the same. Do you need to deserialize generic content, meaning you cannot know the schema in advance? Or do you know the content, so that you can write a dedicated Kotlin data class? A final thought: if you only deal with generic records, then avro4k may not be the right solution, as you won't use any Kotlin feature, so I would advise you to use the standard Apache library.
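
To illustrate why the runtime type does not matter, here is a minimal hand-rolled sketch of the Avro binary encoding for a hypothetical record schema with a string field `name` and an int field `age`. The bytes are derived only from the field values, in schema order: no field names, no type tags, no schema. The encoder below is an illustration of the specification's rules (zig-zag plus variable-length integers, length-prefixed UTF-8 strings), not avro4k or Apache Avro code, and real projects should use one of those libraries.

```kotlin
import java.io.ByteArrayOutputStream

// Hypothetical hand-rolled encoder for illustration only.

/** Writes an Avro int/long: zig-zag encoding, then 7 bits per byte, least significant group first. */
fun ByteArrayOutputStream.writeAvroLong(value: Long) {
    var n = (value shl 1) xor (value shr 63) // zig-zag: small magnitudes become small unsigned numbers
    while ((n and 0x7FL.inv()) != 0L) {
        write(((n and 0x7FL) or 0x80L).toInt()) // high bit set: more bytes follow
        n = n ushr 7
    }
    write(n.toInt())
}

/** Writes an Avro string: UTF-8 byte length as a long, then the UTF-8 bytes. */
fun ByteArrayOutputStream.writeAvroString(value: String) {
    val utf8 = value.toByteArray(Charsets.UTF_8)
    writeAvroLong(utf8.size.toLong())
    write(utf8, 0, utf8.size)
}

/** Encodes a record {name: string, age: int}: field values in schema order, nothing else. */
fun encodePerson(name: String, age: Int): ByteArray =
    ByteArrayOutputStream().apply {
        writeAvroString(name)
        writeAvroLong(age.toLong())
    }.toByteArray()

fun main() {
    val bytes = encodePerson("Alice", 30)
    println(bytes.joinToString(" ") { "%02X".format(it) }) // 0A 41 6C 69 63 65 3C
}
```

Because nothing in the output identifies the schema, a reader must be given the writer schema separately, which is exactly the situation with data stored through the old Binary format.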

I'm currently implementing support for GenericRecord, GenericFixed and GenericEnumSymbol, but it takes time to do it cleanly.

rutkowskij commented 2 months ago

Thank you for your patience, I was on vacation, hence the late reply. I used the previous version of avro4k in a project where both AvroEncodeFormat.Data (@Serializable with an embedded schema) and AvroEncodeFormat.Binary were used as standard; for Binary, because of the large number of small objects, the schema was saved in a separate file. I was looking for a way to read both formats in parallel after updating to the newest version. There is no problem with Data, and for Binary I found a way, which I added to the migration instructions in PR #247. I can merge it if you think it may be useful for others (and it seems it may be, because previously it was possible to use the Binary format with an externally delivered schema). From my perspective, as I found a solution to the problem, there is no need to do anything more here.
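
When both formats have to be read side by side, one possible approach is to look at the first bytes of each blob: per the Avro specification, object container files (the format with an embedded schema) start with the four bytes 'O', 'b', 'j', 1, while raw binary data has no header. The sketch below is a hypothetical dispatcher, not part of avro4k, and the check is only a heuristic, since a raw payload could start with the same bytes by coincidence. Recording the format out of band (file name, directory, metadata) is more robust where that is possible.

```kotlin
// Hypothetical dispatcher for illustration; not part of avro4k.
enum class StoredFormat { OBJECT_CONTAINER, RAW_BINARY }

// Avro object container files start with the four bytes 'O', 'b', 'j', 1.
val CONTAINER_MAGIC = byteArrayOf('O'.code.toByte(), 'b'.code.toByte(), 'j'.code.toByte(), 1)

/** Heuristically classifies a stored blob by its leading bytes. */
fun detectFormat(data: ByteArray): StoredFormat {
    val looksLikeContainer = data.size >= CONTAINER_MAGIC.size &&
        CONTAINER_MAGIC.indices.all { data[it] == CONTAINER_MAGIC[it] }
    return if (looksLikeContainer) StoredFormat.OBJECT_CONTAINER else StoredFormat.RAW_BINARY
}

fun main() {
    val blob = byteArrayOf(0x0A, 0x41, 0x6C, 0x69, 0x63, 0x65, 0x3C)
    when (detectFormat(blob)) {
        // The schema is embedded in the file header.
        StoredFormat.OBJECT_CONTAINER -> println("read as object container")
        // The writer schema must be loaded from wherever it was stored separately.
        StoredFormat.RAW_BINARY -> println("decode with the external writer schema")
    }
}
```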

Chuckame commented 2 months ago

No problem. I think it's a must to allow handling generic data, as we do not always have the corresponding classes. And if working with both generic and specific records requires using both avro4k and the Apache Avro library, then it's better to unify the API for this kind of usage. By the way, in my company we also have these mixed needs.

I'm currently working on that solution, simply using the Any type to trigger generic encoding or decoding. There are still the logical types to handle during decoding, and then I'll submit the PR. That's why I haven't merged your PR: there is ongoing work to support this use case.

Chuckame commented 1 month ago

Hello again, I'm ready to merge and release. But there is still a gap to fill: logical types.

Do you need to decode LocalDateTime, LocalDate or things like that? Or do you just use common ints, strings, maps, etc.?

Chuckame commented 1 month ago

After reading your messages many times, I'm really sorry for misunderstanding your initial request: you just want to read previously written binary Avro with v2... I've commented accordingly on your PR #247; there is nothing to do on the avro4k side.

// Previously
Avro.default.openInputStream(serializer) { decodeFormat = AvroDecodeFormat.Binary(schema) }
    .from(data).use { avroInputStream -> return avroInputStream.nextOrThrow() }

// Now
val inputStream = ByteArrayInputStream(data)
// ByteArrayInputStream exposes available(), not remaining()
while (inputStream.available() > 0) {
    // Pick ONE of the following calls; each call consumes one record from the stream.

    // If the writer schema corresponds to the specified type
    val element = Avro.decodeFromStream<MyType>(inputStream)

    // If the writer schema does not correspond to the specified type
    // val element = Avro.decodeFromStream<MyType>(writerSchema, inputStream)

    // With explicit writer schema and serializer
    // val element = Avro.decodeFromStream(writerSchema, serializer, inputStream)
}

EDIT: added the example to this ticket in case other people face the same issue.