amazon-ion / ion-java

Java streaming parser/serializer for Ion.
https://amazon-ion.github.io/ion-docs/
Apache License 2.0

SeekableReader not available from readers over InputStream #17

Open almann opened 8 years ago

almann commented 8 years ago

Imported from ION-243,IONJAVA-102

toddjonker commented 6 years ago

@tgregg Why was this closed? There are good use cases for this, and at least some scenarios where we can support seeking.

wilkerlucio commented 1 year ago

Hello,

I'm trying to understand what this issue covers. I'm looking for a way to seek to a specific item inside an Ion file. In the text format, I can seek to the position and start reading there, and it works fine, but I can't figure out how to do the same with the binary format. Is this issue related to a feature like that? Or is there already a way to seek to and read a specific entry in the binary format?

Thanks.

tgregg commented 1 year ago

@wilkerlucio Can you share more information about what you're trying to do? Are you trying to skip forward in a stream until you find the value you're looking for, or do you need to be able to seek back to a value you've seen previously? Can you share the code that you mentioned works with text Ion but not binary Ion?

wilkerlucio commented 1 year ago

Sure. I have a system in which I index records stored in file data formats. The idea is that when I look something up (for example, the record with ID 123), an index tells me that record 123 is stored at offset 452315 of a given file. With the offset in hand, I open the file, skip to that offset, and read the record that starts at that point. This works fine with the text format. Here is a snippet demonstrating it:


package com.amazon_ion_encode_demo; // hyphens are not legal in Java package names

import com.amazon.ion.IonReader;
import com.amazon.ion.IonType;
import com.amazon.ion.IonWriter;
import com.amazon.ion.system.IonReaderBuilder;
import com.amazon.ion.system.IonTextWriterBuilder;

import java.io.*;

public class IonSkipReadDemo {
    IonReaderBuilder readerBuilder = IonReaderBuilder.standard();
    IonTextWriterBuilder textWriterBuilder = IonTextWriterBuilder.standard();

    public static void main(String[] args) {
        try {
            IonSkipReadDemo demo = new IonSkipReadDemo();

            demo.writeFile();

            demo.readSkipping(53); // 53 is the byte offset of the record with "world 4"
        } catch (Throwable e) {
            System.out.println(e.getMessage());
        }
    }

    void writeFile() throws IOException {
        try (OutputStream out = new FileOutputStream("file-java.txt");
             IonWriter textWriter = textWriterBuilder.build(out)) {
            for (long i = 1; i < 1000; i++) {
                writeHelloWorld(textWriter, "world " + i);
            }
        }
    }

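    void readSkipping(long offset) throws IOException {
        try (InputStream in = new FileInputStream("file-java.txt")) {
            // InputStream.skip may skip fewer bytes than requested, so loop until done
            long remaining = offset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new EOFException("offset " + offset + " is past the end of the file");
                }
                remaining -= skipped;
            }

            try (IonReader reader = readerBuilder.build(in)) {
                readHelloWorld(reader);
            }
        }
    }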

    void writeHelloWorld(IonWriter writer, String value) throws IOException {
        writer.stepIn(IonType.STRUCT);  // step into a struct
        writer.setFieldName("hello");   // set the field name for the next value to be written
        writer.writeString(value);    // write the next value
        writer.stepOut();               // step out of the struct
    }

    void readHelloWorld(IonReader reader) {
        reader.next();                                // position the reader at the first value, a struct
        reader.stepIn();                              // step into the struct
        reader.next();                                // position the reader at the first value in the struct
        String fieldName = reader.getFieldName();     // retrieve the current value's field name
        String value = reader.stringValue();          // retrieve the current value's String value
        reader.stepOut();                             // step out of the struct
        System.out.println(fieldName + " " + value);  // prints the field name and value, e.g. "hello world 4"
    }
}

But I can't figure out how to do the same with the binary format, because the reader must read the header at the beginning, and I'm also not sure how it would handle local symbol tables in this case (although, for my case, I can just use shared tables, if that helps).

Is there a way to do the same with the binary format?

tgregg commented 1 year ago

Thank you for the illustration. I understand what you're trying to do.

This works with text because (in general) text Ion does not require a symbol table. If you seek a text Ion InputStream to a byte position where a value begins, the text IonReader will be able to read the value because it is not missing any context. (Note: for Ion 1.0 data only. If new revisions are released, then the reader will need to know what version of the Ion format to read, typically identified by the Ion version marker. The marker $ion_1_0 is implied if missing.)

For binary Ion, you correctly identified part of the problem. Seeking past a symbol table throws away context that the binary IonReader may need in order to process the values that follow. Additionally, the binary Ion version marker (0xE0 0x01 0x00 0xEA for Ion 1.0) is always required in order for the reader to consider the data valid. Using shared symbol tables won't fix the problem because your shared symbol table imports are declared in a local symbol table.
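As a rough illustration of the version marker requirement, the sketch below checks whether a chunk of bytes begins with the four marker bytes quoted above. It uses only the JDK; the class and method names are made up for this example and are not part of ion-java.

```java
import java.util.Arrays;

final class IonVersionMarkerCheck {
    // Binary Ion 1.0 version marker: 0xE0 0x01 0x00 0xEA.
    private static final byte[] ION_1_0_BVM = {(byte) 0xE0, 0x01, 0x00, (byte) 0xEA};

    /** Returns true if the data begins with the binary Ion 1.0 version marker. */
    static boolean startsWithBinaryIonMarker(byte[] data) {
        if (data == null || data.length < ION_1_0_BVM.length) {
            return false;
        }
        // Compare only the first four bytes against the marker.
        return Arrays.equals(Arrays.copyOf(data, ION_1_0_BVM.length), ION_1_0_BVM);
    }
}
```

Bytes read from the middle of a binary file after a raw skip will normally fail this check, which is one reason the reader rejects them.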

Therefore, rather than calling InputStream.skip directly, seeking within an Ion stream needs to be done in a way that allows the IonReader to consume any symbol tables or Ion version markers that may occur between top-level values. The SeekableReader facet was created to help with use-cases like yours, where you want to pre-process the data to build up an index of values that you can quickly seek back to later. However, currently, the SeekableReader is only supported when you provide your IonReader with a byte[], not an InputStream, hence the existence of this issue.

You can still quickly seek past binary Ion values, however; the Ion 1.0 specification optimizes for this case by requiring all values to be prefixed with their length. Skipping a value simply requires the reader to parse the value's length from its header, then seek ahead by that length. The IonReader will parse any symbol tables and version markers that may occur between the values it skips, so it will always have the context it needs to read the values that follow.
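To make the length-prefix idea concrete, here is a toy sketch that uses only the JDK. It is not Ion's binary encoding: it assumes a made-up format in which every record is a 4-byte big-endian length followed by UTF-8 bytes, and the class and method names are hypothetical. It only shows why a length header lets a reader jump over a value without decoding it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedSkipDemo {
    /** Encodes each string as a 4-byte big-endian length followed by its UTF-8 bytes. */
    static ByteBuffer encode(String... records) {
        int size = 0;
        byte[][] payloads = new byte[records.length][];
        for (int i = 0; i < records.length; i++) {
            payloads[i] = records[i].getBytes(StandardCharsets.UTF_8);
            size += Integer.BYTES + payloads[i].length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(size);
        for (byte[] payload : payloads) {
            buffer.putInt(payload.length).put(payload);
        }
        buffer.flip(); // switch from writing to reading
        return buffer;
    }

    /** Returns the record at the zero-based index, skipping earlier payloads unread. */
    static String readAt(ByteBuffer data, int index) {
        ByteBuffer view = data.duplicate(); // leave the caller's position untouched
        for (int i = 0; i < index; i++) {
            int length = view.getInt();               // parse only the length header
            view.position(view.position() + length);  // jump over the payload
        }
        byte[] payload = new byte[view.getInt()];
        view.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

For example, `readAt(encode("world 1", "world 2", "world 3", "world 4"), 3)` reads three length headers and decodes only the fourth payload.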

To achieve something similar to what you're attempting above using the functionality available in the library today, I recommend recording each value's index rather than its byte position during your indexing pass. Then, create your IonReader from the start of the InputStream, and call IonReader.next() enough times to position the reader at the desired value index.

For example:

    void readSkipping(long valueIndex) throws IOException {
        try (IonReader reader = readerBuilder.build(new FileInputStream("file-java.10n"))) {
            for (int i = 0; i < valueIndex; i++) {
                reader.next();
            }
            readHelloWorld(reader);
        }
    }

This assumes you only have one index to revisit; if you have more than one, visiting them all in order using the same IonReader will be more efficient.
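The single-pass idea for several indices could look roughly like the sketch below. To keep it runnable without the library, a plain java.util.Iterator stands in for the forward-only IonReader; the class and method names are hypothetical, and indices are assumed to be zero-based.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class SinglePassLookup {
    /**
     * Returns the values at the requested zero-based indices, in ascending index order.
     * The iterator is advanced once from the start and never rewound.
     */
    static <T> List<T> valuesAt(Iterator<T> values, long... indices) {
        long[] sorted = indices.clone();
        Arrays.sort(sorted); // visit targets in stream order
        if (sorted.length > 0 && sorted[0] < 0) {
            throw new IllegalArgumentException("indices must be non-negative");
        }
        List<T> found = new ArrayList<>();
        long current = 0; // index of the value about to be read
        int next = 0;     // position of the next wanted index in 'sorted'
        while (next < sorted.length && values.hasNext()) {
            T value = values.next();
            // A repeated index yields the same value more than once.
            while (next < sorted.length && sorted[next] == current) {
                found.add(value);
                next++;
            }
            current++;
        }
        return found; // shorter than 'indices' if the stream ended early
    }
}
```

With an IonReader, the same loop shape would apply, with `reader.next()` returning null marking the end of the stream; the pass stops as soon as the last requested index has been read.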

wilkerlucio commented 1 year ago

Hello @tgregg, thanks for the detailed response.

Yes, skipping forward like that is valid, but it only performs well up to a point. In my use case, I need low latency, I have huge files (gigabytes of data in a single file), and the file system is a networked one. In this scenario, if a record sits close to the end of a gigabyte-sized file and I have to keep skipping over a networked file system, it won't perform well enough for my requirements.

I'm looking forward to the SeekableReader supporting InputStream. I think that will just be what I need to make this work :)