Open almann opened 8 years ago
@tgregg Why was this closed? There's good use cases for this, and at least some scenarios where we can support seeking.
Hello,
I'm trying to learn more about what this issue covers. I'm looking for a way to seek to a specific item inside an Ion file. In the text form, I can seek to the position and start reading there, and it works fine, but I can't figure out how to do the same using the binary format. Is this issue related to a feature like that? Or is there already some way to seek to and read a specific entry using the binary format?
Thanks.
@wilkerlucio Can you share more information about what you're trying to do? Are you trying to skip forward in a stream until you find the value you're looking for, or do you need to be able to seek back to a value you've seen previously? Can you share the code that you mentioned works with text Ion but not binary Ion?
Sure. I have a system that indexes records stored in files of various data formats. When I look something up (for example, the record with ID 123), the index tells me that record 123 is stored at offset 452315 of a given file. With the offset in hand, I open the file, skip to that offset, and read the record that starts at that point. This works fine with the text format. Here is a snippet demonstrating it:
package com.amazon_ion_encode_demo; // hyphens are not legal in Java package names

import com.amazon.ion.IonReader;
import com.amazon.ion.IonType;
import com.amazon.ion.IonWriter;
import com.amazon.ion.system.IonReaderBuilder;
import com.amazon.ion.system.IonTextWriterBuilder;
import java.io.*;

public class IonSkipReadDemo {
    IonReaderBuilder readerBuilder = IonReaderBuilder.standard();
    IonTextWriterBuilder textWriterBuilder = IonTextWriterBuilder.standard();

    public static void main(String[] args) {
        try {
            IonSkipReadDemo demo = new IonSkipReadDemo();
            demo.writeFile();
            demo.readSkipping(53); // 53 is the byte offset of the record with "world 4"
        } catch (Throwable e) {
            System.out.println(e.getMessage());
        }
    }

    void writeFile() throws IOException {
        try (OutputStream out = new FileOutputStream("file-java.txt");
             IonWriter textWriter = textWriterBuilder.build(out)) {
            for (long i = 1; i < 1000; i++) {
                writeHelloWorld(textWriter, "world " + i);
            }
        }
    }

    void readSkipping(long offset) throws IOException {
        InputStream in = new FileInputStream("file-java.txt");
        in.skip(offset);
        try (IonReader reader = readerBuilder.build(in)) {
            readHelloWorld(reader);
        }
    }

    void writeHelloWorld(IonWriter writer, String value) throws IOException {
        writer.stepIn(IonType.STRUCT); // step into a struct
        writer.setFieldName("hello"); // set the field name for the next value to be written
        writer.writeString(value); // write the next value
        writer.stepOut(); // step out of the struct
    }

    void readHelloWorld(IonReader reader) {
        reader.next(); // position the reader at the first value, a struct
        reader.stepIn(); // step into the struct
        reader.next(); // position the reader at the first value in the struct
        String fieldName = reader.getFieldName(); // retrieve the current value's field name
        String value = reader.stringValue(); // retrieve the current value's String value
        reader.stepOut(); // step out of the struct
        System.out.println(fieldName + " " + value); // prints the field name and value, e.g. "hello world 4"
    }
}
But I can't figure out how to do the same with the binary format, because the reader must read the header at the beginning, and I'm also not sure how it would handle the local symbol tables in this case (although, in my case, I can just use shared tables, if that helps).
Is there a way to do the same with the binary format?
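One caveat about the text snippet above, independent of Ion: InputStream.skip is allowed to skip fewer bytes than requested, so a single call is not guaranteed to land on the recorded offset. A minimal sketch of a skip loop follows; StreamSkip and skipFully are hypothetical names used only for illustration, and on Java 12 or later InputStream.skipNBytes covers the same need.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class StreamSkip {
    /**
     * Skips exactly n bytes or throws EOFException.
     * Hypothetical helper for illustration; not part of ion-java.
     */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() made no progress: read one byte to tell "slow stream" from EOF.
            if (in.read() == -1) {
                throw new EOFException("stream ended with " + remaining + " bytes left to skip");
            }
            remaining--;
        }
    }
}
```

In the demo, StreamSkip.skipFully(in, offset) would take the place of in.skip(offset).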
Thank you for the illustration. I understand what you're trying to do.
This works with text because (in general) text Ion does not require a symbol table. If you seek a text Ion InputStream to a byte position where a value begins, the text IonReader will be able to read the value because it is not missing any context. (Note: for Ion 1.0 data only. If new revisions are released, then the reader will need to know what version of the Ion format to read, typically identified by the Ion version marker. The marker $ion_1_0 is implied if missing.)
For binary Ion, you correctly identified part of the problem. Seeking past a symbol table throws away context that the binary IonReader may need in order to process the values that follow. Additionally, the binary Ion version marker (0xE0 0x01 0x00 0xEA for Ion 1.0) is always required in order for the reader to consider the data valid. Using shared symbol tables won't fix the problem because your shared symbol table imports are declared in a local symbol table, which seeking would also skip past.
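To make the marker requirement concrete, here is a minimal sketch that checks whether a buffer begins with the four marker bytes quoted above. BinaryIonMarker and startsWithIvm are hypothetical names for illustration, not ion-java API. A slice taken from the middle of a binary file will normally fail this check, which is consistent with the reader not accepting it.

```java
final class BinaryIonMarker {
    // Binary Ion version marker for Ion 1.0, as quoted in the comment above.
    static final byte[] IVM = {(byte) 0xE0, 0x01, 0x00, (byte) 0xEA};

    /** Returns true if data begins with the 4-byte Ion 1.0 version marker. */
    static boolean startsWithIvm(byte[] data) {
        if (data.length < IVM.length) {
            return false;
        }
        for (int i = 0; i < IVM.length; i++) {
            if (data[i] != IVM[i]) {
                return false;
            }
        }
        return true;
    }
}
```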
Therefore, rather than calling InputStream.skip directly, seeking within an Ion stream needs to be done in a way that allows the IonReader to consume any symbol tables or Ion version markers that may occur between top-level values. The SeekableReader facet was created to help with use cases like yours, where you want to pre-process the data to build up an index of values that you can quickly seek back to later. However, currently, the SeekableReader is only supported when you provide your IonReader with a byte[], not an InputStream, hence the existence of this issue.
You can still quickly seek past binary Ion values, however; the Ion 1.0 specification optimizes for this case by requiring all values to be prefixed with their length. Skipping a value simply requires the reader to parse the value's length from its header, then seek ahead by that length. The IonReader will parse any symbol tables and version markers that may occur between the values it skips, so it will always have the context it needs to read the values that follow.
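The length-prefix idea can be sketched roughly as follows. This is a simplified reading of the Ion 1.0 binary encoding, not ion-java's implementation: IonValueLength and encodedSize are hypothetical names, the code does no validation, and it assumes the offset points at a value's type descriptor. Version markers are not values and must be recognized separately, and symbol tables still have to be read rather than skipped, which is the work the IonReader does for you.

```java
final class IonValueLength {
    /**
     * Returns the total encoded size (type descriptor + length field + body)
     * of the Ion 1.0 binary value starting at data[offset].
     * Simplified sketch: no validation, version markers not handled.
     */
    static long encodedSize(byte[] data, int offset) {
        int descriptor = data[offset] & 0xFF;
        int type = descriptor >>> 4;        // high nibble: type code
        int lowNibble = descriptor & 0x0F;  // low nibble: length code

        if (lowNibble == 15) {
            return 1; // typed null: descriptor only, no body
        }
        if (type == 1) {
            return 1; // bool: the value itself lives in the low nibble
        }
        // Length code 14 means "length follows as a VarUInt";
        // a struct (type 13) with length code 1 also carries a VarUInt length.
        boolean hasVarUIntLength = lowNibble == 14 || (type == 13 && lowNibble == 1);
        if (!hasVarUIntLength) {
            return 1 + lowNibble; // body length is the low nibble itself
        }

        // VarUInt: 7 payload bits per octet, high bit set on the final octet.
        long length = 0;
        int pos = offset + 1;
        while (true) {
            int octet = data[pos++] & 0xFF;
            length = (length << 7) | (octet & 0x7F);
            if ((octet & 0x80) != 0) {
                break;
            }
        }
        return (pos - offset) + length;
    }
}
```

Skipping a value then amounts to advancing the position by encodedSize(data, offset) without touching the body bytes.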
To achieve something similar to what you're attempting above using the functionality available in the library today, I recommend recording the value index rather than the byte position during your indexing pass. Then, create your IonReader from the start of the InputStream, and call IonReader.next() enough times to position the reader at the desired value index.
For example:
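void readSkipping(long valueIndex) throws IOException {
    try (IonReader reader = readerBuilder.build(new FileInputStream("file-java.10n"))) {
        // Skip the first valueIndex top-level values (valueIndex is zero-based).
        for (long i = 0; i < valueIndex; i++) {
            reader.next();
        }
        // readHelloWorld calls next() itself, landing on the value at valueIndex.
        readHelloWorld(reader);
    }
}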
This assumes you only have one index to revisit; if you have more than one, visiting them all in order using the same IonReader will be more efficient.
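Visiting several indexes in one forward pass could look roughly like this. To keep the sketch self-contained, a plain java.util.Iterator stands in for the forward-only IonReader (each iterator step plays the role of one top-level IonReader.next() call); OrderedIndexVisit and valuesAt are hypothetical names, not ion-java API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class OrderedIndexVisit {
    /**
     * Collects the values at the given zero-based indexes in a single forward pass.
     * The Iterator is a stand-in for a forward-only reader; results come back
     * in ascending index order, and duplicate indexes are collected once.
     */
    static <T> List<T> valuesAt(Iterator<T> values, long[] indexes) {
        long[] sorted = indexes.clone();
        Arrays.sort(sorted); // ascending order means the cursor never has to go back

        List<T> found = new ArrayList<>();
        long position = 0; // index of the value the next call to next() returns
        for (long target : sorted) {
            if (target < position) {
                continue; // duplicate of an index that was already collected
            }
            while (position < target && values.hasNext()) {
                values.next(); // skip values that were not requested
                position++;
            }
            if (!values.hasNext()) {
                break; // requested index lies past the end of the data
            }
            found.add(values.next());
            position++;
        }
        return found;
    }
}
```

The total number of forward steps is bounded by the largest requested index, rather than by the sum of all of them as it would be with one fresh reader per lookup.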
Hello @tgregg, thanks for the detailed response.
Yes, jumping forward works, but it's only performant up to a point. In my use case, I need low latency, I have huge files (gigabytes of data in a single file), and the file system is a networked one. In this scenario, if I have a record close to the end of a gigabyte-sized file and I have to keep jumping over a networked file system, it won't perform well enough for my requirements.
I'm looking forward to the SeekableReader supporting InputStream. I think that will be just what I need to make this work :)
Imported from ION-243,IONJAVA-102