pavlexander opened 1 year ago
I don't get it. If we look at the first candle, the first decimal value is 0.08m
(I got it from the typed reader).
If we convert the value to bits with var decimalBits = Decimal.GetBits(0.08M);
the result is 8 0 0 131072
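For reference, Decimal.GetBits returns the four ints in (lo, mid, hi, flags) order, which is why 8 0 0 131072 is the expected pattern for 0.08m:

```csharp
using System;

int[] bits = decimal.GetBits(0.08m);
// bits[0..2] hold the low/mid/high 32 bits of the 96-bit mantissa: 8, 0, 0
// bits[3] holds the sign (bit 31) and the scale (bits 16-23):
//   131072 = 2 << 16, i.e. scale 2, so the value is 8 * 10^-2 = 0.08
int scale = (bits[3] >> 16) & 0xFF;
Console.WriteLine($"mantissa {bits[0]}, scale {scale}");
```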
Then I try to read the values manually (with the untyped reader):
using FileStream stream = new FileStream(fileFullPath, FileMode.Open);
using var tf = TeaFile.OpenRead(stream);
using var br = new BinaryReader(stream);

var itemAreaStart = tf.ItemAreaStart;
var openTsColumn = tf.Description.ItemDescription.GetFieldByName("OpenTs");
var openTsColumnOffset = openTsColumn.Offset;
var itemSize = tf.Description.ItemDescription.ItemSize;
var itemsCount = tf.ItemAreaSize / itemSize;

for (int i = 0; i < itemsCount; i++)
{
    var itemOffset = i * itemSize;
    var startAt = itemAreaStart + itemOffset + openTsColumnOffset;
    stream.Seek(startAt, SeekOrigin.Begin);

    var openTs = br.ReadUInt32();
    var decimalVal = br.ReadDecimal(); // exception here

    if (openTs >= from)
    {
        result.Add(default);
    }
}
but I get the exception reported previously:
Decimal constructor requires an array or span of four valid decimal bytes
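For what it's worth, that message comes from the flags validation in the Decimal(int[]) constructor: the fourth int may only carry the sign bit and a scale of 0-28, so a read that lands mantissa bytes in the flags slot is rejected. A minimal repro of the valid and the invalid case:

```csharp
using System;

// Valid bits (lo, mid, hi, flags): scale 2 encoded in bits 16-23 of flags
var ok = new decimal(new[] { 8, 0, 0, 2 << 16 }); // 0.08m

// Invalid: 8 in the flags slot sets bit 3, which is neither sign nor scale
try
{
    var bad = new decimal(new[] { 0, 131072, 0, 8 });
    Console.WriteLine("unexpectedly accepted");
}
catch (ArgumentException)
{
    Console.WriteLine("rejected: flags word is not a valid decimal flags value");
}
```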
So I dug further and extracted the bytes that represent the first decimal value:
var openTs = br.ReadUInt32();
//var decimalVal = br.ReadDecimal(); // exception
var decimalBytes = br.ReadBytes(16);
var decimalVal = Read(0, decimalBytes);
where the Read method is:
public static decimal Read(int startIndex, byte[] buffer)
{
    Span<int> int32s = stackalloc int[4];
    ReadOnlySpan<byte> bufferSpan = buffer.AsSpan();
    for (int i = 0; i < 4; i++)
    {
        var slice = bufferSpan.Slice(startIndex + i * 4, 4);
        int32s[i] = BitConverter.ToInt32(slice);
    }
    return new decimal(int32s);
}
Then I get the same kind of exception as before, but now I am able to inspect the bits that represent the decimal number:
the bits are 0 131072 0 8; as a reminder, the correct values are 8 0 0 131072.
So to me it does look like TeaFiles is saving the bytes in some weird order, and hence I can't deserialize the value manually. The untyped reader seems broken to me, unless I am doing something wrong, of course.
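Side note on the observed order: 0 131072 0 8 could mean the ints are laid out as (hi, flags, mid, lo), or simply that my read started at the wrong offset; with mid and hi both zero, several permutations fit, so the index mapping below is only a guess, not a documented TeaFiles layout. Reordering the observed ints back into (lo, mid, hi, flags) does recover the value:

```csharp
using System;

int[] observed = { 0, 131072, 0, 8 };  // what the untyped read produced
// Guessed mapping back to the GetBits order (lo, mid, hi, flags)
int[] reordered = { observed[3], observed[2], observed[0], observed[1] };
var value = new decimal(reordered);    // value == 0.08m
Console.WriteLine(value);
```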
To answer your original question, "I would like to select/read only a set of data from file based on criteria. Is there an optimal approach for doing it?":
Since the advent of 64-bit machines, the secret among quant analysts of larger time series is to store structs in files and memory-map them. Soon you get the problem that you have many files and do not know what kind of structs they hold. TeaFiles solve this problem by adding an (optional) description to the file. Beyond that, TeaFiles just store raw structs.
Reading selected fields of such structs does not differ from reading the whole struct when the file is read via memory mapping. For sure, all fields of the structs are mapped and it can be useful to create a derived file that holds only those fields that are often read afterwards.
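To make the memory-mapping point concrete, here is a minimal sketch independent of TeaFiles, assuming a file of raw fixed-size records starting at a known offset. The struct layout and names are illustrative, not the TeaFiles schema, and a double stands in for the decimal field to keep the struct trivially blittable:

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct Candle          // illustrative record layout
{
    public uint OpenTs;
    public double Open;
}

public static class MappedScan
{
    // Read one record by index without deserializing the rest of the file;
    // itemAreaStart would come from the TeaFile header in practice.
    public static Candle ReadAt(string path, long itemAreaStart, long index)
    {
        using var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        using var accessor = mmf.CreateViewAccessor();
        accessor.Read(itemAreaStart + index * Marshal.SizeOf<Candle>(), out Candle item);
        return item;
    }
}
```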
Reading selected fields without memory mapping is easy: either read the whole struct and then pick the required fields, or create a reader that skips the non-required fields to avoid composing numbers like a decimal from the raw bytes. That said, TeaFiles use BinaryReader for that purpose, which is expected to be solid, but maybe (I really do not know at the moment) that could be done faster.
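The skip-reader idea can be sketched like this (a hypothetical helper; the item size and field offset come from the file description, and the skip is just a relative Seek):

```csharp
using System.IO;

public static class FieldScan
{
    // Read only the OpenTs field of each record, skipping the remaining
    // bytes of every item with a relative Seek instead of parsing them.
    public static uint[] ReadOpenTsColumn(Stream stream, long itemAreaStart,
                                          int itemSize, int fieldOffset, long count)
    {
        var result = new uint[count];
        var br = new BinaryReader(stream);
        stream.Seek(itemAreaStart + fieldOffset, SeekOrigin.Begin);
        for (long i = 0; i < count; i++)
        {
            result[i] = br.ReadUInt32();
            if (i < count - 1)
                stream.Seek(itemSize - sizeof(uint), SeekOrigin.Current);
        }
        return result;
    }
}
```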
Your detailed report about the problems above is a different thing; I hope to find the time to dig into that soon.
I would like to select/read only a set of data from the file based on criteria. Is there an optimal approach for doing it?

attempt 1

(collapsed "model" and "method" code snippets)

Given that my data in the file is sorted by OpenTs, I would like to filter out the values that are not within a specific range, as in the example above.

issue: this approach is really inefficient, because the whole Item is being read and mapped right away. It's slow. Not solving the problem.

attempt 2

I have also tried using the unmapped approach, but an exception is thrown upon read. I have managed to extract the part of the data that causes the issue: https://github.com/pavlexander/testfile/blob/main/ETHBTC_big.7z

There were no issues with 10k, 50k, or 100k records, but at 1 million records I started getting the error. Please download and unpack the file, then use the following code to repro:

(collapsed "issue" code snippet)

Even if this solution worked, there is no guarantee that it would work faster than approach 1. In fact, on a smaller dataset where no exceptions are thrown, approach 1 performs many times faster than approach 2 on my machine. The error aside, I also want to know how to map an item to a struct.

conclusion

The original question still stands: how do I filter the data based on criteria and avoid reading the whole file?
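One concrete answer, given that the records are fixed-size and sorted by OpenTs: binary-search the record indices, which seeks only O(log n) times, then read the matching range sequentially. A sketch with a hypothetical helper, with the item size and field offset obtained from the file description as in the code above:

```csharp
using System.IO;

public static class SortedScan
{
    // Index of the first record whose OpenTs >= from, found by binary
    // search over fixed-size records; only O(log n) reads touch the file.
    public static long LowerBound(Stream stream, long itemAreaStart, int itemSize,
                                  int openTsOffset, long itemCount, uint from)
    {
        var br = new BinaryReader(stream);
        long lo = 0, hi = itemCount;
        while (lo < hi)
        {
            long mid = lo + (hi - lo) / 2;
            stream.Seek(itemAreaStart + mid * itemSize + openTsOffset, SeekOrigin.Begin);
            if (br.ReadUInt32() < from) lo = mid + 1; else hi = mid;
        }
        return lo; // then read records sequentially from here while OpenTs <= to
    }
}
```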