extracting only a part of the tea file

pavlexander commented 1 year ago

I would like to select/read only a set of data from file based on criteria. Is there an optimal approach for doing it?

attempt 1

model

    public struct CandleInDbNew
    {
        public uint OpenTs;

        public decimal OpenPrice;
        public decimal HighPrice;
        public decimal LowPrice;
        public decimal ClosePrice;

        public uint TradeCount;

        public decimal Volume;
        public decimal QuoteAssetVolume;
        public decimal TakerBuyBaseAssetVolume;
        public decimal TakerBuyQuoteAssetVolume;
    }

method

        public List<CandleInDbNew> GetCandlesInRange(
            string fileFullPath,
            uint from)
        {
            var result = new List<CandleInDbNew>();

            if (!File.Exists(fileFullPath))
            {
                return result;
            }

            using (var tf = TeaFile<CandleInDbNew>.OpenRead(fileFullPath,
                    ItemDescriptionElements.FieldNames |
                    ItemDescriptionElements.FieldTypes |
                    ItemDescriptionElements.FieldOffsets |
                    ItemDescriptionElements.ItemSize))
            {
                foreach (var item in tf.Items)
                {
                    if (item.OpenTs >= from)
                        result.Add(item);
                }
            }

            return result;
        }

Given that the data in the file is sorted by OpenTs, I would like to skip the values that fall outside a specific range, as in the example above.

issue

This approach is inefficient because every item is read and mapped in full before the filter is applied. It is slow and does not solve the problem.

attempt 2

I have also tried the untyped (unmapped) approach, but an exception is thrown on read:

System.IO.IOException: 'Decimal constructor requires an array or span of four valid decimal bytes.'

I have managed to extract part of the data that causes the issue. https://github.com/pavlexander/testfile/blob/main/ETHBTC_big.7z

There were no issues with 10k, 50k, or 100k records, but at 1 million records I started getting the error. Please download and unpack the file, then use the following code to reproduce:

            var result = new List<CandleInDbNew>();

            using (var tf = TeaFile.OpenRead("ETHBTC_big.tea")) // exception here
            {
                var openTsColumn = tf.Description.ItemDescription.GetFieldByName("OpenTs");

                foreach (Item item in tf.Items)
                {
                    var openTs = (uint)openTsColumn.GetValue(item);

                    if (openTs >= 1692190740)
                        result.Add(default); // temporary
                }
            }

issue

Even if this solution worked, there is no guarantee that it would be faster than approach 1. In fact, on a smaller dataset where no exception is thrown, approach 1 runs many times faster than approach 2 on my machine. Setting the error aside, I would also like to know how to map an item to a struct.

conclusion

The original question still stands: how can I filter the data based on criteria without reading the whole file?
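
Because the file is sorted by OpenTs and every item has the same size, the first matching item can in principle be located with a binary search that reads only the 4-byte key of each probed item. The following is a minimal sketch under those assumptions; SortedRecordSearch and LowerBound are hypothetical names that are not part of TeaFiles, and the key is assumed to be a little-endian uint.

```csharp
using System;
using System.IO;

public static class SortedRecordSearch
{
    // Returns the index of the first record whose key is >= target,
    // or `count` if there is no such record.
    // Assumptions: records are fixed-size, sorted ascending by key, and the key
    // is a little-endian uint stored keyOffset bytes into each record.
    public static long LowerBound(Stream stream, long itemAreaStart, int itemSize,
                                  int keyOffset, long count, uint target)
    {
        // leaveOpen: true so that disposing the reader does not close the caller's stream.
        using var reader = new BinaryReader(stream, System.Text.Encoding.UTF8, leaveOpen: true);

        long lo = 0, hi = count;
        while (lo < hi)
        {
            long mid = lo + (hi - lo) / 2;

            // Jump straight to the key of the probed record; nothing else is read.
            stream.Seek(itemAreaStart + mid * itemSize + keyOffset, SeekOrigin.Begin);
            uint key = reader.ReadUInt32();

            if (key < target)
                lo = mid + 1;   // first match lies strictly after mid
            else
                hi = mid;       // mid itself may be the first match
        }
        return lo;
    }
}
```

With the values used in attempt 3 below, the arguments would presumably be tf.ItemAreaStart, the ItemSize from the item description, openTsColumn.Offset, and tf.ItemAreaSize / itemSize. Items from the returned index onward can then be read sequentially, so the search costs roughly log2(count) seeks instead of a full scan.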

pavlexander commented 1 year ago

attempt 3

I don't get it. If we look at the first candle, the first decimal value is 0.08m (obtained from the typed reader).

If we convert the value to bits with var decimalBits = Decimal.GetBits(0.08M); the result is 8 0 0 131072.

Then I try to read the values manually (with the untyped reader):

    using FileStream stream = new FileStream(fileFullPath, FileMode.Open);
    using var tf = TeaFile.OpenRead(stream);
    using var br = new BinaryReader(stream);

    var itemAreaStart = tf.ItemAreaStart;
    var openTsColumn = tf.Description.ItemDescription.GetFieldByName("OpenTs");
    var openTsColumnOffset = openTsColumn.Offset;
    var itemSize = tf.Description.ItemDescription.ItemSize;
    var itemsCount = tf.ItemAreaSize / itemSize;
    for (int i = 0; i < itemsCount; i++)
    {
        var itemOffset = i * itemSize;
        var startAt = itemAreaStart + itemOffset + openTsColumnOffset;
        stream.Seek(startAt, SeekOrigin.Begin);

        var openTs = br.ReadUInt32();
        var decimalVal = br.ReadDecimal(); // exception here

        if (openTs >= from)
        {
            result.Add(default);
        }
    }

but I get the same exception as reported previously:

Decimal constructor requires an array or span of four valid decimal bytes

So I dug further and got the bytes that represent the first decimal value:

                var openTs = br.ReadUInt32();

                //var decimalVal = br.ReadDecimal(); // exception
                var decimalBytes = br.ReadBytes(16);
                var decimalVal = Read(0, decimalBytes);

where the Read method is:

        public static decimal Read(int startIndex, byte[] buffer)
        {
            Span<int> int32s = stackalloc int[4];
            ReadOnlySpan<byte> bufferSpan = buffer.AsSpan();
            for (int i = 0; i < 4; i++)
            {
                var slice = bufferSpan.Slice(startIndex + i * 4);
                int32s[i] = BitConverter.ToInt32(slice);
            }
            return new decimal(int32s);
        }

Then I get the same kind of exception as before, but I am able to verify the bits that represent the decimal number.

The bits are: 0 131072 0 8. As a reminder, the correct values are: 8 0 0 131072.

So to me it looks like TeaFiles is saving the bytes in some unexpected order, hence I can't deserialize the value manually. The untyped reader seems to be broken, unless I am doing something wrong, of course.
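
One hypothesis that would fit the bits reported above (an assumption, not something confirmed in this thread): Decimal.GetBits and BinaryReader.ReadDecimal work with the logical order lo, mid, hi, flags, while the runtime may keep the four 32-bit words of a decimal in a different order in memory, and the struct may contain padding between the 4-byte OpenTs and the first decimal. The sketch below only makes the first part checkable on a given machine; DecimalLayoutCheck is a hypothetical helper, not part of TeaFiles.

```csharp
using System;
using System.Runtime.InteropServices;

public static class DecimalLayoutCheck
{
    // Logical parts as returned by Decimal.GetBits: lo, mid, hi, flags.
    public static int[] LogicalBits(decimal value) => decimal.GetBits(value);

    // The same value reinterpreted as four raw 32-bit words, in whatever order
    // the current runtime stores them in memory. No reordering is applied here.
    public static int[] RawWords(decimal value)
    {
        Span<decimal> one = stackalloc decimal[1];
        one[0] = value;
        return MemoryMarshal.Cast<decimal, int>(one).ToArray();
    }
}
```

Printing string.Join(" ", DecimalLayoutCheck.RawWords(0.08m)) next to the GetBits result shows whether the two orders differ. If the raw order turns out to be flags, hi, lo, mid, then the observed 0 131072 0 8 would be consistent with four padding bytes followed by flags, hi, and lo, read starting 4 bytes too early. Comparing the Offset of the OpenPrice field (via GetFieldByName, as done for OpenTs above) with the OpenTs offset plus 4 would confirm or rule out the padding part.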

thulka commented 1 year ago

To answer your original question "I would like to select/read only a set of data from file based on criteria. Is there an optimal approach for doing it?":

Since the advent of 64bit machines, the secret among quant analysts working with larger time series is to store structs in files and memory map them. Soon you run into the problem that you have many files and do not know what kind of structs they hold. TeaFiles solve this problem by adding an (optional) description to the file. Besides that, TeaFiles just store raw structs.

Reading selected fields of such structs does not differ from reading the whole struct when the file is read via memory mapping. All fields of the structs are mapped in any case, so it can be useful to create a derived file that holds only the fields that are read often.
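
As an illustration of the memory-mapping idea, here is a minimal sketch that uses the standard System.IO.MemoryMappedFiles API directly rather than anything from TeaFiles. MappedKeyScan is a hypothetical helper, and the item area start, item size, key offset, and item count are assumed to be known, for example from the file description.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedKeyScan
{
    // Counts the records whose uint key is >= target.
    // Assumption: fixed-size records starting at itemAreaStart, with the key
    // stored keyOffset bytes into each record.
    public static long CountFrom(string path, long itemAreaStart, int itemSize,
                                 int keyOffset, long count, uint target)
    {
        // Capacity 0 maps the file at its current size; the file must not be empty.
        using var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
        using var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);

        long matches = 0;
        for (long i = 0; i < count; i++)
        {
            // Only the 4 key bytes are touched; the other fields are never decoded.
            uint key = view.ReadUInt32(itemAreaStart + i * itemSize + keyOffset);
            if (key >= target)
                matches++;
        }
        return matches;
    }
}
```

The operating system pages the file in on demand, so the loop skips decoding the unused fields, although every page that holds a key is still loaded.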

Reading selected fields without memory mapping is easy: either read the whole struct and then pick the required fields, or create a reader that skips the non-required fields to avoid composing numbers like a decimal from the raw bytes. That said, TeaFiles use BinaryReader for that purpose, which is expected to be solid, but maybe (I really do not know at the moment) it can be done faster.
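
One way to read a whole item without composing each field through BinaryReader is to copy the raw bytes of one item and reinterpret them as the struct. The sketch below is a hedged illustration using the standard MemoryMarshal API. RawItemReader is a hypothetical helper, and it assumes that the in-memory layout of T on the reading machine matches the layout stored in the file.

```csharp
using System;
using System.IO;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class RawItemReader
{
    // Reads the item at `index` by copying its raw bytes and reinterpreting them as T.
    // Assumption: items are stored back to back from itemAreaStart, each occupying
    // exactly Unsafe.SizeOf<T>() bytes, in the same layout the runtime uses for T.
    public static T ReadAt<T>(Stream stream, long itemAreaStart, long index)
        where T : unmanaged
    {
        int size = Unsafe.SizeOf<T>();
        Span<byte> buffer = stackalloc byte[size];

        stream.Seek(itemAreaStart + index * size, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop until full.
        int read = 0;
        while (read < size)
        {
            int n = stream.Read(buffer.Slice(read));
            if (n == 0)
                throw new EndOfStreamException();
            read += n;
        }

        // Reinterpret the bytes as T without decoding the fields one by one.
        return MemoryMarshal.Read<T>(buffer);
    }
}
```

For the struct in this issue, the call would presumably look like RawItemReader.ReadAt<CandleInDbNew>(stream, tf.ItemAreaStart, index), provided Unsafe.SizeOf<CandleInDbNew>() equals the ItemSize stored in the file description. If the two sizes differ, the layouts do not match and this shortcut should not be used.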

Your detailed report about the problems above is a separate matter; I hope to find the time to dig into it soon.
