Few things...

Your candle is a struct - copying it to `Memory<T>` (let alone onto a `MemoryStream`) when you can use the non-async `Serialize` method to get a stack-based `Span<byte>` will be significantly faster vs allocating a pointer on the heap.

You're copying the serialized bytes to the heap and subsequently copying them 3 more times unnecessarily. You pass `Memory<byte>` to `MemoryStream`, which copies data from one array to another array for no reason, and then you're copying the entire stream again with `ToArray` - iterating over the stream's internal array byte-by-byte, making a copy of yet another array that gets passed to `File.WriteAllBytes`. If you just use a `FileStream` (e.g. `FileStream fs = File.OpenWrite(path)`) and a `Span<byte>`, and then call `fs.Write(bytes)`, it eliminates all the copying and leverages the Formula 1 speed of `Span<T>`.
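To make that concrete, here's a hedged sketch of the copy chain being described - a reconstruction for illustration, not the actual test code; `candles` and the file path are placeholder names:

```csharp
// Reconstruction of the copying pattern criticized above:
using (var ms = new MemoryStream())
{
    foreach (var candle in candles)
    {
        // Heap-allocated copy of the serialized bytes (copy 1).
        Memory<byte> buffer = Hyper.HyperSerializer.Serialize(candle).ToArray();
        // MemoryStream copies them into its own internal array (copy 2).
        ms.Write(buffer.Span);
    }
    // ToArray copies the whole stream (copy 3), and File.WriteAllBytes
    // makes yet another pass over the data to write it out (copy 4).
    File.WriteAllBytes(@"e:\temp\candles.bin", ms.ToArray());
}

// The copy-free version: serialize to a stack-based Span<byte> and hand it
// straight to the FileStream - no intermediate arrays at all.
using (FileStream fs = File.OpenWrite(@"e:\temp\candles.bin"))
{
    foreach (var candle in candles)
    {
        Span<byte> bytes = Hyper.HyperSerializer.Serialize(candle);
        fs.Write(bytes);
    }
}
```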
The most important point is that you don't really need to serialize each array item one-by-one. Just pass the whole array to the `Serialize` function - like the `HyperSerializer_NoLoop_FileStream_10M` example in the script below.
In related news, I copied your code from above into LINQPad (see below), ran some benchmarks, and provided both for reference. If you use LINQPad, you can just add the NuGet packages, copy the code into a "Program" script, and finally highlight the functions you want to benchmark and hit Ctrl+Shift+B. I added a class version of the candle for further comparison vs TeaFiles (which only supports structs), and HyperSerializer is still 33% faster (see the bottom benchmark line). This is an important point because classes tend to be more suitable for stream processing and ML trading algos.

All benchmarks use 10M randomly generated candle objects (structs and/or classes).
I'm going to write a separate post ASAP to address some of your other questions, which are equally if not more important. Here's the LINQPad script...

LINQPad "Program" script type... add the HyperSerializer and TeaFiles NuGet references. TeaFiles' bits are still .NET Framework, which LINQPad (and .NET in general) doesn't play well with...
```csharp
async Task Main()
{
}

#region Highlight and CTRL + SHIFT + B
void Tea_ForEach_Struct_10M()
{
    var path = @"e:\temp\teaFile6.tea";
    using (var tf = TeaFile<CandlestickLongStruct>.Create(path))
    {
        foreach (var item in Bars)
        {
            tf.Write(item);
        }
    }
    File.Delete(path);
}

void HyperSerializer_ForEach_Struct_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        foreach (var item in Bars)
        {
            Span<byte> bytes = Hyper.HyperSerializer.Serialize(item);
            fs.Write(bytes);
        }
    }
}

void HyperSerializer_NoLoop_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        Span<byte> bytes = Hyper.HyperSerializer.Serialize(Bars);
        fs.Write(bytes);
    }
    File.Delete(path);
}

void HyperSerializer_ForEach_Class_FileStream_10M()
{
    var path = @"e:\temp\hyper.bin";
    using (var fs = File.OpenWrite(path))
    {
        foreach (var item in BarsClass)
        {
            var bytes = Hyper.HyperSerializer.Serialize(item);
            fs.Write(bytes);
        }
    }
    File.Delete(path);
}
#endregion

#region Benchmark Data
Random rand = new Random();

CandlestickLongStruct[] _bars;
public CandlestickLongStruct[] Bars => _bars ??=
    Enumerable.Range(0, 10_000_000).Select(x =>
        new CandlestickLongStruct
        {
            High = rand.Next(),
            Open = rand.Next(),
            Close = rand.Next(),
            Low = rand.Next()
        })
    .ToArray();

CandlestickLongClass[] _barsClass;
public CandlestickLongClass[] BarsClass => _barsClass ??=
    Enumerable.Range(0, 10_000_000).Select(x =>
        new CandlestickLongClass
        {
            High = rand.Next(),
            Open = rand.Next(),
            Close = rand.Next(),
            Low = rand.Next()
        })
    .ToArray();

public struct CandlestickLongStruct
{
    public int High { get; set; }
    public int Low { get; set; }
    public int Close { get; set; }
    public int Open { get; set; }
}

public class CandlestickLongClass
{
    public int High { get; set; }
    public int Low { get; set; }
    public int Close { get; set; }
    public int Open { get; set; }
}
#endregion
```
> Additionally, I would like to know how you would solve the problem of appending data to an existing file, and also getting/updating the total number of records in the file.
I have a private repo with a bunch of this stuff implemented. If you send me some details regarding what you're working on and your goals/objectives (data sources, exchanges, brokers, asset classes, strategy, etc.), we may be able to help each other out.
Hi!
Thank you very much for a very detailed answer!
Actually, the reason I am looking for a `teafiles` replacement is exactly that: I want to use classes, and the extra struct<->class mapping I currently use does not add any value. `HyperSerializer` has an advantage in that regard.
Regarding your suggested solution to just serialize the whole batch of data - I've had trouble when trying to serialize it in one go:

```csharp
var allData = Hyper.HyperSerializer.Serialize(dataStructLong);
```

Surprisingly, no such issue occurs if `SymbolTick` from the tests is used!

```csharp
Hyper.HyperSerializer.Serialize(ticks);
```

It turns out the issue is with the IEnumerable type used! If I use an array, serialization works; if I use a list, it throws, e.g.

```csharp
Hyper.HyperSerializer.Serialize(dataStructLong.Take(1).ToArray()); // no exception
Hyper.HyperSerializer.Serialize(dataStructLong.Take(1).ToList());  // exception
```
The second issue I've identified is connected to `HyperSerializer` warmup.

For example, the following code runs in 1420 ms in release mode (`teatime`: 540 ms):

```csharp
var sw = Stopwatch.StartNew();
using (var fs = File.OpenWrite(customHyperFilePath))
{
    foreach (var item in dataStructLong)
    {
        fs.Write(Hyper.HyperSerializer.Serialize(item));
    }
}
sw.Stop();
var elapsedHyper = sw.ElapsedMilliseconds;
```

But with the following line added before the stopwatch, the `HyperSerializer` execution time is only 547 ms (`teatime`: 553 ms):

```csharp
Hyper.HyperSerializer.Serialize(dataStructLong.First());
```

It means that `HyperSerializer` performs better after it has been "run" previously. Could you comment on that?
Finally, in regards to your question about the use-case: I'm simply collecting OHLC data from popular exchanges (currently only 1) and trying to come up with a solution to persist the data. After a bunch of tests I've figured that it's easier and better to store the data in binary files by means of libraries such as `teafiles` or `HyperSerializer`. They perform substantially better than SQL or, god forbid, JSON or CSV. Not only are the reads/writes many times faster, but all other metrics are better as well (memory footprint, disk space usage, etc.).
Here are some results from previous tests (they did not include HS at that point):

[chart: File write performance (Serialization)]
[chart: File read performance (De-Serialization)]
The use-case actually involves saving a huge chunk of data once, then appending fresh data to the same files on a daily basis. It's all done synchronously. No frequent read access is required either.

To make a long story short, I just want to drop the tea-files library due to its inability to use classes. This causes some headaches in the backend service I am building :)
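For reference, here is a minimal sketch of the daily append flow I have in mind, assuming `HyperSerializer` emits a fixed-size payload per struct (the near-identical output file sizes suggest it does); the path and `GetTodaysCandles` are placeholder names:

```csharp
// Hedged sketch: append new records, then derive the total record count
// from the file length. Assumes each serialized candle occupies a fixed
// number of bytes (4 ints = 16 here; verify against the actual output).
const int RecordSize = 4 * sizeof(int);
string path = @"e:\temp\hyper.bin";                        // placeholder path
CandlestickLongStruct[] freshCandles = GetTodaysCandles(); // hypothetical source

using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
{
    foreach (var candle in freshCandles)
    {
        fs.Write(Hyper.HyperSerializer.Serialize(candle));
    }
}

// Total number of records, without keeping a separate header or counter.
long recordCount = new FileInfo(path).Length / RecordSize;
```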
Regarding the warmup: the first time HyperSerializer is used to serialize or deserialize a type, it generates a dynamic in-memory assembly containing a type that's optimized to serialize the object. Just create an initialization function in your application startup (Program.cs or wherever) that makes a call using each type that you want to serialize.
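For example, a minimal warmup sketch along those lines (`SerializerWarmup` is a hypothetical helper name; the candle types are the ones from the benchmark script above):

```csharp
// Hypothetical startup helper: the first Serialize call for each type
// triggers generation of the dynamic in-memory serializer assembly, so
// paying that one-time cost here keeps it out of any timed/hot path.
public static class SerializerWarmup
{
    public static void Init()
    {
        Hyper.HyperSerializer.Serialize(new CandlestickLongStruct());
        Hyper.HyperSerializer.Serialize(new CandlestickLongClass());
    }
}
```

Call `SerializerWarmup.Init()` once at startup, before any benchmarked or latency-sensitive code runs.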
Regarding how to store the serialized objects, you have a few options:

I highly suggest the last option. I use Microsoft FASTER with HyperSerializer. It's a complicated beast but very powerful, and the fastest KV store I've found by several orders of magnitude. It's a memory-mapped file store and has the flexibility to be used as a multi-value dictionary, with a native mechanism that allows for key -> values chaining. In other words, each time you add a value (called an upsert) to the store, you can configure its callback functions to store the new record at a new memory address for the key without overwriting the existing value, creating a reference chain to all prior values for the same key. It allows you to read, write, and update values in terabyte files in microseconds.

Using HyperSerializer with FASTER, I can write 10 million records to SSD in about 3 seconds. Happy to give you access to my private repo if you want to take a look at how I used it. LMK.
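For context, a minimal FASTER sketch - not the private-repo code, just the plain `Microsoft.FASTER.Core` hello-world pattern with `long` keys/values; the HyperSerializer integration and the key -> values chaining via custom callback functions are omitted:

```csharp
using System;
using FASTER.core;

// Minimal FasterKV usage sketch: memory-mapped hybrid log on disk,
// one session, one upsert, one read.
class FasterSketch
{
    static void Main()
    {
        using IDevice log = Devices.CreateLogDevice(@"e:\temp\hlog.log");
        using var store = new FasterKV<long, long>(
            1L << 20, new LogSettings { LogDevice = log });

        using var session = store
            .For(new SimpleFunctions<long, long>())
            .NewSession<SimpleFunctions<long, long>>();

        session.Upsert(42L, 1337L);             // "upsert" = add or update
        var (status, value) = session.Read(42L);
        Console.WriteLine($"{status}: {value}");
    }
}
```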
I would like to use this library to save/append candlestick trading data (OHLCV) to files.

After looking at the advertised performance I tried using this library, but it's just not as fast as simply saving the data to a binary file. I would like to know whether I am misusing the library or it's simply not meant for the given use-case. The test data set consists of 3_020_871 records.

The test code looks like this:
For performance comparison, I am also using the `teafiles` library (which basically is a wrapper for brute-force binary serialization). The results are:
It does seem like `HyperSerializer` produces an output file of almost exactly the same size, but the performance is much worse. The aim of this post, of course, is not to compare this lib to others. I genuinely want to replace the teafiles library and am looking for a better solution. I would appreciate feedback on the performance issue.
For the sake of completeness, here's the serialized data type. Versions: .NET 7, HyperSerializer 1.4.0, Rubble.TeaFiles.Net 2.0.0.
Additionally, I would like to know how you would solve the problem of appending data to an existing file, and also getting/updating the total number of records in the file.