gemstone / pqdif

Gemstone PQDIF Library
https://gemstone.github.io/pqdif/
MIT License

Exception on very large datasets #3

Closed · gitkrasty closed this issue 2 years ago

gitkrasty commented 2 years ago

Hello,

With version 1.0.84 of the PQDIF library, we have hit a limit on the number of rows we can export.

[screenshots from the original issue omitted]

Is this a limit of the file format, or can it be extended by using another data type (unsigned, long, decimal, ...)?

What might be the recommended solution in such a scenario?

StephenCWills commented 2 years ago

This is indeed a limitation of the standard. At the start of every collection element, the standard encodes links for locating the current element relative to the start of the record and for locating the next element after the current one. Because every record body includes a single collection element at the top of the element hierarchy, this effectively limits the uncompressed size of every record in the file.

https://github.com/gemstone/pqdif/blob/f033700ad6d8a7908143ccd885a5d244a121875a/src/Gemstone.PQDIF/Physical/PhysicalWriter.cs#L260-L263
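
In practical terms, that means the uncompressed size of a record body has to stay within the range of those links. Below is a rough pre-flight estimate you could run before writing; it is only a sketch, `FitsInOneRecord` is a hypothetical helper rather than part of the Gemstone.PQDIF API, and it assumes the effective ceiling is `int.MaxValue` (~2 GB), consistent with the 2 GB figure mentioned in point 4 below.

```csharp
using System;

// Minimal sketch, not part of the Gemstone.PQDIF API. It assumes the element
// links are 4-byte signed offsets, which would cap an uncompressed record
// body at roughly int.MaxValue bytes (~2 GB).
static bool FitsInOneRecord(long sampleCount, int bytesPerSample, long elementOverhead = 4096)
{
    const long maxUncompressedRecordBody = int.MaxValue;

    // Estimated uncompressed size of the series data plus a rough allowance
    // for element headers, tags, and other metadata in the record body.
    long estimatedSize = sampleCount * bytesPerSample + elementOverhead;
    return estimatedSize <= maxUncompressedRecordBody;
}

// Example: 300 million 8-byte doubles (~2.4 GB) do not fit in a single record.
Console.WriteLine(FitsInOneRecord(300_000_000, sizeof(double))); // False
```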

Side note: I think you could potentially argue that the size restriction need not apply to the last collection element in any given collection, but it would perhaps be questionable to assume that any given PQDIF parser would support such a caveat.

It seems reasonable to assume that you are reaching this limit because you are encoding a large amount of data into one or more series instances in an observation. If so, there are a few mechanisms that can hopefully help you to work around this.

  1. Use StorageMethods.Increment for time series where the sampling rate is constant.
  2. Use StorageMethods.Scaled to compress the data in the series instance. For example, you can use 2-byte integers for the series values and 8-byte floating-point numbers for the scale and offset (see the sketch after this list).
  3. Use SeriesInstance.SeriesShareChannelIndex and SeriesInstance.SeriesShareSeriesIndex if you have any duplicate series instances in an observation record. The most common use case would be a time series where the sampling rate is not constant.
  4. Split the data into multiple observations or PQDIF files. Note that record headers also include a 4-byte link, relative to the start of the file, for finding the next record in the sequence, so you will still need to avoid creating PQDIF files in excess of 2 GB. However, the zlib compression may enable you to fit more data into your file before hitting the absolute file size limit.
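
To make option 2 concrete, here is a minimal sketch of the kind of quantization involved. It assumes the scaled storage method reconstructs each value as offset + scale × code; `QuantizeToInt16` is a hypothetical helper, not part of the Gemstone.PQDIF API, and attaching the resulting codes, scale, and offset to a series instance through SeriesInstance/SeriesDefinition is not shown.

```csharp
using System;
using System.Linq;

// Minimal sketch, assuming the scaled storage method reconstructs each value
// as value = offset + scale * storedCode. Wiring the codes, scale, and offset
// into SeriesInstance/SeriesDefinition through Gemstone.PQDIF is not shown.
static (short[] Codes, double Scale, double Offset) QuantizeToInt16(double[] values)
{
    double min = values.Min();
    double max = values.Max();

    // Map [min, max] onto the full Int16 range; fall back to a scale of 1
    // for a constant series to avoid dividing by zero.
    double scale = max > min ? (max - min) / (short.MaxValue - (double)short.MinValue) : 1.0;
    double offset = min - short.MinValue * scale;

    short[] codes = values
        .Select(v => (short)Math.Round((v - offset) / scale))
        .ToArray();

    return (codes, scale, offset);
}

// Usage: 2 bytes per stored sample instead of 8, at the cost of a quantization
// error of at most about half of one scale step per sample.
double[] samples = { 120.0, 120.5, 119.8, 121.2, 120.1 };
var (codes, scale, offset) = QuantizeToInt16(samples);
Console.WriteLine($"first value round-trips to {offset + scale * codes[0]:F3}");
```
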
gitkrasty commented 2 years ago

Thanks for the explanations.