databento / dbn

Databento Binary Encoding (DBN) - Fast message encoding and storage format for market data
https://databento.com
Apache License 2.0

Allow concatenation of several DBN files in a single data stream #54

Open schrodervictor opened 4 months ago

schrodervictor commented 4 months ago

Feature Request: Concatenation of DBN files in a single stream

When working with high-granularity data, it is common to have directories with hundreds or even thousands of DBN files, each one the time-series continuation of the previous one. Very often, the contents of these files need to be combined into a single stream of data, e.g., to window the data properly or to serve as input to a machine learning training task.

While experimenting with the dbn CLI tool and with the Python library, I wasn't able to find any command or helper function that achieves this easily. My workaround was to load the DBN files one by one, convert each one to a Pandas DataFrame, and use the pd.concat function to merge them all into a single DataFrame. However, this process is slow and memory intensive, and it creates multiple intermediate Pandas DataFrames just to end up with one single stream. Also, because the data has to be converted into a Pandas DataFrame, none of the benefits of DBN files are available in this situation.

Current behavior

Passing multiple input files and a single output destination to the CLI is not supported:

$ dbn ./file-00.dbn.zst ./file-01.dbn.zst --out combined.dbn
error: unexpected argument './file-01.dbn.zst' found

Loading several files at once with the Python library is also not supported:

>>> import databento as db
>>> data = db.DBNStore.from_file('./*.dbn.zst')

FileNotFoundError                         Traceback (most recent call last)
Cell In[1], line 4
      1 import os
      2 import databento as db
----> 4 data = db.DBNStore.from_file('./*.dbn.zst')

File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:649, in DBNStore.from_file(cls, path)
    627 @classmethod
    628 def from_file(cls, path: PathLike[str] | str) -> DBNStore:
    629     """
    630     Load the data from a DBN file at the given path.
    631 
   (...)
    647 
    648     """
--> 649     return cls(FileDataSource(path))

File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:145, in FileDataSource.__init__(self, source)
    142 self._path = Path(source)
    144 if not self._path.is_file() or not self._path.exists():
--> 145     raise FileNotFoundError(source)
    147 if self._path.stat().st_size == 0:
    148     raise ValueError(
    149         f"Cannot create data source from empty file: {self._path.name}",
    150     )

FileNotFoundError: ./*.dbn.zst

The same happens with a list of files:

>>> import databento as db
>>> data = db.DBNStore.from_file(['./file-00.dbn.zst', './file-01.dbn.zst'])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 4
      1 import os
      2 import databento as db
----> 4 data = db.DBNStore.from_file(['./file-00.dbn.zst', './file-01.dbn.zst'])

File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:649, in DBNStore.from_file(cls, path)
    627 @classmethod
    628 def from_file(cls, path: PathLike[str] | str) -> DBNStore:
    629     """
    630     Load the data from a DBN file at the given path.
    631 
   (...)
    647 
    648     """
--> 649     return cls(FileDataSource(path))

File /opt/conda/.../python3.11/site-packages/databento/common/dbnstore.py:142, in FileDataSource.__init__(self, source)
    141 def __init__(self, source: PathLike[str] | str):
--> 142     self._path = Path(source)
    144     if not self._path.is_file() or not self._path.exists():
    145         raise FileNotFoundError(source)

File /opt/conda/.../python3.11/pathlib.py:871, in Path.__new__(cls, *args, **kwargs)
    869 if cls is Path:
    870     cls = WindowsPath if os.name == 'nt' else PosixPath
--> 871 self = cls._from_parts(args)
    872 if not self._flavour.is_supported:
    873     raise NotImplementedError("cannot instantiate %r on your system"
    874                               % (cls.__name__,))

File /opt/conda/.../python3.11/pathlib.py:509, in PurePath._from_parts(cls, args)
    504 @classmethod
    505 def _from_parts(cls, args):
    506     # We need to call _parse_args on the instance, so as to get the
    507     # right flavour.
    508     self = object.__new__(cls)
--> 509     drv, root, parts = self._parse_args(args)
    510     self._drv = drv
    511     self._root = root

File /opt/conda/.../python3.11/pathlib.py:493, in PurePath._parse_args(cls, args)
    491     parts += a._parts
    492 else:
--> 493     a = os.fspath(a)
    494     if isinstance(a, str):
    495         # Force-cast str subclasses to str (issue #21127)
    496         parts.append(str(a))

TypeError: expected str, bytes or os.PathLike object, not list

Expected behavior

The commands and function calls above should work as intended, meaning:

Added Value

If the dbn command-line tool provided an easy way to merge multiple DBN files into a single one, the problem reported above could be solved with a simple preprocessing step: merge all the necessary files, then load the result as a single stream (for example, in a Python application).

If the library functions were adapted to load multiple files at once, the benefit would be even greater, since the final result could be achieved from within the programming language itself.

threecgreen commented 4 months ago

Hi, we have a roadmap item for supporting merging DBN files in the CLI and client libraries.

You can create a memory-efficient stream by chaining iterators over the separate DBNStores like so:

from itertools import chain

from databento import DBNStore

# foo is a placeholder for whatever per-record processing you need
for record in chain(DBNStore.from_file('file1.dbn'), DBNStore.from_file('file2.dbn')):
    foo(record)

With glob.glob() and some file name sorting, you can get a lot of what you suggested.
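
To make the glob.glob() suggestion concrete, here is a minimal sketch that extends the chain example to a whole directory. The helper name iter_records and its load parameter are hypothetical, not part of the databento library. The sketch assumes that the file names sort lexicographically in chronological order (as file-00, file-01, ... do) and that all files share the same schema.

```python
from glob import glob
from itertools import chain
from typing import Callable, Iterable, Iterator, Optional


def iter_records(
    pattern: str,
    load: Optional[Callable[[str], Iterable]] = None,
) -> Iterator:
    """Lazily yield records from every file matching `pattern`.

    Hypothetical helper, not part of databento. `load` turns a path into
    an iterable of records and defaults to DBNStore.from_file.
    """
    if load is None:
        # Imported here so the helper can be tried with another loader
        # even when databento is not installed.
        from databento import DBNStore

        load = DBNStore.from_file

    # Assumption: lexicographic order of file names is chronological order.
    paths = sorted(glob(pattern))

    # map() is lazy, so each file is opened only once the previous one
    # has been fully consumed.
    return chain.from_iterable(map(load, paths))
```

With databento installed, iterating over iter_records('./file-*.dbn.zst') should behave like the chain example above applied to every matching file, without building any intermediate DataFrames.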