Open schrodervictor opened 4 months ago
Hi, we have a roadmap item to support merging DBN files in the CLI and client libraries.
You can create a memory-efficient stream by chaining iterators over the separate DBNStores like so:
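from itertools import chain
from databento import DBNStore

# Iterate the stores back to back; records are yielded one at a time.
for record in chain(DBNStore.from_file('file1.dbn'), DBNStore.from_file('file2.dbn')):
    foo(record)  # foo is a placeholder for your own processing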
With glob.glob() and some file name sorting, you can get a lot of what you suggested.
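The glob-and-sort suggestion could be sketched roughly as follows. This is only an illustration: `natural_key` and `stream_records` are hypothetical helper names, not part of the databento library, and the function that opens each file is passed in as a parameter so the sketch makes no assumptions about the library beyond what is shown above.

```python
import glob
import re
from itertools import chain


def natural_key(path):
    """Sort key that compares digit runs numerically, so 'file2' sorts before 'file10'."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", path)]


def stream_records(pattern, open_store):
    """Lazily yield records from every file matching `pattern`, in natural name order.

    `open_store` (an assumption of this sketch) is any callable that maps a
    path to an iterable of records.
    """
    paths = sorted(glob.glob(pattern), key=natural_key)
    # chain.from_iterable consumes the generator on demand, so each file is
    # opened only after the previous one has been exhausted.
    return chain.from_iterable(open_store(path) for path in paths)
```

With the databento package installed, something like `stream_records('data/*.dbn', DBNStore.from_file)` should then behave like the two-file example above, for any number of files. If the file names embed zero-padded dates, a plain `sorted()` would likely be enough and the natural sort key could be dropped.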
Feature Request: Concatenation of DBN files in a single stream
When working with files with high granularity, it is common to have directories with hundreds or even thousands of DBN files. In such situations, each file is the time series continuation of the previous one. Very often, the contents of these files need to be combined into a single stream of data, for example to window the data properly or to serve as input to a machine learning training task.
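To make the windowing point concrete, here is a small sketch of a sliding window over a chained stream. The helper name `sliding_windows` is hypothetical, and two plain lists stand in for the records of two consecutive DBN files.

```python
from collections import deque
from itertools import chain


def sliding_windows(records, size):
    """Yield tuples of `size` consecutive records from any iterable."""
    window = deque(maxlen=size)
    for record in records:
        window.append(record)
        # Start yielding once the window has filled up; the deque then drops
        # the oldest record automatically on every append.
        if len(window) == size:
            yield tuple(window)


# Two in-memory lists stand in for two consecutive files.
file_a = [1, 2, 3]
file_b = [4, 5]
windows = list(sliding_windows(chain(file_a, file_b), size=3))
```

Here `windows` contains three tuples, and the last two mix records from both inputs. Windows like these cannot be built when each file is processed in isolation, which is why a single continuous stream is needed.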
While experimenting with the dbn CLI tool and with the Python library, I wasn't able to find any commands or helper functions to achieve this result easily. The workaround was to load the DBN files one by one, convert each one to a Pandas DataFrame and use the pd.concat function to merge them all into a single DataFrame. However, this process is slow, memory intensive and involves the creation of multiple intermediary Pandas DataFrames, just to have one single stream at the end. Also, because the data has to be converted into a Pandas DataFrame, none of the benefits of DBN files are available in this situation.
Current behavior
Trying to use the CLI passing multiple input files and a single output destination is not supported:
Trying to load several files at once from the Python library is also not supported:
Same with an array of files:
Expected behavior
The commands and function calls above should work as intended, meaning:
DBNStore.from_file in Python should accept either a glob pattern or a list of filenames from the file system, exposing in return a single stream of data from all the matching files in the provided sequence.
Added Value
If the dbn command line tool provides an easy way to convert multiple DBN files into a single one, the issue reported above can be solved by a simple preprocessing step where all the necessary files are merged, so they can be later loaded as a single stream (for example, in a Python application).
If the library functions are adapted to load multiple files at once, the benefits are even greater, as the final result would be achievable from the programming language itself.