a Python framework for generating streams of labelled data
Motivation • Idea • Installation • Examples • Documentation • Acknowledgement
Most machine learning systems rely on stationary, labeled, balanced and large-scale datasets. Incremental learning (IL), also referred to as lifelong learning (LL) or continual learning (CL), extends the traditional paradigm to work in dynamic and evolving environments. This requires such systems to acquire and preserve knowledge continually.
Existing CL frameworks like avalanche[^1] or continuum[^2] construct data streams by splitting large datasets into multiple experiences, which has a few disadvantages:
To answer different research questions in the field of CL, researchers need knowledge and control over:
A more economical alternative to collecting and labelling streams with desired properties is the generation of synthetic streams[^6]. Notable efforts in this direction include augmentation-based dataset generation like ImageNet-C[^3] and simulation-based approaches like EndlessCLSim[^4], where semantically labeled street-view images are generated by a game engine that procedurally builds the city environment and simulates drift by modifying parameters (like weather and illumination conditions) over time.
This project builds on these ideas and presents a general framework for generating streams of labeled samples.
This section introduces the main ideas and building blocks of the `streamgen` framework.
There exists only a limited number of distributions one can directly sample from (e.g., a Gaussian distribution).
Instead of generating samples directly from a distribution, researchers often work with collected sets of samples. A common practice to increase the variability of such datasets is the use of stochastic transformations in a sequential augmentation pipeline:
```python
import numpy as np
from torchvision.transforms import v2

transforms = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    # ...
])

while generating_data:
    # option 1 - sample from a dataset
    sample = np.random.choice(dataset)

    # option 2 - sample from a distribution
    sample = np.random.randn(...)

    augmented_sample = transforms(sample)
```
Combined with an initial sampler that either draws from a dataset or directly from a distribution, these chained transformations can represent complex distributions[^7].
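As a minimal illustration of this idea (using only NumPy, not the `streamgen` API): composing a Gaussian sampler with a single deterministic transformation already yields samples from a distribution we cannot sample from directly with one call — here, a log-normal distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def sampler() -> float:
    """Draw from a distribution we can sample directly (standard normal)."""
    return rng.normal(loc=0.0, scale=1.0)

def transform(x: float) -> float:
    """Deterministic transformation: exp maps a normal sample to a log-normal one."""
    return float(np.exp(x))

# composing sampler and transform yields log-normal samples
samples = [transform(sampler()) for _ in range(10_000)]
```

Chaining more (stochastic) transformations in the same way progressively enriches the distribution the pipeline represents.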
One shortcoming of this approach is that it can only generate samples from a single distribution; different class distributions are not representable.
One solution to this problem is the use of a tree (or other directed acyclic graph (DAG)) data structure to store the transformations.
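A minimal sketch of this idea in plain Python (hypothetical names, not the actual `streamgen` API): each node holds a transformation, branching nodes choose a child according to branch probabilities, and each root-to-leaf path defines one class distribution.

```python
import random
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    """A node in a transformation tree (illustrative sketch only)."""
    transform: Callable[[float], float]
    children: list["Node"] = field(default_factory=list)
    probs: list[float] = field(default_factory=list)  # branch probabilities

    def sample(self, x: float) -> float:
        x = self.transform(x)
        if not self.children:
            return x
        # pick one branch according to the branch probabilities
        child = random.choices(self.children, weights=self.probs)[0]
        return child.sample(x)

# two leaves -> two different class distributions from one tree
tree = Node(
    transform=lambda x: x + random.gauss(0.0, 1.0),  # shared noise transform
    children=[
        Node(transform=lambda x: x + 10.0),  # class A: shifted right
        Node(transform=lambda x: x - 10.0),  # class B: shifted left
    ],
    probs=[0.5, 0.5],
)

samples = [tree.sample(0.0) for _ in range(1_000)]
```

The branch probabilities double as class priors, which is what makes imbalanced or evolving class distributions expressible in the same structure.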
If we want to model evolving distributions (streams), we either need to change the parameters of the stochastic transformations or the topology of the tree over time.
Currently, `streamgen` does not support scheduling topological changes (like adding branches and nodes), but by unrolling these changes over time into one static tree, topological changes can be modelled purely with branch probabilities.
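As an illustration of this unrolling idea (plain Python, not the `streamgen` API): a branch that should "appear" later in the stream is present in the static tree from the start, but its branch probability is scheduled from zero to a positive value.

```python
import random

def branch_probability(step: int, appear_at: int) -> float:
    """Probability of the 'new' branch: 0 before appear_at, 0.5 afterwards."""
    return 0.0 if step < appear_at else 0.5

def sample(step: int) -> str:
    # the tree topology is static; only the branch probability changes over time
    p_new = branch_probability(step, appear_at=500)
    return "new_class" if random.random() < p_new else "old_class"

stream = [sample(step) for step in range(1_000)]
```

Before step 500 the new branch is unreachable, so the static tree behaves exactly as if the branch were added at that point in the stream.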
💡 the directed acyclic graph above is not a tree anymore due to the merging of certain branches. Because these merges are very convenient in certain scenarios, `streamgen` supports the definition of such trees by copying the paths below the merge to every branch before the merge. For an example of this, have a look at `examples/time series classification/04-multi-label-generation.ipynb`.
The proposed tree structure can model all three common data drift scenarios by scheduling the parameters of the transformations at specific nodes.
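A hedged sketch of parameter scheduling (illustrative only; the real `streamgen` API and its schedule helpers differ): a schedule updates a transformation parameter at every time step, so the same pipeline gradually produces a drifting distribution.

```python
import random

def make_shift_schedule(start: float, end: float, num_steps: int):
    """Linearly interpolate a shift parameter over the stream (hypothetical helper)."""
    for step in range(num_steps):
        yield start + (end - start) * step / (num_steps - 1)

def noisy_sample(shift: float) -> float:
    """A simple stochastic transformation whose parameter is scheduled."""
    return shift + random.gauss(0.0, 0.1)

# gradual drift: the mean of the generated samples moves from 0 to 5
stream = [noisy_sample(shift) for shift in make_shift_schedule(0.0, 5.0, 1_000)]
```

Sudden or recurring drift follows the same pattern with a step-shaped or periodic schedule instead of a linear one.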
The graph visualizations require Graphviz to be installed on your system. Depending on your operating system and package manager, you might try one of the following options:

```shell
# Debian/Ubuntu
sudo apt-get install graphviz

# Windows (Chocolatey)
choco install graphviz

# macOS (Homebrew)
brew install graphviz
```
The basic version of the package can be installed from PyPI with:

```shell
pip install streamgen
```
`streamgen` provides a few (pip) extras:
| extras group | needed for | additional dependencies |
|---|---|---|
| `examples` | running the example notebooks with their application-specific dependencies | `perlin-numpy`, `polars` |
| `cl` | continual learning frameworks | `continuum` |
| `all` | shortcut for installing every extra | * |
To install the package with specific extras, execute:

```shell
pip install streamgen[<name_of_extra>]
```
🧑‍💻 to install a development environment (which you need if you want to work on the package, instead of just using it), `cd` into the project's root directory and call:

```shell
poetry install --sync --compile --all-extras
```
There are example notebooks showcasing and explaining `streamgen` features:
Here is a preview of what we will create in the time series examples:
The documentation is hosted on GitHub Pages. To build and view it locally, call `poe docs_local`.
Made with ❤️ by Laurenz Farthofer.
This work was funded by the Austrian Research Promotion Agency (FFG, Project No. 905107).
Special thanks to Benjamin Steinwender, Marius Birkenbach and Nikolaus Neugebauer for their valuable feedback.
I want to thank Infineon and KAI for letting me work on and publish this project.
Finally, I want to thank my university supervisors Thomas Pock and Marc Masana for their guidance.
The art in the banner of this README is licensed under a Creative Commons Attribution-NonCommercial-No Derivatives Works 3.0 License. It was made by th3dutchzombi3. Check out his beautiful artwork ❤️
[^1]: V. Lomonaco et al., "Avalanche: an End-to-End Library for Continual Learning," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA: IEEE, Jun. 2021, pp. 3595–3605. doi: 10.1109/CVPRW53098.2021.00399.
[^2]: A. Douillard and T. Lesort, "Continuum: Simple Management of Complex Continual Learning Scenarios." arXiv, Feb. 11, 2021. doi: 10.48550/arXiv.2102.06253.
[^3]: D. Hendrycks and T. Dietterich, "Benchmarking Neural Network Robustness to Common Corruptions and Perturbations." arXiv, Mar. 28, 2019. doi: 10.48550/arXiv.1903.12261.
[^4]: T. Hess, M. Mundt, I. Pliushch, and V. Ramesh, "A Procedural World Generation Framework for Systematic Evaluation of Continual Learning." arXiv, Dec. 13, 2021. doi: 10.48550/arXiv.2106.02585.
[^5]: M.-J. Wu, J.-S. R. Jang, and J.-L. Chen, "Wafer Map Failure Pattern Recognition and Similarity Ranking for Large-Scale Data Sets," IEEE Transactions on Semiconductor Manufacturing, vol. 28, no. 1, pp. 1–12, Feb. 2015.
[^6]: J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, "Learning under Concept Drift: A Review," IEEE Trans. Knowl. Data Eng., pp. 1–1, 2018. doi: 10.1109/TKDE.2018.2876857.
[^7]: "Function composition," Wikipedia. Feb. 16, 2024. Accessed: Apr. 17, 2024. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Function_composition&oldid=1207989326