A Python library defining data structures optimized for machine learning pipelines
cgnal-core is a Python package with a modular design that provides abstractions for building data ingestion pipelines and running end-to-end machine learning pipelines. The library offers a lightweight object-oriented interface to MongoDB as well as pandas-based data structures. Its aim is to support the development of machine learning applications, with a focus on clean code and modular design.
Some cool features that we are proud to mention are:
Offers the following data structures:
From the PyPI server
pip install cgnal-core
From source
git clone https://github.com/CGnal/cgnal-core
cd cgnal-core
make install
make tests
To run the predefined checks (unit tests, linting, formatting and static typing):
make checks
Creating a Database of Table objects
import pandas as pd
from cgnal.core.data.layer.pandas.databases import Database
# sample df
df1 = pd.DataFrame([[1, 2, 3], [6, 5, 4]], columns=['a', 'b', 'c'])
# creating a database
db = Database('/path/to/db')
table1 = db.table('df1')
# write table to path
table1.write(df1)
# get path
print(table1.filename)
# convert to pandas dataframe
table1.to_df()
# get table from database
db['df1']
Using an Archiver with Dao objects
from cgnal.core.data.layer.pandas.archivers import CsvArchiver
from cgnal.core.data.layer.pandas.dao import DataFrameDAO
# create a dao object
dao = DataFrameDAO()
# create a csv archiver
arch = CsvArchiver('/path/to/csvfile.csv', dao)
# get pandas dataframe
print(arch.data.head())
# retrieve a single document object
doc = next(arch.retrieve())
# retrieve a list of document objects
docs = list(arch.retrieve())
# retrieve a document by its id
arch.retrieveById(doc.uuid)
# archive a single document
doc = next(arch.retrieve())
# update column_name field of the document with the given value
doc.data.update({'column_name': 'VALUE'})
# archive the document
arch.archiveOne(doc)
# archive list of documents
arch.archiveMany([doc, doc])
# get a document object as a pandas series
arch.dao.get(doc)
Creating a PandasDataset object
import pandas as pd
import numpy as np
from cgnal.core.data.model.ml import PandasDataset
dataset = PandasDataset(
    features=pd.concat([
        pd.Series([1, np.nan, 2, 3], name="feat1"),
        pd.Series([1, 2, 3, 4], name="feat2"),
    ], axis=1),
    labels=pd.Series([0, 0, 0, 1], name="Label"),
)
# access features as a pandas dataframe
print(dataset.features.head())
# access labels as pandas dataframe
print(dataset.labels.head())
# access features as a python dictionary
dataset.getFeaturesAs('dict')
# access features as numpy array
dataset.getFeaturesAs('array')
# indexing operations
# access features and labels at the given index as a pandas dataframe
print(dataset.loc([2]).features.head())
print(dataset.loc([2]).labels.head())
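To get a feel for the data behind these calls without installing cgnal-core, the sketch below rebuilds the same features and labels with plain pandas and shows rough analogues of the dictionary, array and index-based views. It does not call cgnal-core, and it assumes only that PandasDataset wraps pandas objects like these; the exact return formats of getFeaturesAs may differ.

```python
import numpy as np
import pandas as pd

# Same features and labels as in the PandasDataset example above
features = pd.concat(
    [
        pd.Series([1, np.nan, 2, 3], name="feat1"),
        pd.Series([1, 2, 3, 4], name="feat2"),
    ],
    axis=1,
)
labels = pd.Series([0, 0, 0, 1], name="Label")

# Plain-pandas analogues of the dictionary and array views
# (assumption: not necessarily the format getFeaturesAs returns)
features_dict = features.to_dict(orient="index")
features_array = features.to_numpy()

# Label-based selection of the row whose index is 2
row_features = features.loc[[2]]
row_labels = labels.loc[[2]]

print(features_array.shape)
print(row_features)
print(row_labels)
```

Passing a list to `.loc` keeps the result as a DataFrame (or Series) rather than collapsing it to a single row or scalar.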
Creating a PandasTimeIndexedDataset object
import pandas as pd
import numpy as np
from cgnal.core.data.model.ml import PandasTimeIndexedDataset
dateStr = [str(x) for x in pd.date_range('2010-01-01', '2010-01-04')]
dataset = PandasTimeIndexedDataset(
    features=pd.concat([
        pd.Series([1, np.nan, 2, 3], index=dateStr, name="feat1"),
        pd.Series([1, 2, 3, 4], index=dateStr, name="feat2"),
    ], axis=1)
)
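The example above builds its index from date strings. As a plain-pandas sketch (no cgnal-core calls; whether PandasTimeIndexedDataset converts the strings itself is an assumption to verify against the library), the same features frame can be given a DatetimeIndex, which enables date-based slicing:

```python
import numpy as np
import pandas as pd

# Date strings as in the example above, e.g. '2010-01-01 00:00:00'
date_str = [str(x) for x in pd.date_range("2010-01-01", "2010-01-04")]

features = pd.concat(
    [
        pd.Series([1, np.nan, 2, 3], index=date_str, name="feat1"),
        pd.Series([1, 2, 3, 4], index=date_str, name="feat2"),
    ],
    axis=1,
)

# Convert the string index to a DatetimeIndex
features.index = pd.to_datetime(features.index)

# Slice by date; both endpoints are included for a sorted DatetimeIndex
subset = features.loc["2010-01-02":"2010-01-03"]
print(subset)
```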
We welcome contributions of any kind, whether bug reports, bug fixes, additions to the existing codebase or improvements to the documentation.
Please look at the GitHub issues tab to start working on open issues.
Please make sure the general guidelines for contributing to the codebase are respected.