datumbox / datumbox-framework

Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications.
http://www.datumbox.com/
Apache License 2.0

Serialize Dataframe #12

Closed: shoubhik closed this issue 7 years ago

shoubhik commented 8 years ago

How do I efficiently serialize a Dataframe (records in bulk) to disk with MapDB? My use case: I have a large dataset for text classification, and it takes a long time to deserialize and tokenize the text. I want to run multiple experiments without having to tokenize again to convert the text to Record instances.

shoubhik commented 8 years ago

Looking at the latest code, it seems that whenever a Dataframe backed by MapDB is created, it creates a temp DB file. The temp files are deleted when the JVM shuts down.

However, there are many use cases where one would like to deserialize a large dataset into a disk-backed map and later try out multiple algorithms on that Dataframe. This is especially helpful during the experimentation phase if the input dataset takes a long time to translate into a Dataframe.

What are your thoughts? If implemented, what should the design look like? I can contribute this feature if it makes sense.

datumbox commented 8 years ago

What you are describing makes sense, but unfortunately it is not currently supported. A workaround would be to store the records outside the Dataframe, then read them and load them back into the Dataframe when necessary.

Adding the feature is not that straightforward; I'll need to check out the design, as you said. I'll keep you posted.
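
For illustration only, the workaround above (keeping the tokenized data outside the Dataframe) might look roughly like the sketch below. It uses plain Java serialization and nothing from the Datumbox API: the `TokenCache` class, its `save` and `load` methods, the whitespace tokenizer, and the temp file name are all hypothetical. Turning the cached tokens back into Record instances and adding them to a Dataframe is left out, since that part depends on the framework version in use.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Datumbox): caches tokenized documents on disk. */
final class TokenCache {

    /** Writes all tokenized documents to a single file in one pass. */
    static void save(Path file, List<List<String>> tokenizedDocs) throws IOException {
        // Copy into ArrayLists so that every container is guaranteed to be Serializable.
        ArrayList<ArrayList<String>> copy = new ArrayList<>();
        for (List<String> doc : tokenizedDocs) {
            copy.add(new ArrayList<>(doc));
        }
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(copy);
        }
    }

    /** Reads the tokenized documents back without re-running tokenization. */
    @SuppressWarnings("unchecked")
    static List<List<String>> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return new ArrayList<>((ArrayList<ArrayList<String>>) in.readObject());
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the expensive step: a naive whitespace tokenizer (assumption).
        List<String> rawTexts = Arrays.asList("first example document", "second one");
        List<List<String>> tokenized = new ArrayList<>();
        for (String text : rawTexts) {
            tokenized.add(Arrays.asList(text.split("\\s+")));
        }

        Path cache = Files.createTempFile("tokenized-docs", ".ser");
        save(cache, tokenized);

        // A later experiment can start here and skip tokenization entirely.
        List<List<String>> restored = load(cache);
        System.out.println("Restored " + restored.size() + " tokenized documents");

        Files.deleteIfExists(cache);
    }
}
```

For very large datasets, a single serialized list may not fit in memory; writing one document per `writeObject` call, or using a line-based text format, would be a reasonable variation on the same idea.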

datumbox commented 7 years ago

This feature is implemented in the experimental branch. It will be released in version 0.8.0.

datumbox commented 7 years ago

All the tests passed and the feature was added to the develop branch. A stable release is expected within a couple of weeks.