Closed: shoubhik closed this issue 7 years ago
Looking at the latest code, it seems that whenever a Dataframe is created backed by MapDB, it creates a temp DB file. The temp files get deleted when the JVM shuts down.
However, there are many use cases where one would like to deserialize a large dataset into a disk-backed map and later try out multiple algorithms on that Dataframe. This is especially helpful during the experimentation phase, when the input dataset takes a long time to translate into a Dataframe.
What are your thoughts? If implemented, what should the design look like? I can contribute this feature if it makes sense.
What you're describing makes sense, but unfortunately it is not currently supported. A workaround would be to store the records externally from the Dataframe and then read them back and load them into the Dataframe when necessary.
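The workaround above can be sketched with plain JDK serialization: tokenize once, persist the resulting records to disk, and reload them for later experiments. The `Record` class below is a hypothetical stand-in (the real Datumbox `Record` has a different shape), and the file path and method names are illustrative, not part of any library API.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class RecordCache {
    // Hypothetical minimal record; the real Datumbox Record class differs.
    public static class Record implements Serializable {
        final Map<String, Double> features;
        final String label;
        Record(Map<String, Double> features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    // Write the full list of records once, after the expensive tokenization step.
    static void save(List<Record> records, Path path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(path)))) {
            out.writeObject(new ArrayList<>(records));
        }
    }

    // Reload the records for a later experiment, skipping tokenization entirely.
    @SuppressWarnings("unchecked")
    static List<Record> load(Path path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(path)))) {
            return (List<Record>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path cache = Files.createTempFile("records", ".bin");
        List<Record> records = List.of(
                new Record(Map.of("token:spam", 1.0), "spam"),
                new Record(Map.of("token:ham", 1.0), "ham"));
        save(records, cache);
        List<Record> reloaded = load(cache);
        System.out.println(reloaded.size());       // 2
        System.out.println(reloaded.get(0).label); // spam
        Files.delete(cache);
    }
}
```

On each new run you would call `load(...)` and feed the records into a fresh Dataframe, paying the tokenization cost only once.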
Adding the feature is not that straightforward; I'll need to think through the design, as you said. I'll keep you posted.
This feature is implemented in the experimental branch. It will be released in version 0.8.0.
All the tests passed and the feature was added to the develop branch. A stable release is expected within a couple of weeks.
How do I serialize a Dataframe efficiently (records in bulk) to disk with MapDB? My use case: I have a large dataset for text classification, and it takes a long time to deserialize and tokenize the text. I want to run multiple experiments without having to redo the tokenization that converts the text into Record instances.
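Until persistent Dataframe storage lands, one hedged approach for bulk serialization is to stream records to disk one at a time, so the dataset is never held in memory twice. The sketch below uses JDK object streams rather than MapDB itself; the `Record` class, method names, and the null end-of-stream sentinel are all assumptions for illustration.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.Consumer;

public class BulkRecordIO {
    // Hypothetical record holding already-tokenized text; not the Datumbox class.
    public static class Record implements Serializable {
        final String text;
        final String label;
        Record(String text, String label) {
            this.text = text;
            this.label = label;
        }
    }

    // Stream records out one by one; reset() drops the stream's back-reference
    // table after each write so memory stays flat on very large datasets.
    static void writeAll(Iterator<Record> records, Path path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(path)))) {
            while (records.hasNext()) {
                out.writeObject(records.next());
                out.reset();
            }
            out.writeObject(null); // end-of-stream sentinel (an assumption of this sketch)
        }
    }

    // Stream records back in, handing each one to the caller as it is read.
    static void readAll(Path path, Consumer<Record> consumer)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(path)))) {
            Object o;
            while ((o = in.readObject()) != null) {
                consumer.accept((Record) o);
            }
        }
    }
}
```

The consumer in `readAll` could add each record straight into a new Dataframe, so reloading scales with disk throughput rather than tokenization cost.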