elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Non-Serializable classes #44

Open valera7979 opened 6 years ago

valera7979 commented 6 years ago

It would be nice to add serialization support to the classes, in particular so that cluster models can be saved.

kno10 commented 6 years ago

Actually I do not think there is much use in serialization of cluster models. They are not predictive models that you would "deploy" to a "production pipeline", like a classifier.

I agree that in general it would be nice to have efficient serialization support. But this is a lot of very boring work, and we do not have volunteers to do it. So it is of very low priority and is likely not going to happen.

valera7979 commented 6 years ago

Thanks. About cluster model serialization: I worked on a task where I had to train a model, and then in another task I compared new data against the model created earlier. Because there was no serialization, I had to save the points belonging to each cluster and then rebuild the model from them for outlier detection. So I think serialization of the cluster model is also useful.
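To illustrate the kind of workaround described above, here is a minimal sketch in plain Java. It assumes a k-means-style model where each cluster can be summarized by a mean vector; the class `CentroidModel` and its methods are hypothetical names, not part of ELKI, and the choice of a cutoff for "outlier" is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an ELKI class): stores cluster means as text and scores new points. */
final class CentroidModel {
  final double[][] centroids;

  CentroidModel(double[][] centroids) {
    this.centroids = centroids;
  }

  /** One line per centroid, values separated by single spaces. */
  List<String> toLines() {
    List<String> lines = new ArrayList<>();
    for (double[] c : centroids) {
      StringBuilder sb = new StringBuilder();
      for (int d = 0; d < c.length; d++) {
        if (d > 0) sb.append(' ');
        sb.append(c[d]);
      }
      lines.add(sb.toString());
    }
    return lines;
  }

  /** Inverse of toLines(); assumes the same space-separated format. */
  static CentroidModel fromLines(List<String> lines) {
    double[][] cs = new double[lines.size()][];
    for (int i = 0; i < cs.length; i++) {
      String[] parts = lines.get(i).split(" ");
      cs[i] = new double[parts.length];
      for (int d = 0; d < parts.length; d++) {
        cs[i][d] = Double.parseDouble(parts[d]);
      }
    }
    return new CentroidModel(cs);
  }

  /** Euclidean distance from p to the closest centroid; large values suggest an outlier. */
  double distanceToNearest(double[] p) {
    double best = Double.POSITIVE_INFINITY;
    for (double[] c : centroids) {
      double sum = 0;
      for (int d = 0; d < p.length; d++) {
        double diff = p[d] - c[d];
        sum += diff * diff;
      }
      best = Math.min(best, Math.sqrt(sum));
    }
    return best;
  }
}
```

The lines returned by `toLines()` could be written to a file in the training task and read back with `fromLines()` in the later task. Comparing `distanceToNearest(point)` against a threshold of your choosing is one simple way to flag outliers, though it only makes sense for models that have meaningful cluster means.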

kno10 commented 6 years ago

The difficulty with a general solution is that the clusterings do not contain the data. They only hold object IDs, and these are not persistent. So any serializer would likely have to "join" the clusters with the original data, at which point it becomes a huge blob to serialize. For many applications you are much better off using your own serialization with exactly the format and data parts that you need (coordinates, labels, identifiers such as file names; there could be arbitrarily complex data associated with each object ID). For many clustering algorithms, you do not have much more than the object IDs (k-means is an exception, since it also gives you cluster means). This variability makes any generic serialization a real pain to design, and likely to break all the time.
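As a rough sketch of such a hand-rolled "join", the following plain Java combines cluster memberships with exactly the data parts one application might need and emits CSV lines. Everything here is an assumption for illustration: `ClusterCsvExport` is a hypothetical name, plain `int` keys stand in for the non-persistent object IDs, and labels are assumed not to contain commas. It does not use any ELKI API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not an ELKI class): joins cluster memberships with the data. */
final class ClusterCsvExport {
  /**
   * @param clusters one list of object IDs per cluster
   * @param coords   coordinates for each object ID
   * @param labels   a persistent identifier (e.g. a file name) for each object ID
   * @return CSV lines: header, then one "cluster,label,coordinates" row per object
   */
  static List<String> toCsvLines(List<List<Integer>> clusters,
      Map<Integer, double[]> coords, Map<Integer, String> labels) {
    List<String> lines = new ArrayList<>();
    lines.add("cluster,label,coordinates");
    for (int c = 0; c < clusters.size(); c++) {
      for (int id : clusters.get(c)) {
        StringBuilder sb = new StringBuilder();
        // The transient ID is replaced by the persistent label in the output.
        sb.append(c).append(',').append(labels.get(id)).append(',');
        double[] v = coords.get(id);
        for (int d = 0; d < v.length; d++) {
          if (d > 0) sb.append(' ');
          sb.append(v[d]);
        }
        lines.add(sb.toString());
      }
    }
    return lines;
  }
}
```

Note that the transient IDs never appear in the output; only the cluster index, the persistent label, and the coordinates are written. A different application would likely pick different columns, which is the variability that makes a generic format hard to pin down.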