colabfit / colabfit-tools

Tools for constructing and manipulating datasets for fitting interatomic potentials
Other
6 stars 2 forks source link

Change how transformations are applied #4

Closed jvita closed 2 years ago

jvita commented 2 years ago

The current method of applying transformations to a dataset with apply_transform() seems dangerous. In particular, it seems problematic because it would be easy to have a line in a script that repeatedly modifies the data every time the script is run.

A better idea might be to let get_data() (and all related functions, like plot_histograms() take in a transformations argument that applies the transformations to the data before it returns it. This is safer because it isn't modifying the underlying Mongo database, but would be slower because it has to re-apply the transformations on every call of get_data().

jvita commented 2 years ago

Alternatively, just get rid of apply_transforms() completely and leave it up to the user to transform their data...

mjwen commented 2 years ago

Agreed, not a good idea to modify the underlying mongo data. But having the possibility to apply user-provided transformation seems good. So, I prefer your idea to allow it via get_data(). Maybe we can even ask a user to provide the transform function(s) at the instantiation of the dataset, like what torch vision does?

dskarls commented 2 years ago

Maybe the simplest way to support transformations would be to have them be functions that take Configuration objects as inputs and either mutate them in-place or return new, transformed Configuration objects as output (if you require more structure, have them be classes with such a method defined on them). At least, this approach makes sense if the only thing you need to perform a single transformation is a Configuration and none of the additional data structure in mongo is relevant. Alterations to mongo would all have to be done manually by the user this way, which is an inconvenience but obviously prevents the gotchas. Any thoughts?

jvita commented 2 years ago

My thoughts on the above comments:

dskarls commented 2 years ago

@jvita I'd be ok with doing away with transformations and re-adding something similar in the future upon community demand. My point was they're something that should be defined at the finest-grained level possible, which in this case is Configuration objects. They shouldn't have any knowledge of or access to what a Property Definition is or any of the other data structures or mongo stuff. If you wanted, you could have them be methods of the Configuration class or define them as standalone functions in a module somewhere. However, it's practically raw data manipulation at that point so I'm not sure there are enough commonly used transformations to make it worth the hassle.

jvita commented 2 years ago

I added a transform argument to insert_data which allows the user to provide a function which will be called to modify a Configuration in-place before it's added to the database. As @dskarls pointed out this is little more than raw data manipulation, but it makes it easier to use insert_data's built-in parallelization rather than a user having to parallelize the transformation themselves.