Closed by jvita 2 years ago
Alternatively, just get rid of `apply_transforms()` completely and leave it up to the user to transform their data...
Agreed, it's not a good idea to modify the underlying Mongo data. But the ability to apply user-provided transformations seems useful, so I prefer your idea of allowing it via `get_data()`. Maybe we could even ask the user to provide the transform function(s) when instantiating the dataset, like torchvision does?
Maybe the simplest way to support transformations would be to have them be functions that take `Configuration` objects as inputs and either mutate them in-place or return new, transformed `Configuration` objects as output (if you require more structure, have them be classes with such a method defined on them). At least, this approach makes sense if the only thing you need to perform a single transformation is a `Configuration` and none of the additional data structure in mongo is relevant. Alterations to mongo would all have to be done manually by the user this way, which is an inconvenience but obviously prevents the gotchas. Any thoughts?
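To make the two styles concrete, here is a minimal sketch. The `Configuration` dataclass below is a hypothetical stand-in with made-up fields, not the colabfit-tools class, and `scale_in_place`, `scaled_copy`, and `Scale` are illustrative names only.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Configuration:
    """Hypothetical stand-in; NOT the colabfit-tools Configuration class."""
    positions: List[List[float]]
    cell_length: float


def scale_in_place(config: Configuration, factor: float = 2.0) -> None:
    """Mutating style: rescale the positions and cell of `config` itself."""
    config.positions = [[x * factor for x in atom] for atom in config.positions]
    config.cell_length *= factor


def scaled_copy(config: Configuration, factor: float = 2.0) -> Configuration:
    """Pure style: leave `config` untouched and return a transformed copy."""
    return replace(
        config,
        positions=[[x * factor for x in atom] for atom in config.positions],
        cell_length=config.cell_length * factor,
    )


class Scale:
    """Class-based variant for transformations that carry parameters."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def __call__(self, config: Configuration) -> Configuration:
        return scaled_copy(config, self.factor)


original = Configuration(positions=[[0.0, 0.5, 1.0]], cell_length=3.0)
doubled = Scale(2.0)(original)  # `original` is left unchanged
```

Either style only ever sees a single configuration object, so nothing in it can touch the database by accident.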
My thoughts on the above comments:
- `insert_dataset()` function, which just adds a document into the Mongo Database with pointers to the linked ConfigurationSets and Properties. `insert_dataset()` will only ever be called once the Properties already exist in the Database, therefore it wouldn't be a good point to apply transformations.
- `insert_data()` is when the Configurations and Properties are added into the Database. This could be a good time to apply transformations, and would be most similar to @dskarls's suggestion. However, this functionality would already be easy for a user to handle by writing a function that iterates over the generator/list returned by `load_data()`, applies the transformations in-place, and yields/returns the modified Configuration. I'm not against adding this functionality as a built-in for `insert_data()` though, as some users might find it convenient.
- `apply_transformations()` shouldn't exist, and `get_data()` should NOT accept a `transformations` argument. My rationale behind removing `apply_transformations()` is that once a Property is in the Mongo Database it's tied directly to a Property Definition, and the Property shouldn't be allowed to be transformed without updating the Property Definition that it's tied to. In the future, `apply_transformations()` could be re-written to take in new Property Definitions and perform these updates, but I don't think that's necessary right now. Better to encourage writing good Property Definitions and making sure that your data matches those definitions before calling `insert_data()`. The rationale for not adding a `transformations` argument to `get_data()` is that if a user wants to mess with the data, they should do it completely independently of colabfit-tools -- for example, by wrapping the Mongo client in a PyTorch Dataset, which could then be given its own transformation functions.

@jvita I'd be ok with doing away with transformations and re-adding something similar in the future upon community demand. My point was they're something that should be defined at the finest-grained level possible, which in this case is `Configuration` objects. They shouldn't have any knowledge of or access to what a Property Definition is or any of the other data structures or mongo stuff. If you wanted, you could have them be methods of the `Configuration` class or define them as standalone functions in a module somewhere. However, it's practically raw data manipulation at that point so I'm not sure there are enough commonly used transformations to make it worth the hassle.
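For reference, the user-side wrapper mentioned above (iterate over whatever `load_data()` returns, transform each item in-place, and yield it) could look roughly like this. It is only a sketch: plain dicts stand in for Configuration objects, a list iterator stands in for the return value of `load_data()`, and `with_transforms` and `shift_to_origin` are made-up names.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-in for a Configuration: a dict holding 1-D positions.
Config = Dict[str, List[float]]


def with_transforms(
    configs: Iterable[Config],
    transforms: List[Callable[[Config], None]],
) -> Iterator[Config]:
    """Lazily apply each in-place transform to every configuration."""
    for config in configs:
        for transform in transforms:
            transform(config)  # each transform mutates `config` in-place
        yield config


def shift_to_origin(config: Config) -> None:
    """Example transform: shift positions so the smallest one sits at 0."""
    offset = min(config["positions"])
    config["positions"] = [x - offset for x in config["positions"]]


# Stand-in for the generator/list that load_data() would return.
loaded = iter([{"positions": [1.0, 2.0, 4.0]}, {"positions": [3.0, 5.0]}])

transformed = list(with_transforms(loaded, [shift_to_origin]))
```

Because the wrapper is a generator, it keeps the lazy behavior of the input and can be passed along wherever the original generator was expected.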
I added a `transform` argument to `insert_data` which allows the user to provide a function which will be called to modify a Configuration in-place before it's added to the database. As @dskarls pointed out, this is little more than raw data manipulation, but it makes it easier to use `insert_data`'s built-in parallelization rather than a user having to parallelize the transformation themselves.
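As a rough illustration of the idea (not the actual colabfit-tools code or signature), the sketch below shows a toy `insert_data_sketch` that runs an optional in-place `transform` inside its own worker pool. The list-backed "database", the dict-based configurations, the thread pool, and the `ev_to_mev` transform are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a Configuration carrying a single property.
Config = Dict[str, float]


def insert_data_sketch(
    database: List[Config],
    configs: List[Config],
    transform: Optional[Callable[[Config], None]] = None,
    workers: int = 2,
) -> None:
    """Toy insert: transform each configuration in a worker, then store it."""

    def prepare(config: Config) -> Config:
        if transform is not None:
            transform(config)  # modifies `config` in-place before insertion
        return config

    # Threads share memory, so in-place changes are visible to the caller.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        database.extend(pool.map(prepare, configs))  # map preserves order


def ev_to_mev(config: Config) -> None:
    """Example transform: convert the stored energy from eV to meV."""
    config["energy"] *= 1000.0


db: List[Config] = []
insert_data_sketch(db, [{"energy": 1.5}, {"energy": -2.0}], transform=ev_to_mev)
```

The point is that the user writes only the per-configuration function; the parallel loop lives inside the insert call.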
The current method of applying transformations to a dataset with `apply_transform()` seems dangerous. In particular, it seems problematic because it would be easy to have a line in a script that repeatedly modifies the data every time the script is run.
A better idea might be to let `get_data()` (and all related functions, like `plot_histograms()`) take in a `transformations` argument that applies the transformations to the data before it returns it. This is safer because it isn't modifying the underlying Mongo database, but would be slower because it has to re-apply the transformations on every call of `get_data()`.
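The difference between the two approaches can be shown with a small sketch. Everything here is a stand-in: a plain dict plays the role of the Mongo database, and `apply_transform_in_place` and `get_data_sketch` are hypothetical helpers, not colabfit-tools functions.

```python
from typing import Callable, Dict, List, Sequence

Database = Dict[str, List[float]]  # stand-in for the Mongo-backed storage


def apply_transform_in_place(db: Database, fn: Callable[[float], float]) -> None:
    """Risky style: overwrite the stored values with the transformed ones."""
    db["energies"] = [fn(e) for e in db["energies"]]


def get_data_sketch(
    db: Database,
    transformations: Sequence[Callable[[float], float]] = (),
) -> List[float]:
    """Safer style: transform a copy on every read; storage is untouched."""
    data = list(db["energies"])
    for fn in transformations:
        data = [fn(e) for e in data]
    return data


def double(energy: float) -> float:
    return 2.0 * energy


# Transform-on-read: repeated calls give the same answer.
stored: Database = {"energies": [1.0, 2.0]}
first_read = get_data_sketch(stored, [double])
second_read = get_data_sketch(stored, [double])

# In-place: running the same "script line" twice compounds the change.
rerun_db: Database = {"energies": [1.0, 2.0]}
apply_transform_in_place(rerun_db, double)
apply_transform_in_place(rerun_db, double)
```

After the two in-place calls, the stored values have been doubled twice, whereas both reads through `get_data_sketch` return the same result and leave `stored` as it was.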