ATOMScience-org / AMPL

The ATOM Modeling PipeLine (AMPL) is an open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery.
MIT License

Load pre-calculated features with embedded models #301

Closed: stewarthe6 closed this 4 months ago

stewarthe6 commented 5 months ago

This PR allows AMPL to use pre-calculated features with embedded models and transfer learning. I created three classes to accomplish this.

  1. EmbeddingDataset: This overrides get_featurized_data and save_featurized_data. This dataset is meant to be used exclusively with EmbeddingFeaturization. It creates a second, member dataset that loads or calculates the features used as input to the embedding model, and then generates the embedded features from them. The save_featurized_data function does nothing, since it cannot save embedded features. However, the member dataset can still save its own features.
  2. FileEmbeddingDataset: This inherits EmbeddingDataset and FileDataset. This is used when the input features come from a file.
  3. DatastoreEmbeddingDataset: This inherits EmbeddingDataset and DatastoreDataset and is used when input features come from the datastore. I have not tested this class because I don't have a good test case.
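
For reviewers, the relationship between the three classes can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: the base classes here are simplified stand-ins for AMPL's dataset classes, and the `params` dictionary keys, the `member` and `member_cls` attributes, the `load_features` helper, and the `embed_fn` callable are all hypothetical names chosen for the example.

```python
# Simplified stand-ins for AMPL's dataset classes (assumptions, not the real API).
class ModelDataset:
    def __init__(self, params):
        self.params = params
        self.features = None

    def load_features(self):
        raise NotImplementedError

    def get_featurized_data(self):
        # Load (or calculate) the features for this dataset.
        self.features = self.load_features()
        return self.features

    def save_featurized_data(self, features):
        # Stand-in for writing features to disk or to the datastore.
        self.saved = list(features)


class FileDataset(ModelDataset):
    def load_features(self):
        # Hypothetical: pre-calculated features read from a file.
        return list(self.params["file_rows"])


class DatastoreDataset(ModelDataset):
    def load_features(self):
        # Hypothetical: pre-calculated features read from the datastore.
        return list(self.params["datastore_rows"])


class EmbeddingDataset(ModelDataset):
    member_cls = None  # set by subclasses to pick the input source

    def __init__(self, params, embed_fn):
        super().__init__(params)
        self.embed_fn = embed_fn  # hypothetical stand-in for the embedding model
        # Second, member dataset that supplies the embedding model's inputs.
        self.member = self.member_cls(params)

    def get_featurized_data(self):
        # Input features come from the member dataset, then get embedded.
        inputs = self.member.get_featurized_data()
        self.features = [self.embed_fn(x) for x in inputs]
        return self.features

    def save_featurized_data(self, features=None):
        # Intentionally a no-op: embedded features are not saved.
        pass


class FileEmbeddingDataset(EmbeddingDataset, FileDataset):
    member_cls = FileDataset


class DatastoreEmbeddingDataset(EmbeddingDataset, DatastoreDataset):
    member_cls = DatastoreDataset


# Usage: embed features that were pre-calculated and stored in a file.
ds = FileEmbeddingDataset({"file_rows": [1, 2, 3]}, embed_fn=lambda x: x * 10)
embedded = ds.get_featurized_data()
ds.save_featurized_data(embedded)                 # does nothing
ds.member.save_featurized_data(ds.member.features)  # member dataset can still save
```

The point of the member dataset is that loading and saving of input features stays in the existing file or datastore code path, while the embedding step is layered on top.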

featurize_data in EmbeddingFeaturization no longer needs to rename response columns.