Decoupling DatasetAdapter from the AnalysisManager

Louis-Dupont commented 1 year ago

Motivation

We want to group all the "adapter/processing" logic into a DatasetAdapter class.

- Remove the concepts of processing, preprocessing, etc. from the AnalysisManager. This will all be handled by the DatasetAdapter.
- Ability to use the same object in other contexts:
  - It can be instantiated from cache outside of the AnalysisManager and independently of it.
  - It can be used in SG, with some extra changes.

Note: There is still some coupling between the DatasetAdapter and the AnalysisManager (there is no way around this) because we want to be able to plug the DatasetAdapter into the AnalysisManager easily. This is handled by the samples_iterator method, which returns the samples. I am not 100% sure about this but could not find a better way (open to suggestions).

BloodAxe commented 1 year ago

I don't understand this PR.

For those out of context (me), can you please elaborate on why we need to change the current design? It is not clear what this PR attempts to solve. If it's an enabler for another feature, OK, but which one?

I'd also love to see some usage examples showing where this new concept is intended to be used.

Louis-Dupont commented 1 year ago

I don't understand this PR.

For those out of context (me), can you please elaborate on why we need to change the current design? It is not clear what this PR attempts to solve. If it's an enabler for another feature, OK, but which one?

It is mainly an enabler, but it can almost be seen as a feature.

The original goal is to have a simple way to load a dataset in SG after running the analysis in DG. There are 2 main blocking issues with that:

- Currently, the "Adapter" logic is completely nested inside the AnalysisManager.
- The DataConfig, which holds many dataset attributes and is responsible for asking questions and saving to cache, is also instantiated inside the AnalysisManager.

The idea of this PR is to create a class (DatasetAdapter) that groups the DataConfig and the Adapter/Processing logic that uses it. This would also increase code cohesion and (slightly) lower the coupling with the rest of the code. The DatasetAdapter will then be instantiable outside of the AnalysisManager.
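To make the grouping concrete, here is a minimal sketch of the idea, not the actual code of this PR. The `DataConfig` and `DatasetAdapter` below are simplified stand-ins, and the `is_label_first` attribute, the `cache_path` parameter and the JSON layout are all assumptions made up for illustration.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Optional, Tuple


@dataclass
class DataConfig:
    """Simplified stand-in: holds dataset attributes and reads/writes them to a cache file."""

    cache_path: Optional[str] = None
    is_label_first: Optional[bool] = None  # illustrative attribute, not the real field list

    def load(self) -> None:
        # Fill in any attribute that was not passed explicitly from the cache, if it exists.
        if self.cache_path and os.path.exists(self.cache_path):
            with open(self.cache_path) as f:
                cached = json.load(f)
            if self.is_label_first is None:
                self.is_label_first = cached.get("is_label_first")

    def save(self) -> None:
        if self.cache_path:
            with open(self.cache_path, "w") as f:
                json.dump({"is_label_first": self.is_label_first}, f)


class DatasetAdapter:
    """Hypothetical adapter: owns the DataConfig and the formatting logic that depends on it."""

    def __init__(self, data_iterable: Iterable[Tuple[Any, Any]], config: DataConfig):
        self.data_iterable = data_iterable
        self.config = config
        self.config.load()

    def __iter__(self) -> Iterator[Tuple[Any, Any]]:
        # Always yield (image, label), whatever order the raw dataset uses.
        # An unset (None) attribute is treated like False in this sketch.
        for first, second in self.data_iterable:
            yield (second, first) if self.config.is_label_first else (first, second)


if __name__ == "__main__":
    cache = os.path.join(tempfile.mkdtemp(), "MyCustomDataset.json")
    DataConfig(cache_path=cache, is_label_first=True).save()  # e.g. written by an earlier run
    for image, label in DatasetAdapter([("label_0", "image_0")], DataConfig(cache_path=cache)):
        print(image, label)
```

The point of the sketch is only that the adapter can be built from a cache file without any AnalysisManager being involved.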

Scenario 1

Someone runs DG as usual:

analyzer = DetectionAnalysisManager(
    report_title="Testing Data-Gradients demo",
    cache_name="MyCustomDataset.json",
    train_data=custom_train_dataset,
    val_data=custom_val_dataset,
    class_names=class_names,
).run()

They can then wrap their dataset and benefit from the cached values. The code below would run directly and output images/targets in our format (label_xyxy, I think), with the image in the right format as well.

train_data = DetectionDatasetAdapter(
    data_iterable=custom_train_dataset,
    cache_filename="MyCustomDataset.json",
)

for image, label_xyxy in train_data:
    ...

This is the target for SG: we would then wrap this in a DG dataset to include all the transforms.

Scenario 2

Someone can always use the dataset adapter completely independently of the AnalysisManager

train_data = DetectionDatasetAdapter(data_iterable=custom_dataset)  # optionally pass extra parameters to be asked fewer questions

for image, xyxy in train_data:
    ...

Then, on the first iteration, the user will be asked any question that is required to format the image/targets, similarly to what is done when running AnalysisManager.
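As a rough illustration of that "ask on first iteration" behaviour, here is a small sketch. It is not the real DetectionDatasetAdapter: the class name `LazyQuestionAdapter`, the `answers` and `ask` parameters, and the single `bbox_format` question are assumptions chosen to keep the example short.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Box = Tuple[float, float, float, float]


class LazyQuestionAdapter:
    """Hypothetical adapter that defers user questions until iteration actually starts."""

    def __init__(
        self,
        data_iterable: Iterable[Tuple[Any, List[Box]]],
        answers: Optional[Dict[str, str]] = None,  # pre-filled answers mean fewer questions
        ask: Callable[[str], str] = input,         # injectable so it can be tested
    ):
        self.data_iterable = data_iterable
        self.answers = dict(answers or {})
        self._ask = ask

    def _resolve(self, key: str, question: str) -> str:
        # Ask only if the answer is not already known, then remember it.
        if key not in self.answers:
            self.answers[key] = self._ask(question)
        return self.answers[key]

    def __iter__(self) -> Iterator[Tuple[Any, List[Box]]]:
        # This is a generator, so nothing below runs until the first item is requested.
        bbox_format = self._resolve("bbox_format", "Bounding-box format (xyxy or xywh)? ")
        for image, boxes in self.data_iterable:
            if bbox_format == "xywh":
                boxes = [(x, y, x + w, y + h) for x, y, w, h in boxes]
            yield image, boxes  # boxes are always xyxy at this point
```

Passing `answers={"bbox_format": "xywh"}` up front plays the role of the "extra parameters" mentioned above: the question is then never asked.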

Scenario 3

Not sure if this is useful, but it would still work

train_data = DetectionDatasetAdapter(
    cache_name="MyCustomTrainSet.json",
    data=custom_train_dataset,
    class_names=class_names,
)

val_data = DetectionDatasetAdapter(
    cache_name="MyCustomValSet.json",
    data=custom_val_dataset,
    class_names=class_names,
)

analyzer = DetectionAnalysisManager(
    report_title="Testing Data-Gradients demo",
    cache_name="MyCustomDataset.json",
    train_data=train_data,
    val_data=val_data,
    class_names=class_names,
).run()


train_data = DetectionDatasetAdapter(data_iterable=custom_dataset)

for image, xyxy in train_data:
    ...

Louis-Dupont commented 1 year ago

Update: I just added src/data_gradients/sample_iterables/base.py. The motivation was to take out get_iterator, which was defined in the DatasetAdapter and returned an iterator of ImageSample objects, and instead have a dedicated class responsible for that. This way, the DatasetAdapter has a clearer responsibility.
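Roughly, the split looks like the sketch below. This is not the content of base.py: the `SampleIterable` name, its constructor arguments, and the fields of the `ImageSample` stand-in are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Tuple


@dataclass
class ImageSample:
    """Simplified stand-in for the real ImageSample (field names are assumptions)."""

    sample_id: str
    split: str
    image: Any
    target: Any


class SampleIterable:
    """Hypothetical class whose only job is to turn adapted (image, target) pairs into samples."""

    def __init__(self, dataset: Iterable[Tuple[Any, Any]], split: str):
        self.dataset = dataset  # e.g. a DatasetAdapter that already yields formatted pairs
        self.split = split

    def __iter__(self) -> Iterator[ImageSample]:
        for index, (image, target) in enumerate(self.dataset):
            yield ImageSample(
                sample_id=f"{self.split}_{index}",
                split=self.split,
                image=image,
                target=target,
            )
```

With this separation, the adapter only formats data and the iterable only builds samples, so the AnalysisManager would consume the latter.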

Deci-AI / data-gradients