Deci-AI / data-gradients

Computer Vision dataset analysis
Apache License 2.0

Decoupling DatasetAdapter from the AnalysisManager #163

Closed Louis-Dupont closed 1 year ago

Louis-Dupont commented 1 year ago

Motivation

We want to group all the "adapter/processing" logic into a DatasetAdapter class.

Note: There is still some coupling between this DatasetAdapter and the AnalysisManager (there is no way to avoid it), because we want to be able to plug the DatasetAdapter into the AnalysisManager easily. This is handled by the samples_iterator method, which returns the samples. I am not 100% sure about this, but I could not find a better way (open to suggestions).

BloodAxe commented 1 year ago

I don't understand this PR.

For those out of context (me), can you please elaborate on why we need to change the current design? It is not clear what this PR attempts to solve. If it's an enabler for another feature, OK, which one?

I'd love to also see some usage examples showing where this new concept is intended to be used.

Louis-Dupont commented 1 year ago

> I don't understand this PR.
>
> For those out of context (me), can you please elaborate on why we need to change the current design? It is not clear what this PR attempts to solve. If it's an enabler for another feature, OK, which one?

It is mainly an enabler, but it can almost be seen as a feature.

The original goal is to have a simple way to load a dataset in SG (SuperGradients) after running the analysis in DG (DataGradients). There are 2 main blocking issues with that.

The idea of this PR is to create a class (DatasetAdapter) that groups the DataConfig and the adapter/processing logic that uses it. This would also increase code cohesion and (slightly) lower the coupling with the rest of the code. The DatasetAdapter can then be instantiated outside of the AnalysisManager.
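To make the grouping concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: `DataConfigSketch`, `DatasetAdapterSketch`, and the two extractor fields are hypothetical names chosen for illustration, and the real DataConfig holds more than two callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Tuple


@dataclass
class DataConfigSketch:
    """Hypothetical stand-in for DataConfig: what is needed to parse a raw sample."""

    images_extractor: Callable[[Any], Any]
    labels_extractor: Callable[[Any], Any]


class DatasetAdapterSketch:
    """Hypothetical adapter: owns the config and the processing logic that uses it."""

    def __init__(self, data_iterable: Iterable[Any], config: DataConfigSketch):
        self.data_iterable = data_iterable
        self.config = config

    def __iter__(self) -> Iterator[Tuple[Any, Any]]:
        # All format-specific knowledge lives here, not in the analysis code.
        for raw_sample in self.data_iterable:
            image = self.config.images_extractor(raw_sample)
            labels = self.config.labels_extractor(raw_sample)
            yield image, labels


# The adapter is built and iterated without any AnalysisManager involved.
raw_dataset = [{"img": "image_0", "boxes": [[0, 10, 20, 30, 40]]}]
config = DataConfigSketch(
    images_extractor=lambda sample: sample["img"],
    labels_extractor=lambda sample: sample["boxes"],
)
adapted = list(DatasetAdapterSketch(raw_dataset, config))
```

Because the adapter only needs an iterable and a config, anything that consumes `(image, labels)` pairs can use it, whether that is the analysis code or a training pipeline.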

Scenario 1

Someone ran DG as usual:

analyzer = DetectionAnalysisManager(
    report_title="Testing Data-Gradients demo",
    cache_name="MyCustomDataset.json",
    train_data=custom_train_dataset,
    val_data=custom_val_dataset,
    class_names=class_names,
).run()

They can then wrap their dataset and benefit from the cached values. The code below would run directly and output images/targets in our format (label_xyxy, I think), with the image in the right format as well.

train_data = DetectionDatasetAdapter(
    data_iterable=custom_train_dataset,
    cache_filename="MyCustomDataset.json",
)

for image, label_xyxy in train_data:
    ...

This is the target for SG - we would then wrap this in a DG dataset to include all the transforms.

Scenario 2

Someone can always use the dataset adapter completely independently of the AnalysisManager:

train_data = DetectionDatasetAdapter(data_iterable=custom_dataset)  # optionally pass extra parameters to be asked fewer questions

for image, xyxy in train_data:
    ...

Then, on the first iteration, the user will be asked any questions required to format the images/targets, similarly to what is done when running the AnalysisManager.
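The "ask on first iteration, then reuse the cached answer" behaviour could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class name `LazyQuestionAdapter`, the question text, and the JSON layout are hypothetical and are not the actual DG cache format. A scripted callable stands in for the interactive prompt so the example runs unattended.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


class LazyQuestionAdapter:
    """Hypothetical adapter that asks a question only when iteration needs it."""

    def __init__(
        self,
        data_iterable: Iterable[Dict[str, Any]],
        ask: Callable[[str], str],
        cache_path: Optional[str] = None,
    ):
        self.data_iterable = data_iterable
        self.ask = ask
        self.cache_path = cache_path
        self.answers: Dict[str, str] = {}
        # Reuse answers from a previous run if a cache file exists.
        if cache_path and os.path.exists(cache_path):
            with open(cache_path) as cache_file:
                self.answers = json.load(cache_file)

    def _resolve(self, question: str) -> str:
        # Ask only when the answer is unknown, then persist it.
        if question not in self.answers:
            self.answers[question] = self.ask(question)
            if self.cache_path:
                with open(self.cache_path, "w") as cache_file:
                    json.dump(self.answers, cache_file)
        return self.answers[question]

    def __iter__(self) -> Iterator[Any]:
        image_key = self._resolve("Which key holds the image?")
        for sample in self.data_iterable:
            yield sample[image_key]


asked = []


def scripted_answer(question: str) -> str:
    """Stand-in for an interactive prompt; records each question asked."""
    asked.append(question)
    return "img"


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "answers.json")
    dataset = [{"img": "image_0"}, {"img": "image_1"}]
    first = list(LazyQuestionAdapter(dataset, scripted_answer, cache_path))
    # A second adapter pointing at the same cache file asks nothing.
    second = list(LazyQuestionAdapter(dataset, scripted_answer, cache_path))
```

The second adapter corresponds to Scenario 1, where a cache written by an earlier run is picked up; the first corresponds to Scenario 2, where the question is asked once during the first pass over the data.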

Scenario 3

Not sure if this is useful, but it would still work:

train_data = DetectionDatasetAdapter(
    cache_name="MyCustomTrainSet.json",
    data=custom_train_dataset,
    class_names=class_names,
)

val_data = DetectionDatasetAdapter(
    cache_name="MyCustomValSet.json",
    data=custom_val_dataset,
    class_names=class_names,
)

analyzer = DetectionAnalysisManager(
    report_title="Testing Data-Gradients demo",
    cache_name="MyCustomDataset.json",
    train_data=train_data,
    val_data=val_data,
    class_names=class_names,
).run()

train_data = DetectionDatasetAdapter(data_iterable=custom_dataset)

for image, xyxy in train_data:
    ...

Louis-Dupont commented 1 year ago

Update: I just added src/data_gradients/sample_iterables/base.py. The motivation was to take get_iterator, which was defined in the DatasetAdapter and returned an iterator of ImageSample objects, out of the adapter and have a dedicated class responsible for that instead. This way, DatasetAdapter has a clearer responsibility.
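As a rough illustration of that split (not the contents of base.py), the sketch below keeps the adapter responsible only for yielding formatted `(image, labels)` pairs and moves the construction of sample objects into a separate iterable. `ImageSampleSketch` and `SamplesIterableSketch` are hypothetical names, and the fields shown are assumptions rather than the real ImageSample attributes.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Tuple


@dataclass
class ImageSampleSketch:
    """Hypothetical stand-in for ImageSample."""

    sample_id: str
    split: str
    image: Any
    labels: Any


class SamplesIterableSketch:
    """Hypothetical class whose only job is to turn adapter output into samples."""

    def __init__(self, adapter: Iterable[Tuple[Any, Any]], split: str):
        self.adapter = adapter
        self.split = split

    def __iter__(self) -> Iterator[ImageSampleSketch]:
        for index, (image, labels) in enumerate(self.adapter):
            yield ImageSampleSketch(
                sample_id=f"{self.split}_{index}",
                split=self.split,
                image=image,
                labels=labels,
            )


# Any iterable of (image, labels) pairs can play the adapter role here.
adapter_output = [("image_0", [[0, 1, 2, 3, 4]]), ("image_1", [])]
samples = list(SamplesIterableSketch(adapter_output, split="train"))
```

With this shape, the adapter does not need to know about sample objects at all, and the analysis code depends only on the samples iterable.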