facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Multi dataset support #537

Open Giaco90 opened 2 years ago

Giaco90 commented 2 years ago

Hello all, I'm trying to pretrain an AE model using multiple datasets.

According to the documentation, VISSL supports the use of multiple datasets, but it doesn't seem to work properly:

...
for idx in indices:
    output = self.transform(sample["data"][idx])
    if self._is_transform_with_labels():
        sample["data"][idx] = output[0]
        sample["label"][-1] = output[1]  # always overwrites the last label
    else:
        sample["data"][idx] = output
...

That -1 will break the label assignment: if sample["label"] is initialised (and it always is, because we can only use sample_index or zero as LABEL_TYPE), it will always overwrite the last entry. Why isn't it indexed with 'idx', as sample["data"] is?
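A minimal sketch of the fix being suggested, with stand-ins for self.transform and self._is_transform_with_labels() (the function name and signature here are hypothetical, not VISSL's actual API):

```python
def apply_transforms(sample, indices, transform, transform_with_labels):
    """Apply `transform` to the selected data entries of a multi-dataset sample."""
    for idx in indices:
        output = transform(sample["data"][idx])
        if transform_with_labels:
            sample["data"][idx] = output[0]
            # Proposed fix: index the labels with `idx` as well,
            # instead of always overwriting the last one via [-1].
            sample["label"][idx] = output[1]
        else:
            sample["data"][idx] = output
    return sample
```

With this change each dataset's label lands in its own slot, matching the indexing used for sample["data"].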

At this point I think I can:

Is this correct? Or am I missing something? Do you have any suggestions? Is multi-dataset use really supported, or did I misunderstand it?

Thanks for your answer and sorry for the long post :D

Giacomo

EDIT: I think I found another possible bug when using dataset-specific transformations. When I specify "indices" in the yaml conf I get the following error:

TypeError: __init__() got multiple values for argument 'indices'

I think the problem is in the from_config of the transform wrapper here: the config.get should be a config.pop.
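A self-contained sketch of why config.get triggers that TypeError (the class below is a toy stand-in, not VISSL's actual wrapper): if "indices" is read with get, it stays in the config dict, so it is passed to __init__ both as an explicit keyword and again via **config.

```python
class TransformWrapper:
    """Toy stand-in for a transform wrapper built from a config dict."""

    def __init__(self, indices=None, **kwargs):
        self.indices = indices or []
        self.kwargs = kwargs

    @classmethod
    def from_config_broken(cls, config):
        indices = config.get("indices", [])   # "indices" stays in config...
        return cls(indices=indices, **config)  # ...so it's passed twice -> TypeError

    @classmethod
    def from_config_fixed(cls, config):
        indices = config.pop("indices", [])   # removed before splatting
        return cls(indices=indices, **config)
```

Calling from_config_broken with a config containing "indices" raises "TypeError: __init__() got multiple values for argument 'indices'", which matches the error above; from_config_fixed does not.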

QuentinDuval commented 2 years ago

Hi @Giaco90,

First of all, thanks a lot for your interest in VISSL.

I had a look at the code and it seems that the multi-dataset support of VISSL is indeed lacking. From what I understand, the current design only allows reading the datasets in parallel, fetching one sample from each at the same index:

        item = {"data": [], "data_valid": [], "data_idx": []}
        for data_source in self.data_objs:
            data, valid = data_source[subset_idx]
            item["data"].append(data)
            item["data_idx"].append(idx)
            item["data_valid"].append(1 if valid else -1)

I don't think this use case is actually super useful. I would think that we should instead have the two datasets being concatenated and so we would end up with a dataset of increased length. That's something that could be introduced (we could add a configuration somewhere to change the "composition strategy" of datasets) if you are interested.
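A rough sketch of what such a "concatenation" composition strategy could look like. This is a standalone illustration over anything indexable with __len__ and __getitem__ (the class name and structure are assumptions, not VISSL's actual design; torch.utils.data.ConcatDataset implements the same idea):

```python
class ConcatComposition:
    """Compose several datasets by concatenation: the combined length
    is the sum of the parts, and a global index is mapped to the
    dataset that owns it."""

    def __init__(self, datasets):
        self.datasets = datasets
        self.lengths = [len(d) for d in datasets]

    def __len__(self):
        return sum(self.lengths)

    def __getitem__(self, idx):
        # Walk the datasets, subtracting each length until the
        # global index falls inside the current one.
        for dataset, length in zip(self.datasets, self.lengths):
            if idx < length:
                return dataset[idx]
            idx -= length
        raise IndexError("index out of range")
```

Unlike the parallel design above, this handles datasets of different sizes naturally, which is the case raised later in this thread.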

Now, regarding the "-1" index here, I am pretty sure it is a bug (I looked at the past commits and can't understand why we would do this, unless we are in a situation where the number of labels is the same, but I cannot figure out such a case):

for idx in indices:
    output = self.transform(sample["data"][idx])
    if self._is_transform_with_labels():
        sample["data"][idx] = output[0]
        sample["label"][-1] = output[1]
    else:
        sample["data"][idx] = output

I will also have a look at the TypeError: __init__() got multiple values for argument 'indices'.

Thank you, Quentin

EDIT: by the way, I find the idea of using the transforms with labels for AE pretty smart :)

sujaybabruwad commented 2 years ago

Hi, I am also trying to use multiple datasets with different sizes. Is there any update on this front?