facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Multi dataset support #537

Open Giaco90 opened 2 years ago

Giaco90 commented 2 years ago

Hello all, I'm trying to pretrain an AE model using multiple datasets.

According to the documentation, VISSL supports the use of multiple datasets, but it doesn't seem to work properly:

...
for idx in indices:
    output = self.transform(sample["data"][idx])
    if self._is_transform_with_labels():
        sample["data"][idx] = output[0]
        sample["label"][-1] = output[1]  # always overwrites the last label
    else:
        sample["data"][idx] = output
...

That -1 will break the label assignment: if sample["label"] is initialised (and it always is, because we can only use sample_index or zero as LABEL_TYPE), it will always overwrite the last entry. Why isn't it indexed with 'idx', as sample["data"] is?
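A minimal sketch of the fix being suggested, with stand-ins for self.transform and self._is_transform_with_labels() (the function name and signature here are hypothetical, not VISSL's actual API):

```python
def apply_transforms(sample, indices, transform, transform_with_labels):
    """Apply `transform` to the selected data entries of a multi-dataset sample."""
    for idx in indices:
        output = transform(sample["data"][idx])
        if transform_with_labels:
            sample["data"][idx] = output[0]
            # Proposed fix: index the labels with `idx` as well,
            # instead of always overwriting the last one via [-1].
            sample["label"][idx] = output[1]
        else:
            sample["data"][idx] = output
    return sample
```

With this change each dataset's label lands in its own slot, matching the indexing used for sample["data"].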

At this point I think I can:

Is this correct? Or am I missing something? Do you have any suggestions? Is multi-dataset use really supported, or did I misunderstand it?

Thanks for your answer and sorry for the long post :D

Giacomo

EDIT: I think I found another possible bug when using dataset-specific transformations. When I specify "indices" in the yaml conf I get the following error:

TypeError: __init__() got multiple values for argument 'indices'

I think the problem is in the from_config of the transform wrapper here: the config.get should be a config.pop.
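A self-contained sketch of why config.get triggers that TypeError (the class below is a toy stand-in, not VISSL's actual wrapper): if "indices" is read with get, it stays in the config dict, so it is passed to __init__ both as an explicit keyword and again via **config.

```python
class TransformWrapper:
    """Toy stand-in for a transform wrapper built from a config dict."""

    def __init__(self, indices=None, **kwargs):
        self.indices = indices or []
        self.kwargs = kwargs

    @classmethod
    def from_config_broken(cls, config):
        indices = config.get("indices", [])   # "indices" stays in config...
        return cls(indices=indices, **config)  # ...so it's passed twice -> TypeError

    @classmethod
    def from_config_fixed(cls, config):
        indices = config.pop("indices", [])   # removed before splatting
        return cls(indices=indices, **config)
```

Calling from_config_broken with a config containing "indices" raises "TypeError: __init__() got multiple values for argument 'indices'", which matches the error above; from_config_fixed does not.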

QuentinDuval commented 2 years ago

Hi @Giaco90,

First of all, thanks a lot for your interest in VISSL.

I had a look at the code and it seems that the multi-dataset support of VISSL is indeed lacking. From what I understand, the current design only allows reading the datasets in parallel, fetching one sample from each at the same index:

        item = {"data": [], "data_valid": [], "data_idx": []}
        for data_source in self.data_objs:
            data, valid = data_source[subset_idx]
            item["data"].append(data)
            item["data_idx"].append(idx)
            item["data_valid"].append(1 if valid else -1)

I don't think this use case is actually super useful. I would think that we should instead have the two datasets being concatenated and so we would end up with a dataset of increased length. That's something that could be introduced (we could add a configuration somewhere to change the "composition strategy" of datasets) if you are interested.
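A rough sketch of what such a "concatenation" composition strategy could look like. This is a standalone illustration over anything indexable with __len__ and __getitem__ (the class name and structure are assumptions, not VISSL's actual design; torch.utils.data.ConcatDataset implements the same idea):

```python
class ConcatComposition:
    """Compose several datasets by concatenation: the combined length
    is the sum of the parts, and a global index is mapped to the
    dataset that owns it."""

    def __init__(self, datasets):
        self.datasets = datasets
        self.lengths = [len(d) for d in datasets]

    def __len__(self):
        return sum(self.lengths)

    def __getitem__(self, idx):
        # Walk the datasets, subtracting each length until the
        # global index falls inside the current one.
        for dataset, length in zip(self.datasets, self.lengths):
            if idx < length:
                return dataset[idx]
            idx -= length
        raise IndexError("index out of range")
```

Unlike the parallel design above, this handles datasets of different sizes naturally, which is the case raised later in this thread.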

Now, regarding the "-1" index here, I am pretty sure it is a bug (I looked at the past commits and can't understand why we would do this, unless we are in a situation where the number of labels is the same, but I cannot figure out such a case):

for idx in indices:
    output = self.transform(sample["data"][idx])
    if self._is_transform_with_labels():
        sample["data"][idx] = output[0]
        sample["label"][-1] = output[1]
    else:
        sample["data"][idx] = output

I will also have a look at the TypeError: __init__() got multiple values for argument 'indices'.

Thank you, Quentin

EDIT: by the way, I find the idea of using the transforms with labels for AE pretty smart :)

sujaybabruwad commented 2 years ago

Hi, I am also trying to use multiple datasets with different sizes. Is there any update on this front?