Open Giaco90 opened 2 years ago
Hi @Giaco90,
First of all, thanks a lot for your interest in VISSL.
I had a look at the code and it seems that the multi-dataset support of VISSL is really lacking indeed. From what I understand, the current design allows to:
item = {"data": [], "data_valid": [], "data_idx": []}
for data_source in self.data_objs:
data, valid = data_source[subset_idx]
item["data"].append(data)
item["data_idx"].append(idx)
item["data_valid"].append(1 if valid else -1)
I don't think this use case is actually super useful. I would think that we should instead have the two datasets being concatenated and so we would end up with a dataset of increased length. That's something that could be introduced (we could add a configuration somewhere to change the "composition strategy" of datasets) if you are interested.
Now, regarding the "-1" index here, I am pretty sure it is a bug (I looked at the past commits and can't understanding why we would do this, unless we are in a situation where the number of labels is the name same, but I cannot figure such as case):
for idx in indices:
output = self.transform(sample["data"][idx])
if self._is_transform_with_labels():
sample["data"][idx] = output[0]
sample["label"][-1] = output[1]
else:
sample["data"][idx] = output
I will also have a look at the TypeError: __init__() got multiple values for argument 'indices'
.
Thank you, Quentin
EDIT: by the way, I find the idea of using the transforms with labels for AE pretty smart :)
HI, I am also trying to use multiple datasets with different sizes. Is there any update on this front?
Hello all, I'm trying to pretrain an AE model using multiple datasets.
According to documentation VISSL support the use of multiple datasets, but it seems to not work properly:
that -1 will break the label assignment because if sample["label"] is initialised (and it is, because we can only use sample_index or zero as LABEL_TYPE) it will ever overwrite the last one. Why it is not indexed with 'idx' as for sample["data"]?
At this point I think I can:
create a transformation that generate the labels and collate at the same time (and use a default collator);(not feasible)Is this correct? Or am I missing something? Do you have any suggestion? Is the multi-dataset really supported or I miss understood it?
Thanks for your answer and sorry for the long post :D
Giacomo
EDIT: I think I found another possible bug when using dataset-specific transformations. When I specify "indices" in the yaml conf I get the following error:
TypeError: __init__() got multiple values for argument 'indices'
I think the problem is in the from_config of the transform wrapper here, the
config.get
should be aconfig.pop