Closed: just-eoghan closed this issue 2 years ago
I am not sure why this is the validation dataset. Do we validate only on the target domain? How would we handle this validation set if the target domain is unlabelled? If you could explain why this is the case, I would appreciate some insight.
This is an unsolved problem in unsupervised domain adaptation. We want high accuracy on the unlabeled target domain, but since it is unlabeled, it is difficult to determine the model's performance.
Whether or not we validate only on the target domain depends on the type of validator. The IMValidator uses only the target domain to compute a validation score, which is why the validation dataloaders returned by filter_datasets consist of only the target domain:
from pytorch_adapt.validators import IMValidator

validator = IMValidator()  # uses only the target domain to compute a validation score
self.dataloaders = dc(**filter_datasets(datasets, validator))  # dc is a DataloaderCreator instance
You could use a validator that adds the source validation accuracy to the IM score:
from pytorch_adapt.validators import MultipleValidators, AccuracyValidator, IMValidator
validator = MultipleValidators([AccuracyValidator(), IMValidator()])
self.dataloaders = dc(**filter_datasets(datasets, validator))
Now the validation dataloaders should consist of the src_val set and the target_train set. Note that the target_train set is used for validation, because it is assumed that the target_val set is reserved for testing. (This is a bit confusing, but it's the most realistic setting in my opinion. I can expand on this if you want.)
You can make the IMValidator use target_val instead of target_train like this:
validator = IMValidator(key_map={"target_val": "target_train"})
In summation, what I am looking for guidance on is how to use something like torch.utils.data.random_split to take some of the source and target data and use the DataloaderCreator to pass back test sets along with train and val. Is this possible within the framework?
You can split the datasets however you want, as long as the DataloaderCreator recognizes the names of the splits you pass in:
from pytorch_adapt.datasets import DataloaderCreator

dc = DataloaderCreator(val_names=["src_val", "target_val", "src_test", "target_test"])
dataloaders = dc(src_val=dataset1, target_val=dataset2, src_test=dataset3, target_test=dataset4)
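For example, here is a rough sketch of that kind of split. I'm treating src_dataset and target_dataset as placeholder names for whatever torch Datasets you start with, and I'm assuming DataloaderCreator also accepts a train_names argument (analogous to val_names above) for the training splits:

import torch
from torch.utils.data import random_split

from pytorch_adapt.datasets import DataloaderCreator

def three_way_split(dataset, val_frac=0.1, test_frac=0.1, seed=42):
    # Split a dataset into train/val/test subsets, with a fixed seed for reproducibility.
    n = len(dataset)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    lengths = [n - n_val - n_test, n_val, n_test]
    return random_split(dataset, lengths, generator=torch.Generator().manual_seed(seed))

# src_dataset / target_dataset are placeholders for your existing datasets.
src_train, src_val, src_test = three_way_split(src_dataset)
target_train, target_val, target_test = three_way_split(target_dataset)

dc = DataloaderCreator(
    train_names=["src_train", "target_train"],  # assumed argument, see note above
    val_names=["src_val", "target_val", "src_test", "target_test"],
)
dataloaders = dc(
    src_train=src_train,
    target_train=target_train,
    src_val=src_val,
    target_val=target_val,
    src_test=src_test,
    target_test=target_test,
)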
Let me know if you have more questions!
Hi Kevin,
Thanks for your answer, it clears everything up nicely.
I think some of those examples would be a super inclusion in the docs!
It would be good to get a little more info on this
it is assumed that the target_val set is reserved for testing
As it's still somewhat confusing to me.
I think some of those examples would be a super inclusion in the docs!
Yeah the docs need to be updated very badly. It's something I should prioritize.
It would be good to get a little more info on this
it is assumed that the target_val set is reserved for testing
As it's still somewhat confusing to me.
In a real-world application, you have unlabeled target data that is not yet split into train/val/test, etc. To maximize performance, it's best to train on as much target data as possible. So I would use all the target data for training, and validate on it as well. If I split the data into train/val, then I'm reducing the amount of target data to train on, which will likely reduce performance.
Would this cause overfitting on the target training set? My opinion is no, because the data that really matters is the data the model sees after deployment.
After the model is trained, you deploy it in your application, and it is fed unseen data, i.e. the test set. So in an academic setting, I train and validate on "target_train", and use "target_val" as the unseen data. It could be called "target_test" instead.
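As a concrete sketch of that academic setup, using only the pieces already shown above (nothing new API-wise):

from pytorch_adapt.validators import IMValidator

# - the "train" split (source + unlabeled target_train) is what the model trains on
# - IMValidator scores the unlabeled target_train set during training, so the same
#   target data is reused for validation
# - target_val is never touched until the very end, i.e. it plays the role of an
#   unseen "target_test" set
validator = IMValidator()
self.dataloaders = dc(**filter_datasets(datasets, validator))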
That's great. I understand that approach now.
I think what was catching me was that I personally always use "test" as the name for the hold-out set and "validation" for the set of unseen data used during training to get a feel for performance. But validation and test are interchangeable terms for the hold-out set, so either works.
On the docs, if you are interested I may be able to help with them. I've just found the library this week, but I've been reading through the code-base to better understand it, so I may be useful enough to assist with the docs eventually!
Thank you for your help.
On the docs, if you are interested I may be able to help with them. I've just found the library this week, but I've been reading through the code-base to better understand it, so I may be useful enough to assist with the docs eventually!
That would be greatly appreciated! I'll try to update the docs this weekend. Right now they're way out of date and also a hassle to edit. I'll let you know when I make the change.
@just-eoghan I updated the docs. Re: contributing to the docs, that would be very helpful, but I think example Jupyter notebooks would be even better, since I assume that's where most people look first.
Hello,
Well done on putting together this library. I think it will be extremely useful for many people undertaking domain adaptation projects.
I am wondering how to create a test dataset using the DataloaderCreator class.
Some background on my issue.
I am using the MNISTM example within a PyTorch Lightning data module.
Adapting the code from examples/DANNLightning.ipynb, I have the following code.
self.dataloaders produces the following object.
This handles train and val for source and target as well as creating a conjoined train dataset.
Going by the example ipynb, the concat dataset for train (of source and target) is used as the training dataset for the model.
The validation set is a list of the remaining keys in the data-loader and has the following form.
I am not sure why this is the validation dataset. Do we validate only on the target domain? How would we handle this validation set if the target domain is unlabelled? If you could explain why this is the case, I would appreciate some insight.
In summation, what I am looking for guidance on is how to use something like torch.utils.data.random_split to take some of the source and target data and use the DataloaderCreator to pass back test sets along with train and val. Is this possible within the framework?
Many thanks, Eoghan