Open schinto opened 3 years ago
Hi! Currently TorchDrug doesn't handle missing values in the dataset. One solution is to mask out the samples with missing labels by creating a sample mask and apply it in torch.utils.data.Subset
, though this may not best fit multi-task setting.
In general, we may accept a feature request for missing values in property prediction. However, I am not sure what a robust solution is for missing values in multi-task predictions. If the community can agree at some solutions, we can include it as a part of TorchDrug.
Hi, I tried to build a property prediction model using the OPV dataset. See code below. Training a GIN model using all 8 tasks fails due to missing values in the 4 subtasks ending in
_extrapolated
. However, model training does not stop even when all values getnan
. When the 4 subtasks with missing values are excluded model training works fine.How does torchdrug deal with missing values in subtasks?
I'm asking as I would like to find out how robust multitask GIN models are to data sparsity. See Effect of missing data on multitask prediction methods
Thanks!