Closed drugilsberg closed 2 years ago
It might depend on the instance of SmilesLanguage that is passed to the SMILESTokenizerDataset: https://github.com/PaccMann/paccmann_datasets/blob/63a6441190f6b5098fecc59d50fc5cf3e1f32de2/pytoda/datasets/smiles_dataset.py#L208
The case where no smiles_language is passed, but one is constructed internally at construction time, is well tested.
However, what SmilesLanguage.smiles_to_token_indexes returns boils down to what SmilesLanguage.transform_encoding returns. In the case of the base SmilesLanguage (vs. SmilesTokenizer), that is currently an identity function, so it will indeed return a list, not a tensor, of token indexes. We could/should fix that default transform_encoding to return a tensor as well.
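To make the current behaviour concrete, here is a minimal stdlib-only sketch (the class is a simplified stand-in, not the actual pytoda implementation): because transform_encoding is an identity in the base class, smiles_to_token_indexes yields a plain list of ints.

```python
# Hypothetical, simplified stand-in for the base SmilesLanguage.
from typing import List


class SmilesLanguageSketch:
    def __init__(self):
        self.vocab = {"C": 1, "O": 2, "N": 3}

    def transform_encoding(self, token_indexes: List[int]) -> List[int]:
        # identity in the base class: a List[int] comes in, a List[int] goes out
        return token_indexes

    def smiles_to_token_indexes(self, smiles: str) -> List[int]:
        indexes = [self.vocab[token] for token in smiles]
        return self.transform_encoding(indexes)


lang = SmilesLanguageSketch()
print(lang.smiles_to_token_indexes("CO"))  # [1, 2] -- a list, not a tensor
```

Fixing the default would mean making transform_encoding cast to a tensor instead of passing the list through.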
But this brings up an issue I wanted to discuss anyway:
There is no requirement or guarantee on the return type of indexing a Dataset (e.g. the snippet above indexes SMILESDataset, which returns a string).
On the other hand, pytoda's Datasets are used with standard DataLoaders, which are pretty generic regarding the input type.
I actually think it might be a bad idea that we cast to tensor and fix the device already at dataset indexing. I think that should be the job of the DataLoader, as exemplified for multi-GPU training by PyTorch Lightning: https://pytorch-lightning.readthedocs.io/en/0.7.2/multi_gpu.html#remove-samplers
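The proposed split of responsibilities can be sketched without any torch dependency (all names here are hypothetical stand-ins, not pytoda API): the dataset returns plain token-index lists, and the batching step, which in real code would be a DataLoader collate_fn, is where casting (and, inside the training loop, device transfer) would happen.

```python
# Stdlib-only sketch: dataset yields raw indexes, collation does the casting.
from typing import List


class IndexDataset:
    def __init__(self, samples: List[List[int]]):
        self.samples = samples

    def __getitem__(self, idx: int) -> List[int]:
        # no tensor cast and no device argument at indexing time
        return self.samples[idx]


def collate(batch: List[List[int]]) -> List[List[int]]:
    # with torch this would pad and call torch.tensor(...); the device
    # transfer would then happen in the training loop / Lightning hooks
    width = max(len(sample) for sample in batch)
    return [sample + [0] * (width - len(sample)) for sample in batch]


ds = IndexDataset([[1, 2], [3]])
print(collate([ds[0], ds[1]]))  # [[1, 2], [3, 0]]
```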
Just seeing this. Basically, the issue is that since 0.2.0 the notion of SMILESLanguage has changed: what used to be the SMILESLanguage is now a SMILESTokenizer. This should be reflected in the documentation of all the dataset classes, since effectively you should pass a SMILESTokenizer. If you pass a SMILESLanguage instead, strings will be returned instead of tensors.
It should also be noted that if a SMILESTokenizer is passed, all the transforms are set in the tokenizer. Hence, in this case all the SMILES config parameters that are passed to e.g. DrugSensitivityDataset (such as canonical=True) are not used. I'm not a huge fan of this setup since it's a bit unintuitive.
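A hedged illustration of that unintuitive behaviour (the classes below are simplified hypothetical stand-ins for the pytoda ones): when a pre-built tokenizer is passed, the dataset-level SMILES kwargs are silently unused because the transform configuration already lives inside the tokenizer.

```python
# Stand-in classes only; not the actual pytoda implementation.
class TokenizerSketch:
    def __init__(self, canonical: bool):
        self.canonical = canonical  # transform config is baked in here


class DatasetSketch:
    def __init__(self, smiles_language=None, canonical: bool = False):
        if smiles_language is not None:
            # kwargs such as canonical are NOT applied in this branch
            self.tokenizer = smiles_language
        else:
            self.tokenizer = TokenizerSketch(canonical=canonical)


tok = TokenizerSketch(canonical=False)
ds_sketch = DatasetSketch(smiles_language=tok, canonical=True)
print(ds_sketch.tokenizer.canonical)  # False: the dataset kwarg was ignored
```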
The device/dataloder issue should be discussed separately, but at this point, I don't see a huge gain in changing this.
SmilesLanguage (or rather SmilesLanguage.smiles_to_token_indexes) will not return a string, but a List[int] (in the base class at least; it is documented as Union[Indexes, Tensor]).
The inheriting class SMILESTokenizer does not overload this method, but only defines additional transformations of the initial List[int], including ToTensor, which casts it to a tensor using the device keyword.
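That inheritance structure can be sketched roughly as follows (hypothetical, simplified stand-ins; a tuple cast stands in for ToTensor to keep the example stdlib-only): the subclass keeps smiles_to_token_indexes untouched and only appends transforms to transform_encoding.

```python
# Stand-in sketch of base class vs tokenizer subclass.
from typing import Callable


def compose(*fns: Callable) -> Callable:
    def composed(x):
        for fn in fns:
            x = fn(x)
        return x
    return composed


class Base:
    vocab = {"C": 1, "O": 2}
    transform_encoding = staticmethod(lambda x: x)  # identity by default

    def smiles_to_token_indexes(self, smiles: str):
        return self.transform_encoding([self.vocab[t] for t in smiles])


class Tokenizer(Base):
    def __init__(self):
        # in pytoda this chain would end in ToTensor(device=...);
        # tuple is a stand-in cast so the sketch needs no torch
        self.transform_encoding = compose(tuple)


print(Base().smiles_to_token_indexes("CO"))       # [1, 2]
print(Tokenizer().smiles_to_token_indexes("CO"))  # (1, 2)
```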
I agree that the device/dataloader discussion is a larger argument, but here I think we can get rid of any torch dependency:
We remove the device argument from SMILESTokenizer, remove the ToTensor transformation, and document SmilesLanguage.smiles_to_token_indexes to return Indexes.
The cast to Tensor can then be done in the Dataset. This will also allow us to change it in the Dataset later if we decide to.
With this change the dataset is agnostic to the implementation of SmilesLanguage, as intended.
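A sketch of that proposal (hypothetical names; a tuple cast stands in for torch.tensor so the example stays stdlib-only): smiles_to_token_indexes always returns Indexes, and the Dataset alone decides how to cast them.

```python
# Stand-in sketch of the proposed responsibility split.
from typing import List

Indexes = List[int]


class Language:
    vocab = {"C": 1, "O": 2}

    def smiles_to_token_indexes(self, smiles: str) -> Indexes:
        # always returns Indexes, regardless of the concrete subclass
        return [self.vocab[t] for t in smiles]


class TokenDataset:
    def __init__(self, smiles: List[str], language: Language):
        self.smiles, self.language = smiles, language

    def __getitem__(self, idx: int):
        indexes = self.language.smiles_to_token_indexes(self.smiles[idx])
        # the cast now lives here, agnostic to the language implementation;
        # swap tuple for torch.tensor(...) in the real code
        return tuple(indexes)


token_ds = TokenDataset(["CO"], Language())
print(token_ds[0])  # (1, 2)
```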
I agree, this suggestion seems valuable
Actually, building on this I have another idea. It ties in a bit with the discussion we had on #95 about Datasets being static samples vs distributional.
The thing is that distributional items, as currently generated by SmilesTokenizerDataset with augmentation, will not stay distributional if one converts the dataset to an out-of-memory dataset (such as with diskcache or the dedicated data formats I will add soonish, #36).
So any dataset using a smiles_language cannot, as of now, be stored and loaded itself, and all the paired datasets will still have to do the lookup in the library of molecules/omics/proteins (which can be stored and loaded).
So the idea is to move the processing, that is the use of a smiles_language, as far up the hierarchy as possible.
There can still be a SmilesTokenizerDataset that uses a language, but DrugAffinityDatasets would not use that one, only SmilesLanguage. We would also only pass the instance.
This would allow us to store the same samples, and just change the processing/transformations (smiles_language) in the top level between training stages (fit/test).
@jannisborn have a look at Lightning DataModules: https://pytorch-lightning.readthedocs.io/en/stable/datamodules.html#setup
This way we would clean the docs of all datasets of the many language arguments. In the DataModule the smiles_language instantiation has a clear place, can be parametrized, and can even be shared (i.e. define it once in pytoda or your project repo, then import and initialise it in different training scripts).
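A rough stand-in for that idea, without a lightning dependency (names and the stage handling are hypothetical): the language is instantiated in one place, in setup(), and can differ between training stages, while the stored samples stay the same.

```python
# Plain-Python sketch of a DataModule-style setup; not real lightning code.
from typing import List


class SmilesDataModuleSketch:
    def __init__(self, smiles: List[str]):
        self.smiles = smiles      # static samples, storable as-is
        self.language = None      # processing is decided later

    def setup(self, stage: str):
        # one clear place to build/parametrize the language, e.g.
        # augmentation only during fit, canonicalization for test
        self.language = {
            "fit": "augmenting-language",
            "test": "canonical-language",
        }[stage]


dm = SmilesDataModuleSketch(["CO"])
dm.setup("fit")
print(dm.language)  # augmenting-language
```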
Closing this since there is no concrete call to action, and a larger refactor does not seem useful given that this issue has not bothered anyone for a year.
https://github.com/PaccMann/paccmann_datasets/blob/63a6441190f6b5098fecc59d50fc5cf3e1f32de2/pytoda/datasets/drug_affinity_dataset.py#L208