Closed drugilsberg closed 2 years ago
It might depend on the instance of SmilesLanguage that is passed to the SMILESTokenizerDataset: https://github.com/PaccMann/paccmann_datasets/blob/63a6441190f6b5098fecc59d50fc5cf3e1f32de2/pytoda/datasets/smiles_dataset.py#L208
The case where no smiles_language is passed, but one is constructed internally at construction time, is well tested.
However, what SmilesLanguage.smiles_to_token_indexes returns boils down to what SmilesLanguage.transform_encoding returns. In the case of the base SmilesLanguage (vs. SmilesTokenizer), that is currently an identity function, so it will indeed return a list, not a tensor, of token indexes. We could/should fix that default transform_encoding to return a tensor as well.
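To make the current behaviour concrete, here is a minimal stdlib-only sketch (the class is a simplified stand-in, not the actual pytoda implementation): because transform_encoding is an identity in the base class, smiles_to_token_indexes yields a plain list of ints.

```python
# Hypothetical, simplified stand-in for the base SmilesLanguage.
from typing import List


class SmilesLanguageSketch:
    def __init__(self):
        self.vocab = {"C": 1, "O": 2, "N": 3}

    def transform_encoding(self, token_indexes: List[int]) -> List[int]:
        # identity in the base class: a List[int] comes in, a List[int] goes out
        return token_indexes

    def smiles_to_token_indexes(self, smiles: str) -> List[int]:
        indexes = [self.vocab[token] for token in smiles]
        return self.transform_encoding(indexes)


lang = SmilesLanguageSketch()
print(lang.smiles_to_token_indexes("CO"))  # [1, 2] -- a list, not a tensor
```

Fixing the default would mean making transform_encoding cast to a tensor instead of passing the list through.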
But this brings up an issue I wanted to discuss anyway:
There is no requirement or guarantee on the return type of indexing a Dataset (e.g. the snippet above indexes SMILESDataset, which returns a string).
On the other hand, pytoda's Datasets are used with standard DataLoaders, which are pretty generic regarding the input type.
I actually think it might be a bad idea that we cast to tensor and fix the device already at dataset indexing. I think that should be the job of the DataLoader, as exemplified for multi-GPU training by PyTorch Lightning: https://pytorch-lightning.readthedocs.io/en/0.7.2/multi_gpu.html#remove-samplers
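The proposed split of responsibilities can be sketched without any torch dependency (all names here are hypothetical stand-ins, not pytoda API): the dataset returns plain token-index lists, and the batching step, which in real code would be a DataLoader collate_fn, is where casting (and, inside the training loop, device transfer) would happen.

```python
# Stdlib-only sketch: dataset yields raw indexes, collation does the casting.
from typing import List


class IndexDataset:
    def __init__(self, samples: List[List[int]]):
        self.samples = samples

    def __getitem__(self, idx: int) -> List[int]:
        # no tensor cast and no device argument at indexing time
        return self.samples[idx]


def collate(batch: List[List[int]]) -> List[List[int]]:
    # with torch this would pad and call torch.tensor(...); the device
    # transfer would then happen in the training loop / Lightning hooks
    width = max(len(sample) for sample in batch)
    return [sample + [0] * (width - len(sample)) for sample in batch]


ds = IndexDataset([[1, 2], [3]])
print(collate([ds[0], ds[1]]))  # [[1, 2], [3, 0]]
```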
Just seeing this. Basically, the issue is that since 0.2.0 the notion of SMILESLanguage has changed: what used to be the SMILESLanguage is now a SMILESTokenizer. This should be reflected in the documentation of all the dataset classes, since effectively you should pass a SMILESTokenizer. If you pass a SMILESLanguage instead, strings will be returned instead of tensors.
It should also be noted that if a SMILESTokenizer is passed, all the transforms are set in the tokenizer. Hence, in this case all the SMILES config parameters that are passed to e.g. DrugSensitivityDataset (such as canonical=True) are not used. I'm not a huge fan of this setup since it's a bit unintuitive.
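A hedged illustration of that unintuitive behaviour (the classes below are simplified hypothetical stand-ins for the pytoda ones): when a pre-built tokenizer is passed, the dataset-level SMILES kwargs are silently unused because the transform configuration already lives inside the tokenizer.

```python
# Stand-in classes only; not the actual pytoda implementation.
class TokenizerSketch:
    def __init__(self, canonical: bool):
        self.canonical = canonical  # transform config is baked in here


class DatasetSketch:
    def __init__(self, smiles_language=None, canonical: bool = False):
        if smiles_language is not None:
            # kwargs such as canonical are NOT applied in this branch
            self.tokenizer = smiles_language
        else:
            self.tokenizer = TokenizerSketch(canonical=canonical)


tok = TokenizerSketch(canonical=False)
ds_sketch = DatasetSketch(smiles_language=tok, canonical=True)
print(ds_sketch.tokenizer.canonical)  # False: the dataset kwarg was ignored
```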
The device/dataloder issue should be discussed separately, but at this point, I don't see a huge gain in changing this.
SmilesLanguage (or rather SmilesLanguage.smiles_to_token_indexes) will not return a string, but a List[int] (in the base class at least; it is documented as Union[Indexes, Tensor]).
The inheriting class SMILESTokenizer does not overload this method, but only defines additional transformations of the initial List[int], including ToTensor, which casts it to a tensor using the device keyword.
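That inheritance structure can be sketched roughly as follows (hypothetical, simplified stand-ins; a tuple cast stands in for ToTensor to keep the example stdlib-only): the subclass keeps smiles_to_token_indexes untouched and only appends transforms to transform_encoding.

```python
# Stand-in sketch of base class vs tokenizer subclass.
from typing import Callable


def compose(*fns: Callable) -> Callable:
    def composed(x):
        for fn in fns:
            x = fn(x)
        return x
    return composed


class Base:
    vocab = {"C": 1, "O": 2}
    transform_encoding = staticmethod(lambda x: x)  # identity by default

    def smiles_to_token_indexes(self, smiles: str):
        return self.transform_encoding([self.vocab[t] for t in smiles])


class Tokenizer(Base):
    def __init__(self):
        # in pytoda this chain would end in ToTensor(device=...);
        # tuple is a stand-in cast so the sketch needs no torch
        self.transform_encoding = compose(tuple)


print(Base().smiles_to_token_indexes("CO"))       # [1, 2]
print(Tokenizer().smiles_to_token_indexes("CO"))  # (1, 2)
```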
I agree that the device/dataloader discussion is a larger argument, but here I think we can get rid of any torch dependency:
We remove the device argument from SMILESTokenizer, remove the ToTensor transformation, and document SmilesLanguage.smiles_to_token_indexes to return Indexes.
The cast to Tensor can then be done in the Dataset. This will also allow us to change it in the Dataset later if we decide to.
With this change the dataset is agnostic to the implementation of SmilesLanguage, as intended.
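A sketch of that proposal (hypothetical names; a tuple cast stands in for torch.tensor so the example stays stdlib-only): smiles_to_token_indexes always returns Indexes, and the Dataset alone decides how to cast them.

```python
# Stand-in sketch of the proposed responsibility split.
from typing import List

Indexes = List[int]


class Language:
    vocab = {"C": 1, "O": 2}

    def smiles_to_token_indexes(self, smiles: str) -> Indexes:
        # always returns Indexes, regardless of the concrete subclass
        return [self.vocab[t] for t in smiles]


class TokenDataset:
    def __init__(self, smiles: List[str], language: Language):
        self.smiles, self.language = smiles, language

    def __getitem__(self, idx: int):
        indexes = self.language.smiles_to_token_indexes(self.smiles[idx])
        # the cast now lives here, agnostic to the language implementation;
        # swap tuple for torch.tensor(...) in the real code
        return tuple(indexes)


token_ds = TokenDataset(["CO"], Language())
print(token_ds[0])  # (1, 2)
```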
I agree, this suggestion seems valuable
Actually, building on this I have another idea. It ties in a bit with the discussion we had on #95 about Datasets being static samples vs distributional.
The thing is that distributional items, as currently generated by SmilesTokenizerDataset with augmentation, will not stay distributional if one converts the dataset to an out-of-memory dataset (such as with diskcache or the dedicated data formats I will add soonish, #36).
So any dataset using a smiles_language cannot, as of now, be stored and loaded itself, and all the paired datasets will still have to do the lookup in the library of molecules/omics/proteins (which can be stored and loaded).
So the idea is to move the processing, that is the use of a smiles_language, as far up the hierarchy as possible.
There can still be a SmilesTokenizerDataset that uses a language, but DrugAffinityDatasets would not use that one, only SmilesLanguage. We would also only pass the instance.
This would allow us to store the same samples, and just change the processing/transformations (smiles_language) in the top level between training stages (fit/test).
@jannisborn have a look at Lightning DataModules: https://pytorch-lightning.readthedocs.io/en/stable/datamodules.html#setup
This way we would clean the docs of all datasets of the many language arguments. In the DataModule the smiles_language instantiation has a clear place, can be parametrized, and can even be shared (i.e. define it once in pytoda or your project repo, then import and initialise it in different training scripts).
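A rough stand-in for that idea, without a lightning dependency (names and the stage handling are hypothetical): the language is instantiated in one place, in setup(), and can differ between training stages, while the stored samples stay the same.

```python
# Plain-Python sketch of a DataModule-style setup; not real lightning code.
from typing import List


class SmilesDataModuleSketch:
    def __init__(self, smiles: List[str]):
        self.smiles = smiles      # static samples, storable as-is
        self.language = None      # processing is decided later

    def setup(self, stage: str):
        # one clear place to build/parametrize the language, e.g.
        # augmentation only during fit, canonicalization for test
        self.language = {
            "fit": "augmenting-language",
            "test": "canonical-language",
        }[stage]


dm = SmilesDataModuleSketch(["CO"])
dm.setup("fit")
print(dm.language)  # augmenting-language
```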
Closing this since there is no concrete call to action, and a larger refactor does not seem useful given that this issue has not bothered anyone for a year.
https://github.com/PaccMann/paccmann_datasets/blob/63a6441190f6b5098fecc59d50fc5cf3e1f32de2/pytoda/datasets/drug_affinity_dataset.py#L208