tokenizer issue with atom resolution

cgoliver commented 1 year ago

To reproduce:


>>> from proteinshake.datasets import TMAlignDataset
>>> da = TMAlignDataset(root='tm')
Downloading tmalign.json.gz:
100%|█████████████████████████████████████████████████████████████| 0.02k/0.02k [00:00<00:00, 3.82MiB/s]
Unzipping...
>>> da.to_point(resolution='atom').torch()
Downloading TMAlignDataset.atom.avro.gz:
100%|█████████████████████████████████████████████████████████████| 0.36k/0.36k [00:00<00:00, 8.01MiB/s]
Unzipping...
Converting:   0%|                                                               | 0/200 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/carlosoliver/Projects/proteinshake/proteinshake/representations/point.py", line 50, in torch
    return TorchPointDataset(self.points, self.size, self.path+'.torch', *args, **kwargs)
  File "/home/carlosoliver/Projects/proteinshake/proteinshake/frameworks/dataset.py", line 33, in __init__
    for data_item in tqdm(data_list, desc='Converting', total=size):
  File "/home/carlosoliver/Projects/proteinshake/.venv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/carlosoliver/Projects/proteinshake/proteinshake/representations/point.py", line 45, in <genexpr>
    self.points = (Point(protein) for protein in proteins)
  File "/home/carlosoliver/Projects/proteinshake/proteinshake/representations/point.py", line 21, in __init__
    labels = tokenize(protein[resolution][f'{resolution}_type'], resolution=resolution)
  File "/home/carlosoliver/Projects/proteinshake/proteinshake/utils/embeddings.py", line 52, in tokenize
    return np.array([atom_alphabet.index(aa[0]) for aa in sequence])
  File "/home/carlosoliver/Projects/proteinshake/proteinshake/utils/embeddings.py", line 52, in <listcomp>
    return np.array([atom_alphabet.index(aa[0]) for aa in sequence])
ValueError: substring not found

Changing resolution to `residue` avoids this error.
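
For context, "substring not found" is the message `str.index` raises, which suggests `atom_alphabet` is a plain string and some atom type starts with a character that is not in it. Here is a minimal sketch of that failure mode. The alphabet `"NCOS"` and the helper name `tokenize_atoms` are made up for illustration; the real `atom_alphabet` is not shown in this thread.

```python
# Hypothetical alphabet for illustration only; the real `atom_alphabet`
# in proteinshake/utils/embeddings.py is not shown in this thread.
atom_alphabet = "NCOS"


def tokenize_atoms(atom_types):
    # Same expression as in the traceback: look up the first character
    # of each atom type in the alphabet string.
    return [atom_alphabet.index(atom[0]) for atom in atom_types]


# Atom types whose first character is in the alphabet tokenize fine.
print(tokenize_atoms(["N", "CA", "C", "O"]))

# Any atom type starting with a character outside the alphabet makes
# str.index raise ValueError, as in the traceback above.
try:
    tokenize_atoms(["N", "CA", "H"])
except ValueError as err:
    print(f"ValueError: {err}")
```

Since the `residue` resolution presumably goes through a different alphabet, it would not hit this lookup.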

cgoliver commented 1 year ago

@timkucera fixed?

timkucera commented 1 year ago

Should be fixed with 9b07e5d6d0f6da89c61812a89693955df457d41b. Does the bug still occur?

timkucera commented 1 year ago

Fixed with a89095928aeb18e9efb321a3811b0872a034c3c6. The cause was H (hydrogen) atoms, which we didn't tokenize. I thought we had removed them, so we might want to revisit this.
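
For anyone revisiting this, there are two possible ways to handle hydrogens: filter them out before tokenizing, or reserve an extra index for atom types that are not in the alphabet. The sketch below shows both. It is not the content of the commit above. The alphabet and the names `drop_hydrogens` and `tokenize_atoms` are hypothetical, and it returns plain lists where the original wraps the result in `np.array`.

```python
# Hypothetical alphabet; the real one lives in proteinshake/utils/embeddings.py.
ATOM_ALPHABET = "NCOS"


def drop_hydrogens(atom_types):
    """Option 1: remove hydrogens before tokenizing.

    Simplification: treats every atom name starting with 'H' as hydrogen,
    which misses legacy names with a leading digit such as '1HB'.
    """
    return [atom for atom in atom_types if not atom.startswith("H")]


def tokenize_atoms(atom_types, alphabet=ATOM_ALPHABET):
    """Option 2: map unknown atom types to an extra index instead of raising.

    str.find returns -1 when the character is missing, which is replaced
    here by len(alphabet) as a dedicated "unknown" token.
    """
    unknown = len(alphabet)
    tokens = (alphabet.find(atom[0]) for atom in atom_types)
    return [token if token >= 0 else unknown for token in tokens]


atoms = ["N", "CA", "H", "HA", "C", "O"]
print(tokenize_atoms(drop_hydrogens(atoms)))  # hydrogens removed first
print(tokenize_atoms(atoms))                  # hydrogens kept as "unknown"
```

Filtering keeps the label space unchanged but makes the point cloud smaller than the raw atom list. The extra index keeps every atom at the cost of one more class.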

BorgwardtLab / proteinshake

tokenizer issue with atom resolution #126