Ramprasad-Group / polygnn

polyGNN is a Python library to automate ML model training for polymer informatics.

Smiles Key Error #12

Closed oliverhvidsten closed 1 year ago

oliverhvidsten commented 1 year ago

Do you have any ideas? I have been reading through the code a little to try and figure out why this may be happening.

Traceback (most recent call last):
  File "GNN_CV_training.py", line 434, in <module>
    random_seed=RANDOM_SEED,
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 373, in train_kfold_ensemble
    model, train_pts, val_pts, scaler_dict, train_config, break_bad_grads=False
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 182, in train_submodel
    train_pts = tc.get_train_dataloader()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 368, in training_df
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 345, in cv_get_train_dataloader
    return training_df.apply(get_data_augmented, axis=1).values.tolist()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/frame.py", line 7552, in apply
    return op.get_result()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 185, in get_result
    return self.apply_standard()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 276, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 305, in apply_series_generator
    results[i] = self.f(v)
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 331, in get_data_augmented
    data = augmented_featurizer(x.smiles_string)
  File "GNN_CV_training.py", line 270, in <lambda>
    augmented_featurizer = lambda x: random.sample(eq_graph_tensors[x], k=1)[0]
KeyError: '[]C(C)(C(=O)OCCCS(=O)(=O)[N-]S(=O)(=O)C(F)(F)F)COCCOCCOCCOCCOCCOCCOCCOCCOCCOCC[].[Li+]'

rishigurnani commented 1 year ago

My guess is that the SMILES string is not being stored in eq_graph_tensors. What does that look like? If you are able to share a reproducible example that does not contain sensitive information, that would also help.

oliverhvidsten commented 1 year ago

After debugging a little, it looks like the SMILES string that caused the error is the first value in the test set. Only training-set data points appear in eq_graph_tensors. Looking into that more now.

rishigurnani commented 1 year ago

That's right. eq_graph_tensors stores the data needed for augmentation. Augmentation is used only during training, to improve model accuracy, so it is not intended for use with test polymers. Is this clear?
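The lookup shown in the traceback can be reproduced with a toy dictionary. This is a minimal sketch, not polyGNN code: the SMILES strings are made up, and the "graphs" are placeholder strings standing in for whatever eq_graph_tensors really holds. It only illustrates why a SMILES string that was never added to the dictionary raises KeyError, and how to list such strings before training.

```python
import random

# Hypothetical stand-in for eq_graph_tensors: each training SMILES string
# maps to a list of equivalent graph representations (placeholder strings here).
eq_graph_tensors = {
    "[*]CC[*]": ["graph_a", "graph_b"],
    "[*]CCO[*]": ["graph_c"],
}

# Same shape as the lambda in the traceback: pick one equivalent graph at random.
augmented_featurizer = lambda x: random.sample(eq_graph_tensors[x], k=1)[0]

# Made-up inputs: two training polymers plus one test polymer that leaked in.
candidate_smiles = ["[*]CC[*]", "[*]CCO[*]", "[*]CCN[*]"]

# List every SMILES string that has no augmentation entry.
missing = [s for s in candidate_smiles if s not in eq_graph_tensors]
print("Missing from eq_graph_tensors:", missing)

# Featurizing a missing string fails the same way as in the traceback.
try:
    augmented_featurizer("[*]CCN[*]")
except KeyError as err:
    print("KeyError:", err)
```

Running a check like `missing` on the dataframe that feeds the training dataloader shows directly whether test rows have ended up in the training split.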

oliverhvidsten commented 1 year ago

Yes, it is. I am currently trying to figure out why the code is accessing test points.

oliverhvidsten commented 1 year ago

I was able to resolve the issue. The code in the 'prepare data' section was not properly re-splitting my train and test dataframes after combining them for pt.prepare.prepare_train(). I wrote new code and it now works properly.

rishigurnani commented 1 year ago

Ok. I'm still not exactly sure what the issue is, but if you share the fix, I can look it over and potentially merge it in.

oliverhvidsten commented 1 year ago

I am reading in separate train and test files, so it's entirely possible that is the reason I got an error.

Code I added:

group_train_data['traintest'] = 'train'
group_test_data['traintest'] = 'test'
group_data = pd.concat([group_train_data, group_test_data], ignore_index=False)
group_data, scaler_dict = pt.prepare.prepare_train(
    group_data, smiles_featurizer=smiles_featurizer, root_dir=root_dir
)
print([(k, str(v)) for k, v in scaler_dict.items()])
group_train_data = group_data.loc[group_data['traintest'] == 'train', :]
group_test_data = group_data.loc[group_data['traintest'] == 'test', :]


Code I removed:

group_train_inds = group_train_data.index.values.tolist()
group_test_inds = group_test_data.index.values.tolist()
group_data = pd.concat([group_train_data, group_test_data], ignore_index=False)
group_data, scaler_dict = pt.prepare.prepare_train(
    group_data, smiles_featurizer=smiles_featurizer, root_dir=root_dir
)
print([(k, str(v)) for k, v in scaler_dict.items()])
group_train_data = group_data.loc[group_train_inds, :]
group_test_data = group_data.loc[group_test_inds, :]

Before I changed the code, the train set had 300 rows and the test set had 48 going into this section; coming out, the train set had 348 rows and test points appeared in it. After the change, the train and test sets keep the same lengths before and after passing through this section of code.
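One plausible explanation for the jump from 300 to 348 rows, assuming both files were read with pandas' default 0-based integer index: after pd.concat with ignore_index=False, the index labels 0 to 47 each occur twice, and .loc with a list of labels returns every row that carries a listed label. The sketch below uses small made-up frames (3 train rows, 2 test rows) and leaves out pt.prepare.prepare_train entirely; it only demonstrates the indexing behaviour and the flag-column split.

```python
import pandas as pd

# Stand-ins for two separately loaded files; each gets the default index 0..n-1.
train = pd.DataFrame({"smiles_string": ["A", "B", "C"]})
test = pd.DataFrame({"smiles_string": ["X", "Y"]})

train_inds = train.index.values.tolist()  # [0, 1, 2]

# Keeping the original labels yields the index [0, 1, 2, 0, 1].
combined = pd.concat([train, test], ignore_index=False)
print(combined.index.is_unique)  # False: labels 0 and 1 appear twice

# .loc returns every row whose label is in the list, so test rows leak in.
leaky_train = combined.loc[train_inds, :]
print(len(leaky_train))  # 5 rows instead of 3

# Flag-column split: membership is stored in the data, not in the index.
train["traintest"] = "train"
test["traintest"] = "test"
flagged = pd.concat([train, test], ignore_index=False)
clean_train = flagged.loc[flagged["traintest"] == "train", :]
clean_test = flagged.loc[flagged["traintest"] == "test", :]
print(len(clean_train), len(clean_test))  # 3 2
```

With 300 train rows and 48 test rows, the same overlap would give 300 + 48 = 348 rows in the training split, which matches the numbers reported above. Checking `index.is_unique` on the combined frame is a quick way to confirm or rule this out.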

rishigurnani commented 1 year ago

Thanks for sharing. That is strange! I thought that using .loc instead of .iloc would prevent that issue from occurring ...

See below. .iloc gives the wrong output but .loc gives the right one.

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Jane', 'Alex', 'Emily', 'Michael'],
    'Age': [25, 30, 20, 35, 28],
    'City': ['New York', 'Paris', 'London', 'London', 'Tokyo']
}

data = pd.DataFrame(data)

# Setting custom index labels
data.index = [4, 3, 2, 1, 0]

# Train/test split
train = data.loc[data["City"] == "London"]
test = data.drop(index=train.index)

# Get indices
train_idx = train.index.values.tolist()
test_idx = test.index.values.tolist()

# Concat data
data = pd.concat([train, test], ignore_index=False)

# Using loc to select data
print("Using loc:")
print(data.loc[train_idx, :])

print()

# Using iloc to select data
print("Using iloc:")
print(data.iloc[train_idx, :])

The output is

Using loc:
    Name  Age    City
2   Alex   20  London
1  Emily   35  London

Using iloc:
    Name  Age      City
4   John   25  New York
1  Emily   35    London

rishigurnani commented 1 year ago

Closing this since I do not have a reproducible example to work off of. Feel free to reopen if you still have questions.