oliverhvidsten closed this issue 1 year ago.
My guess is that the SMILES string is not being stored in eq_graph_tensors. What does that look like? If you are able to share a reproducible example that does not contain sensitive information, that would be additionally helpful.
After debugging a little, it seems the SMILES that caused the error is the first value in the test set. Only train-set datapoints appear in eq_graph_tensors. Looking more into that now.
That's right. eq_graph_tensors stores the data needed for augmentation. Augmentation is a strategy used only to improve model accuracy during training, so it is not intended for use with test polymers. Is this clear?
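As a rough sketch (with placeholder data, since eq_graph_tensors itself isn't shown in this thread), the augmentation cache behaves like a dict keyed by training SMILES, so looking up any test-set SMILES raises exactly this kind of KeyError:

```python
import random

# Hypothetical stand-in for eq_graph_tensors: maps each *training* SMILES
# string to a list of pre-computed augmented graph tensors.
eq_graph_tensors = {
    "[*]CC[*]": ["graph_aug_1", "graph_aug_2"],
    "[*]CC(C)[*]": ["graph_aug_3"],
}

# Same shape as the featurizer in GNN_CV_training.py: sample one augmentation.
augmented_featurizer = lambda x: random.sample(eq_graph_tensors[x], k=1)[0]

print(augmented_featurizer("[*]CC[*]"))  # works: training polymer is cached

try:
    augmented_featurizer("[*]C(F)(F)[*]")  # test polymer was never cached
except KeyError as exc:
    print("KeyError for test polymer:", exc)
```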
Yes, it is. Currently trying to figure out why it is trying to access test points.
I was able to resolve the issue. The code in the 'prepare data' section was not properly re-splitting my train and test dataframes after combining them for pt.prepare.prepare_train(). I wrote new code and it now works properly.
Ok. I'm still not exactly sure what the issue is, but if you share the fix, I can look it over and potentially merge it in.
I am reading separate train and test files in, so it's entirely possible that is the reason I got the error.
Code I added:
group_train_data['traintest'] = 'train'
group_test_data['traintest'] = 'test'
group_data = pd.concat([group_train_data, group_test_data], ignore_index=False)
group_data, scaler_dict = pt.prepare.prepare_train(
    group_data, smiles_featurizer=smiles_featurizer, root_dir=root_dir
)
print([(k, str(v)) for k, v in scaler_dict.items()])
group_train_data = group_data.loc[group_data['traintest'] == 'train', :]
group_test_data = group_data.loc[group_data['traintest'] == 'test', :]
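A minimal, self-contained check of this pattern (with made-up data; the group_* names above are stand-ins) shows that re-splitting on the flag column restores the original sizes even when the two frames happen to share index labels:

```python
import pandas as pd

# Two frames read from separate files both start with a default index 0..N-1.
group_train_data = pd.DataFrame({"smiles": ["A", "B", "C"]})
group_test_data = pd.DataFrame({"smiles": ["X", "Y"]})

group_train_data["traintest"] = "train"
group_test_data["traintest"] = "test"
group_data = pd.concat([group_train_data, group_test_data], ignore_index=False)

# Re-split on the flag rather than on (possibly duplicated) index labels.
new_train = group_data.loc[group_data["traintest"] == "train", :]
new_test = group_data.loc[group_data["traintest"] == "test", :]
print(len(new_train), len(new_test))  # 3 2
```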
Code I removed:
group_train_inds = group_train_data.index.values.tolist()
group_test_inds = group_test_data.index.values.tolist()
group_data = pd.concat([group_train_data, group_test_data], ignore_index=False)
group_data, scaler_dict = pt.prepare.prepare_train(
    group_data, smiles_featurizer=smiles_featurizer, root_dir=root_dir
)
print([(k, str(v)) for k, v in scaler_dict.items()])
group_train_data = group_data.loc[group_train_inds, :]
group_test_data = group_data.loc[group_test_inds, :]
Before I changed the code, my train size was 300 and my test size was 48 going into this section; afterward, my train size was 348 and test points appeared in the train set. After making the changes, my train set and test set stay the same length before and after passing through this section of code.
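One plausible explanation (an assumption, since the original input files aren't shown): reading train and test from separate files gives each frame its own default RangeIndex starting at 0, so their labels overlap. After pd.concat(..., ignore_index=False) the combined frame carries duplicate labels, and .loc[train_inds] then matches rows from both frames, which would inflate 300 train rows plus 48 overlapping test labels to exactly 348:

```python
import pandas as pd

# Separately-read files each get a default index starting at 0.
train = pd.DataFrame({"smiles": ["A", "B", "C"]})  # labels 0, 1, 2
test = pd.DataFrame({"smiles": ["X", "Y"]})        # labels 0, 1

train_inds = train.index.values.tolist()           # [0, 1, 2]
combined = pd.concat([train, test], ignore_index=False)

# .loc returns *every* row carrying each requested label, so the
# duplicated labels 0 and 1 pull in the test rows as well.
resplit_train = combined.loc[train_inds, :]
print(len(resplit_train))  # 5 rows instead of 3
```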
Thanks for sharing. That is strange! I thought that using .loc instead of .iloc would prevent that issue from occurring ...
See below: .iloc gives the wrong output but .loc gives the right one.
import pandas as pd
# Create a sample DataFrame
data = {
    'Name': ['John', 'Jane', 'Alex', 'Emily', 'Michael'],
    'Age': [25, 30, 20, 35, 28],
    'City': ['New York', 'Paris', 'London', 'London', 'Tokyo']
}
data = pd.DataFrame(data)
# Setting custom index labels
data.index = [4, 3, 2, 1, 0]
# Train/test split
train = data.loc[data["City"] == "London"]
test = data.drop(index=train.index)
# Get indices
train_idx = train.index.values.tolist()
test_idx = test.index.values.tolist()
# Concat data
data = pd.concat([train, test], ignore_index=False)
# Using loc to select data
print("Using loc:")
print(data.loc[train_idx, :])
print()
# Using iloc to select data
print("Using iloc:")
print(data.iloc[train_idx, :])
The output is:
Using loc:
Name Age City
2 Alex 20 London
1 Emily 35 London
Using iloc:
Name Age City
4 John 25 New York
1 Emily 35 London
Closing this since I do not have a reproducible example to work off of. Feel free to reopen if you still have questions.
Do you have any ideas? I have been reading through the code a little to try and figure out why this may be happening.
Traceback (most recent call last):
  File "GNN_CV_training.py", line 434, in
    random_seed=RANDOM_SEED,
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 373, in train_kfold_ensemble
    model, train_pts, val_pts, scaler_dict, train_config, break_bad_grads=False
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 182, in train_submodel
    train_pts = tc.get_train_dataloader()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 368, in
    training_df
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 345, in cv_get_train_dataloader
    return training_df.apply(get_data_augmented, axis=1).values.tolist()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/frame.py", line 7552, in apply
    return op.get_result()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 185, in get_result
    return self.apply_standard()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 276, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/pandas/core/apply.py", line 305, in apply_series_generator
    results[i] = self.f(v)
  File "/home/oliver/.cache/pypoetry/virtualenvs/polygnn-5wmT02iB-py3.7/lib/python3.7/site-packages/polygnn_trainer/train.py", line 331, in get_data_augmented
    data = augmented_featurizer(x.smiles_string)
  File "GNN_CV_training.py", line 270, in
    augmented_featurizer = lambda x: random.sample(eq_graph_tensors[x], k=1)[0]
KeyError: '[]C(C)(C(=O)OCCCS(=O)(=O)[N-]S(=O)(=O)C(F)(F)F)COCCOCCOCCOCCOCCOCCOCCOCCOCCOCC[].[Li+]'