daenuprobst / molzip

The gzip classification method implemented for molecule classification.
MIT License

Added a new QMOF dataset #17

Closed RishikeshMagar closed 1 year ago

RishikeshMagar commented 1 year ago

Hi Daniel, I have added the QMOF dataset (7467 data points). I wanted to try out the code on crystalline materials, and the results are reasonable compared to the baseline models. I could run additional experiments with the hMOF dataset, which has over 100k entries, and add that too.

Also, I am not sure about this part of the regression code:

` task_preds = []
for vals, dists in zip(np.array(top_k_values).T, np.array(top_k_dists).T):
    dists = 1 - dists
    task_preds.append(np.mean(vals * dists) / np.sum(dists))`

I printed the shapes of the arrays in the code and found that, for MoleculeNet regression datasets such as Delaney and FreeSolv, the loop variables have the shapes `vals` (25,) and `dists` (). This is because the arrays passed to `zip(np.array(top_k_values).T, np.array(top_k_dists).T)` have the shapes (25,) for `top_k_dists` and (1, 25, 1) for `top_k_values`. Based on my understanding of `zip`, the loop uses only one distance value (the first entry) instead of the 25 distances that we intend to consider. I will have to dig more and check, but I am not entirely sure the regression code is working as intended.
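To make the `zip` behaviour concrete, here is a minimal sketch. The array shapes are assumptions chosen only to reproduce the loop shapes reported above (`vals` (25,), `dists` ()), and the data is random rather than taken from the repository. It suggests that when `dists` is a scalar, the weight cancels and the loop's prediction reduces to the plain mean of the neighbour values.

```python
import numpy as np

k = 25
rng = np.random.default_rng(0)

# Assumed layout (not taken from the repository): k neighbours x 1 task,
# and one distance per neighbour.
top_k_values = rng.normal(size=(k, 1))
top_k_dists = rng.uniform(0.1, 0.9, size=k)

# zip stops at the shorter iterable: values.T has a single row (one task),
# so only one (vals, dists) pair is produced.
pairs = list(zip(np.array(top_k_values).T, np.array(top_k_dists).T))
n_pairs = len(pairs)
vals, dists = pairs[0]
vals_shape, dists_shape = vals.shape, np.shape(dists)

# Same arithmetic as the loop body in question.
weight = 1 - dists
loop_pred = np.mean(vals * weight) / np.sum(weight)

# With a scalar weight, mean(vals * w) / w == mean(vals).
plain_mean = np.mean(vals)

print(n_pairs, vals_shape, dists_shape)
print(np.isclose(loop_pred, plain_mean))
```

Under these assumed shapes, only `top_k_dists[0]` ever enters the computation, and even that value has no effect on the result.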

In my opinion, it should just be an element-wise multiplication between the values and the distances (inverted as 1 - dist). Below is the code that I have for doing regression. Please correct me if I am wrong.

` task_preds = []

top_k_dists_array = np.array(top_k_dists).T
top_k_values_array = np.array(top_k_values).T

dists = 1 - top_k_dists_array  ## weighted distances
weighted_values = top_k_values_array * dists  # element-wise multiplication
task_preds = np.sum(weighted_values) / np.sum(dists)

# print("task_pred", task_preds)
return task_preds`
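For comparison, here is a hedged sketch of the same idea written as a per-task weighted average. The function name `weighted_knn_predict` and the assumed shapes (`top_k_values` as k neighbours by n tasks, `top_k_dists` as one distance per neighbour) are illustrative and not part of the repository. Unlike a bare `np.sum` over the whole array, summing along the neighbour axis keeps one prediction per task.

```python
import numpy as np


def weighted_knn_predict(top_k_values, top_k_dists):
    """Similarity-weighted mean of neighbour values, one prediction per task.

    Hypothetical helper. Assumes top_k_values has shape (k, n_tasks) and
    top_k_dists has shape (k,), with distances in [0, 1].
    """
    values = np.asarray(top_k_values, dtype=float)
    weights = 1.0 - np.asarray(top_k_dists, dtype=float)  # closer => larger weight
    # Broadcast the weights over tasks, then sum over the neighbour axis only.
    return (values * weights[:, None]).sum(axis=0) / weights.sum()


# Small worked example: three neighbours, one task.
values = np.array([[1.0], [2.0], [4.0]])
dists = np.array([0.5, 0.5, 0.0])

pred = weighted_knn_predict(values, dists)
# weights = [0.5, 0.5, 1.0]; (0.5 + 1.0 + 4.0) / 2.0 = 2.75
print(pred)

# np.average computes the same weighted mean along axis 0.
reference = np.average(values, axis=0, weights=1.0 - dists)
```

This sketch does not guard against the case where every distance equals 1, in which the weights sum to zero.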

Because I wasn't entirely sure about gzip_regressor.py, I have created a separate script, main_mat.py, which calls gzip_mat_regressor.py to do the regression. The MOFloader function can be easily integrated into your original code.