Getting Involved and Commenting

PowersPope commented 1 year ago

Hey Daniel,

I added some comments and docstrings to the functions in gzip_regressor.py to help people (mainly myself) understand what is happening, and thought I would at least upload them. I plan to go through everything else and write out comments/docstrings as well.

I think this looks like a cool project and would love to help contribute. I am a PhD student (finishing up my first year) at the University of Oregon. I work on ML and computational methods for macrocyclic peptide design. I'm still relatively new to the ML space, so I would love to contribute in any way that I can!

My biggest confusion with this repo is where exactly the data is being pulled from. I see that main can run without any data in the repo, so I am guessing it is being pulled from some hosting site? I am having trouble finding where that happens within main, though.

Looking forward to helping out!

janweinreich commented 1 year ago

Hey, I am not the owner of the repo, but I can answer the second part of your question. Have a look at https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html

It is used in https://github.com/daenuprobst/molzip/blob/6247efd8a02d84621290141944fce88126c5f91a/main.py#L137 for the benchmark tests :) This is nice because you just have to pip install deepchem, and you avoid what causes 90% of the problems in ML: data preprocessing.
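
As a rough sketch of what pulling a benchmark through that API can look like (this is not the code from main.py; the helper name `load_smiles_and_targets` and the choice of the Delaney/ESOL dataset are assumptions made purely for illustration):

```python
def load_smiles_and_targets(splitter: str = "scaffold"):
    """Hypothetical helper: fetch a MoleculeNet dataset via deepchem.

    Returns the task names plus (SMILES, targets) pairs for the
    training and test splits.
    """
    # Imported lazily so this file can be imported without deepchem installed.
    import deepchem as dc

    # MoleculeNet loaders download and cache the data on first use and
    # return (tasks, (train, valid, test), transformers).
    tasks, (train, valid, test), transformers = dc.molnet.load_delaney(
        featurizer="ECFP", splitter=splitter
    )

    # Assumption: for MoleculeNet datasets the `ids` array holds the SMILES
    # strings, which is what a compression-based model would consume.
    # Note that the loader normalizes `y` by default (see `transformers`).
    return tasks, (train.ids, train.y), (test.ids, test.y)


# Example call (needs deepchem installed and network access on first run):
# tasks, (train_smiles, train_y), (test_smiles, test_y) = load_smiles_and_targets()
```

Nothing has to be stored in the repo itself, which would explain why main runs without any local data.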

PowersPope commented 1 year ago

Hey, thanks for the explanation! I knew it must be pulling from some sort of API, but I couldn't figure out which one. That's super clever. Thanks for the info! :)

PowersPope commented 1 year ago

I was looking for something to try out and figured I would give Lasso regression a try and add in some extra features. I didn't know if you wanted to include any other features besides just NCD. If you don’t, then we don’t need to incorporate the onehot function at all. It was more for my own curiosity.

Overall, it seemed like Lasso regression was not helpful. The model did worse with both approaches that I tried. I haven’t used Lasso regression before, so it is possible that I incorporated it incorrectly.

Two approaches that I used:

1. I incorporated a non-kNN Lasso regression and a kNN Lasso regression. However, I only included the scores from the non-kNN Lasso regression. Both of them were similar and not very good.
2. I added a one-hot vector to the kNN regression task. I am pretty sure this is not good practice, but I still wanted to see what would happen. It actually did better than approach 1. However, it still wasn’t very good.
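
For anyone following along, here is a minimal sketch of the NCD-plus-kNN baseline these approaches are being compared against. It is not the code from gzip_regressor.py; the function names, the space used as a separator, and the plain (unweighted) average over neighbours are all assumptions for illustration.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    NCD(a, b) = (C(ab) - min(C(a), C(b))) / max(C(a), C(b)),
    where C is the compressed length. Joining with a space is an
    assumption of this sketch.
    """
    ca, cb = compressed_len(a), compressed_len(b)
    cab = compressed_len(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query: str, train_x: list, train_y: list, k: int = 3) -> float:
    """Predict a target as the mean of the k training points nearest in NCD."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort (distance, target) pairs so the closest strings come first.
    neighbours = sorted((ncd(query, x), y) for x, y in zip(train_x, train_y))
    nearest = neighbours[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Toy usage with made-up SMILES strings and made-up target values.
train_smiles = ["CCO", "CCCO", "c1ccccc1"]
train_targets = [1.0, 2.0, 3.0]
prediction = knn_predict("CCCCO", train_smiles, train_targets, k=3)
```

Extra features such as a one-hot vector would have to be combined with these distances somehow, and how to do that is exactly the open design question here.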

All my functions are in gzip_lasso_regressor.py. I also included a LASSO_REGRESSION.md file with the results of the two tests. I have a couple of thoughts on other approaches that could work better than this, but I would love to hear which direction you think this project should go.

I’m looking forward to helping. Let me know if you think I should fix/change anything.

daenuprobst / molzip

Getting Involved and Commenting #8