Week 1 - Get to know the community

[x] Join the communication channels
[x] Open a GitHub issue (this one!)
[x] Install the Ersilia Model Hub and test the simplest model
[x] Install Docker if needed, and test another model
[x] Write a motivation statement to work at Ersilia
[x] Submit your first contribution to the Outreachy site

Week 2 - Get Familiar with Machine Learning for Chemistry

[x] Select a model from the list suggested in GitBook
[x] Download and serve the model via the Ersilia Model Hub to ensure it works
[x] Open a repository on your GitHub user with all the necessary files
[x] Select and clean a dataset of 1000 molecules (example notebook 1)
[x] Run predictions for the molecules on the selected model and evaluate the results

Week 3 - Validate a Model in the Wild

[x] Find a suitable dataset with sufficient experimental results
[x] Clean and standardize the dataset
[x] Run predictions and calculate metrics.

Week 4 - Prepare your final application

[x] Submit the final application in the Outreachy website

Week 1 - Get to know the community

Explored the Ersilia-os website and was amazed by the caliber of minds working there.
Joined Slack community.
Opened this issue.
Studied the provided HANDBOOK as a contribution guide diligently.
Followed the instructions by installing Ubuntu with wsl --install, then setting up Ersilia and its dependencies.

Set up the environment and activated ersilia.


# create a conda environment
conda create -n ersilia python=3.7
# activate the environment
conda activate ersilia
Retrieving notices: ...working... done
WARNING: A conda environment already exists at '/home/taliqamuhib/miniconda3/envs/ersilia'
Remove existing environment (y/[n])? y

Channels:

defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

Package Plan

environment location: /home/taliqamuhib/miniconda3/envs/ersilia

added / updated specs:

python=3.7

The following NEW packages will be INSTALLED:

_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
ca-certificates pkgs/main/linux-64::ca-certificates-2023.12.12-h06a4308_0
certifi pkgs/main/linux-64::certifi-2022.12.7-py37h06a4308_0
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.38-h1181459_1
libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_0
libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
openssl pkgs/main/linux-64::openssl-1.1.1w-h7f8727e_0
pip pkgs/main/linux-64::pip-22.3.1-py37h06a4308_0
python pkgs/main/linux-64::python-3.7.16-h7a1cb2a_0
readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
setuptools pkgs/main/linux-64::setuptools-65.6.3-py37h06a4308_0
sqlite pkgs/main/linux-64::sqlite-3.41.2-h5eee18b_0
tk pkgs/main/linux-64::tk-8.6.12-h1ccaba5_0
wheel pkgs/main/linux-64::wheel-0.38.4-py37h06a4308_0
xz pkgs/main/linux-64::xz-5.4.6-h5eee18b_0
zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_0

Proceed ([y]/n)? y

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

To activate this environment, use

$ conda activate ersilia

To deactivate an active environment, use

$ conda deactivate

- Installed [Docker](https://docs.docker.com/desktop/wsl/) to ensure seamless integration with Ersilia's environment.

(ersilia) taliqamuhib@Taliqa-Muhib:~$ pip install docker
Collecting docker
Using cached docker-6.1.3-py3-none-any.whl (148 kB)
Collecting requests>=2.26.0
Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting packaging>=14.0
Downloading packaging-24.0-py3-none-any.whl (53 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.5/53.5 kB 278.7 kB/s eta 0:00:00
Collecting urllib3>=1.26.0
Using cached urllib3-2.0.7-py3-none-any.whl (124 kB)
Collecting websocket-client>=0.32.0
Using cached websocket_client-1.6.1-py3-none-any.whl (56 kB)
Requirement already satisfied: certifi>=2017.4.17 in ./miniconda3/envs/ersilia/lib/python3.7/site-packages (from requests>=2.26.0->docker) (2022.12.7)
Collecting idna<4,>=2.5
Using cached idna-3.6-py3-none-any.whl (61 kB)
Collecting charset-normalizer<4,>=2
Using cached charset_normalizer-3.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (136 kB)
Installing collected packages: websocket-client, urllib3, packaging, idna, charset-normalizer, requests, docker
Successfully installed charset-normalizer-3.3.2 docker-6.1.3 idna-3.6 packaging-24.0 requests-2.31.0 urllib3-2.0.7 websocket-client-1.6.1
(ersilia) taliqamuhib@Taliqa-Muhib:~$ wsl.exe -l -v
NAME STATE VERSION
Ubuntu Running 2
(ersilia) taliqamuhib@Taliqa-Muhib:~$ docker --version
Docker version 24.0.2, build cb74dfc

Install the Ersilia Python package.

# clone from github
git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
# install with pip (use -e for developer mode)
pip install -e .

Cloned the GitHub repo

(ersilia) taliqamuhib@Taliqa-Muhib:~$ # clone from github
git clone https://github.com/ersilia-os/ersilia.git
cd ersilia
# install with pip (use -e for developer mode)
pip install -e .
Obtaining file:///home/taliqamuhib/ersilia
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Preparing editable metadata (pyproject.toml) ... done
Collecting tqdm<5.0.0,>=4.66.1
Using cached tqdm-4.66.2-py3-none-any.whl (78 kB)
Collecting boto3<2.0.0,>=1.28.40
Using cached boto3-1.33.13-py3-none-any.whl (139 kB)
Collecting validators==0.20.0
Using cached validators-0.20.0-py3-none-any.whl
Collecting click<9.0.0,>=8.1.7
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Collecting h5py<4.0.0,>=3.7.0
Using cached h5py-3.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
Collecting emoji<3.0.0,>=2.8.0
Using cached emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Collecting pandas==1.2.4
Using cached pandas-1.2.4-cp37-cp37m-manylinux1_x86_64.whl (9.9 MB)
Requirement already satisfied: requests<3.0.0,>=2.31.0 in /home/taliqamuhib/miniconda3/envs/ersilia/lib/python3.7/site-packages (from ersilia==0.1.32) (2.31.0)
Collecting dockerfile-parse<3.0.0,>=2.0.1
Using cached dockerfile_parse-2.0.1-py2.py3-none-any.whl (14 kB)
Requirement already satisfied: docker<7.0.0,>=6.1.3 in /home/taliqamuhib/miniconda3/envs/ersilia/lib/python3.7/site-packages (from ersilia==0.1.32) (6.1.3)
Collecting pyairtable<2
Using cached pyairtable-1.5.0-py2.py3-none-any.whl (27 kB)
Collecting inputimeout<2.0.0,>=1.0.4
Using cached inputimeout-1.0.4-py3-none-any.whl (4.6 kB)
Collecting PyYAML<7.0.0,>=6.0.1
Using cached PyYAML-6.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (670 kB)
Collecting loguru<0.7.0,>=0.6.0
Using cached loguru-0.6.0-py3-none-any.whl (58 kB)
Collecting pytz>=2017.3
Using cached pytz-2024.1-py2.py3-none-any.whl (505 kB)
Collecting python-dateutil>=2.7.3
Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Collecting numpy>=1.16.5
Using cached numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting decorator>=3.4.0
Using cached decorator-5.1.1-py3-none-any.whl (9.1 kB)
Collecting jmespath<2.0.0,>=0.7.1
Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting botocore<1.34.0,>=1.33.13
Using cached botocore-1.33.13-py3-none-any.whl (11.8 MB)
Collecting s3transfer<0.9.0,>=0.8.2
Using cached s3transfer-0.8.2-py3-none-any.whl (82 kB)
Collecting importlib-metadata
Using cached importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Requirement already satisfied: packaging>=14.0 in /home/taliqamuhib/miniconda3/envs/ersilia/lib/python3.7/site-packages (from docker<7.0.0,>=6.1.3->ersilia==0.1.32) (24.0)
Requirement already satisfied: websocket-client>=0.32.0 in /home/taliqamuhib/miniconda3/envs/ersilia/lib/python3.7/site-packages (from docker<7.0.0,>=6.1.3->ersilia==0.1.32) (1.6.1)
Requirement already satisfied: urllib3>=1.26.0 in /home/taliqamuhib/miniconda3/envs/ersilia/lib/python3.7/site-packages (from docker<7.0.0,>=6.1.3->ersilia==0.1.32) (2.0.7)
Collecting urllib3>=1.26.0
Using cached urllib3-1.26.18-py2.py3-none-any.whl (143 kB)
Requirement already satisfied: idna<4,>=2.5 in /home/taliqamuhib/miniconda3/envs/ersilia/lib/python3.7/site-packages (from requests<3.0.0,>=2.31.0->ersilia==0.1.32) (3.6)
Requirement already satisfied: certifi>=2017.4.17 in /home/taliqamuhib/miniconda3/envs/ersilia/lib/python3.7/site-packages (from requests<3.0.0,>=2.31.0->ersilia==0.1.32) (2022.12.7)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/taliqamuhib/miniconda3/envs/ersilia/lib/python3.7/site-packages (from requests<3.0.0,>=2.31.0->ersilia==0.1.32) (3.3.2)
Collecting six>=1.5
Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting typing-extensions>=3.6.4
Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting zipp>=0.5
Using cached zipp-3.15.0-py3-none-any.whl (6.8 kB)
Building wheels for collected packages: ersilia
Building editable for ersilia (pyproject.toml) ... done
Created wheel for ersilia: filename=ersilia-0.1.32-py3-none-any.whl size=16571 sha256=04fe4ed2109937704c68d575b6a460047411f6aeb5820557a7418658cd8328e1
Stored in directory: /tmp/pip-ephem-wheel-cache-969dvxoh/wheels/64/61/ca/99c49e77d88adae342f7334a29edca8bbd0512fc4b7fd6762c
Successfully built ersilia
Installing collected packages: pytz, zipp, urllib3, typing-extensions, tqdm, six, PyYAML, numpy, loguru, jmespath, inputimeout, emoji, dockerfile-parse, decorator, validators, python-dateutil, importlib-metadata, h5py, pyairtable, pandas, click, botocore, s3transfer, boto3, ersilia
Attempting uninstall: urllib3
Found existing installation: urllib3 2.0.7
Uninstalling urllib3-2.0.7:
  Successfully uninstalled urllib3-2.0.7
Successfully installed PyYAML-6.0.1 boto3-1.33.13 botocore-1.33.13 click-8.1.7 decorator-5.1.1 dockerfile-parse-2.0.1 emoji-2.10.1 ersilia-0.1.32 h5py-3.8.0 importlib-metadata-6.7.0 inputimeout-1.0.4 jmespath-1.0.1 loguru-0.6.0 numpy-1.21.6 pandas-1.2.4 pyairtable-1.5.0 python-dateutil-2.9.0.post0 pytz-2024.1 s3transfer-0.8.2 six-1.16.0 tqdm-4.66.2 typing-extensions-4.7.1 urllib3-1.26.18 validators-0.20.0 zipp-3.15.0

Checked that the CLI is working in the terminal. ✔️

Hello World! 👋

ABOUT ME!

Hi! I am Taliqa Muhib, a curly-haired Pakistani girl from the Karakorum mountains who often codes, sings and writes, and is a part-time gamer. 💻

HOW I GOT INTO CS AND AI?

As a girl growing up in a rural area with limited resources, surrounded by stereotypical, discouraging attitudes towards women's education, it was always my dream to expand my expertise to an international level. Breaking with convention, I left home and was admitted to a public university 500 miles away. During my bachelor's degree I took some online courses where I built my machine learning basics and applied them in my research, for which I received a distinction. With that zeal, I am currently pursuing my master's in computer science, and my fascination with AI and research has led me into the realms of machine learning, deep learning, and generative AI. I'm eager to apply these skills in real-world scenarios, and I believe Outreachy is the best platform. 🌍💻📚

WHY ERSILIA IN OUTREACHY?

After I got the email about being selected in the initial application, I came across Ersilia's search for interns, and it felt like the perfect opportunity to align my interests and expertise. Their focus on Open Source Artificial Intelligence for Neglected Diseases resonated deeply with me. Having experienced firsthand the challenges of limited access to healthcare services and the devastating impact of neglected diseases in rural areas, I felt a strong connection to their mission. 🤝🏥💡

When I was reading about @Ersilia, I watched a video that truly struck a chord: @GemmaTuron, the CEO of Ersilia, mentioned the disparity in the cure of diseases between rural areas in developing countries and developed nations. This inequality in access to healthcare and treatment options deeply resonated with my own experiences growing up in a rural area with limited resources. 💔🌐

When I was in the last year of my bachelor's, one of my friends experienced the unbearable pain of renal stones. Witnessing her suffering was heartbreaking, especially knowing that she was misdiagnosed and was taking painkillers, which eventually led to stomach problems. This experience pushed me to research the early detection of renal stones from CT images using vision transformers (NOVEL at that time). I have the passion to stay up and work through sleepy nights with coffee. My goal is to bring innovation to healthcare.

WHY ME?

My background and skills make me uniquely suited to contribute to Ersilia's work. I have real-world experience in prompting and, as a former ML engineer, I have solid foundations in ML, DL and gen AI. By participating in the Outreachy internship, I hope not only to expand my horizons but also to make a tangible difference in the lives of those affected by neglected diseases. I am driven by the belief that by combining technology and community-driven efforts, we can save lives and improve access to healthcare services for all.

Use model through CLI

My 1/2 work of WEEK 1 with @ersilia-os

Recap

I read about @ersilia-os and checked the website. I was truly AMAZED by the great minds working there. After that I went through the HANDBOOK they provide as a contribution guide. As per the instructions, I downloaded and installed Ubuntu. After that, I installed Ersilia and all its dependencies, and installed Docker as well. Along with that, I opened this issue, where I shared why I really want to contribute and be part of Ersilia as an Outreachy intern.

Installed the Ersilia Model Hub and tested the simplest model

(ersilia) taliqamuhib@Taliqa-Muhib:~/ersilia$ # Halicin
ersilia api run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
# also works with the run command directly
ersilia run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
{
"input": {
    "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
    "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
    "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
},
"output": {
    "outcome": [
        0.9924924
    ]
}
}
{
"input": {
    "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
    "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
    "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"
},
"output": {
    "outcome": 0.9924923777580261
}
}

Faced a problem while fetching Ersilia - I FIXED IT 🥇

Testing Ersilia with Docker

To test Ersilia with Docker, I installed Docker for Windows, and a funny thing happened: I was fetching a model during the Docker installation, and at the end there was an option to restart. I thought it would just restart the program, but the laptop restarted! THE MOMENT I REALISED I .....

Week 2 - Get Familiar with Machine Learning for Chemistry

Recap

Before starting Week 2 we had a meeting with the Ersilia team. They introduced themselves and guided us on how to contribute during the current contribution period! They shared the papers behind the models with us so we could get our hands dirty and get to know ML for chemistry.

Select a model from the list suggested in GitBook

This week's tasks were pretty interesting: we have to select a model that kicks your mind into diving deeper into it.

I selected the Ersilia model eos6oli, the SolTranNet model from the paper "SolTranNet-A Machine Learning Tool for Fast Aqueous Solubility Prediction" by Francoeur et al. The reason is simple! I JUST WANT TO KNOW WHETHER SUGAR DISSOLVES FASTER THAN SALT! :) I really love to play with transformer-based models.

SolTranNet is a molecule attention transformer (MAT) that predicts aqueous solubility from a molecule's SMILES representation. It is a regression model: it predicts log S (the log of the solubility), which can be used to filter out insoluble compounds. It is fine-tuned from the pretrained MAT model, and it applies self-attention to a molecular graph representation of the molecule.

SolTranNet's dependencies are RDKit (2017.09.1+), NumPy (1.19.3), PyTorch (1.7.0+), and pathlib (1.0+). SolTranNet achieves a sensitivity of 94.8% on the Second Challenge to Predict Aqueous Solubility (SC2) data set and is competitive with the other methods submitted to the competition.

Data sets. AqSolDB is the data set used for training SolTranNet, as it was the largest publicly available set. The ESOL data, shown in orange in the given graph, was used while training the MAT model.

Screenshot 2024-03-10 214848

Ersilia model eos6oli Implementation.

I implemented it in Colab: EASY AND FAST compared to my local system.
Made a GitHub repository here. Took template inspiration from here.
For data visualization I used a histogram, which clearly shows frequency vs. log of solubility.
The paper clearly mentions that, to recast the prediction as a classification task, a soluble compound is defined as a compound with log S > -4.
Moreover, this model predicts the log solubility but does not have a built-in threshold to classify whether a compound is soluble or not!
Using the given reference_library.csv, I evaluated and visualized the data in this figure.

download (2)

Here we can see that most of the compounds, almost 65% (shown in red), have log S < -4 and are therefore classed as insoluble. Model bias will be evaluated once I run SolTranNet and compare. FOR NOW, THIS IS WHAT I HAVE WORKED ON.
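The log S > -4 cutoff mentioned above can be sketched in a few lines. This is only an illustration, not the code from the notebook: the `predicted_log_s` values are made up, and the name `is_soluble` is a hypothetical helper. In practice the values would come from the model output for reference_library.csv.

```python
# Hypothetical predicted log S values (illustration only, not real model output).
predicted_log_s = [-6.2, -5.5, -4.0, -3.1, -1.8, -0.4]

# Cutoff from the paper: a compound with log S > -4 is treated as soluble.
SOLUBLE_THRESHOLD = -4.0


def is_soluble(log_s, threshold=SOLUBLE_THRESHOLD):
    """Return True when a predicted log S lies strictly above the threshold."""
    return log_s > threshold


labels = [is_soluble(value) for value in predicted_log_s]
insoluble_fraction = labels.count(False) / len(labels)

print(labels)
print(f"Insoluble fraction: {insoluble_fraction:.2f}")
```

Passing a different `threshold` to `is_soluble` is one simple way to see how the soluble/insoluble split moves when the cutoff changes.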

References

Francoeur, P. G., & Koes, D. R. (2021). SolTranNet-A Machine Learning Tool for Fast Aqueous Solubility Prediction. Journal of Chemical Information and Modeling, 61(6), 2530–2536. https://doi.org/10.1021/acs.jcim.1c00331

Hi @Talikamuhib good job so far! The reason the implementation in the Ersilia Model Hub does not generate predictions directly is because different users of the model might want to keep varying thresholds for binarizing the outcome.

Few comments on your work:

In your repo could you create a figures folder and save the figures there?
Perhaps you can try experimenting with different thresholds of solubility and comment on the results you obtain.

@DhanshreeA Thank you for the feedback! That "good job" literally 10X'd my motivation and switched on turbo mode to contribute. Thank you for clearing up my confusion about the model's threshold!
I will implement your comments ASAP! I have loads of work to do on it, and I will try more focused graphs and charts for better visualization.

PS. I want to get predictions on something that is totally new to the model! As it is trained on loads of data, I feel it has already seen this data. Can you help me find a new dataset of compounds from 2022-23-24, with SMILES and their labels?

MY 3/4 Work of Week 2 with @Ersilia-os

Recap:

I have started to work on week 2, found some basic results and managed to fetch the selected model with Docker! I can do predictions from the CLI now! That's so cool! I tested the model on Colab and found very basic results, but I want to try it via the CLI and analyse, and also check whether the results change or not!

Run predictions for the 1000 molecules, create the necessary plots and explain the results you are obtaining

The week is getting more and more interesting as I dive in. The aim of this week is to find out whether the selected models are accurate and reproducible. SO THE LONG STORY BEGINS with checking model bias: is the model generalizing properly or not? After that comes reproducibility: are the trends and insights of the eos6oli predictions similar to those in the model's paper, or are they different? AND LAST BUT NOT LEAST, "performance": how will the model react to unseen data?? Let the story begin!

eos6oli

T1 Model bias: I ran the predictions in the first 1/4 of the week but was not able to give good insights through visualizations. The POINT TO PONDER was that the model predicts log S, a measure of aqueous solubility! In the paper, it was clearly mentioned that a compound with log S less than -4 is insoluble, between -4 and -2 it is partially soluble, and greater than -2 it is soluble! The figure classifies the predictions into soluble, partially soluble and insoluble.

Histogram-soluble-insoluble-slighly soluble
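A minimal sketch of the three-way split (insoluble below -4, partially soluble from -4 to -2, soluble above -2) could look like the following. The example values are invented, `solubility_class` is a hypothetical helper, and putting the exact boundary values -4 and -2 into the "partially soluble" class is an assumption, since the text does not say where they belong.

```python
from collections import Counter


def solubility_class(log_s):
    """Map a log S value to one of three classes.

    Assumption: the boundary values -4 and -2 fall into "partially soluble".
    """
    if log_s < -4:
        return "insoluble"
    if log_s <= -2:
        return "partially soluble"
    return "soluble"


# Hypothetical predictions (illustration only).
predicted = [-6.0, -5.5, -4.0, -3.0, -2.0, -1.0]

counts = Counter(solubility_class(value) for value in predicted)
for name in ("insoluble", "partially soluble", "soluble"):
    share = 100.0 * counts[name] / len(predicted)
    print(f"{name}: {counts[name]} ({share:.1f}%)")
```

The counts and percentages computed this way are the kind of numbers a histogram or pie chart of the three classes would display.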

For a better overall view of the predicted results, a pie chart works well.

piechart

Here we can see that most of the data samples are predicted to be insoluble. This can be visualized even better as

Histogram-Density-plot

Here we can see that most of the samples fall around -5.5, which is not too far from -4 (the threshold between soluble and insoluble)! The hump is shifted towards the left. My hypothesis here is that the model is slightly biased. Let's confirm the hypothesis and make it better!

SolTranNet

T2 Reproducibility: After reading the paper a couple of times, I got to know some important terms that I had come across in high school! I ran the predictions on the same data used for the Ersilia eos6oli predictions and found only minor differences. Overall, the SolTranNet predictions were mostly similar to the eos6oli predictions on the same data.
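One way to put a number on "mostly similar" is to compare the two prediction lists directly. The sketch below is an illustration under assumptions: the two lists are invented values in matching order, and `mean_absolute_difference` and `class_agreement` are hypothetical helper names, not functions from Ersilia or SolTranNet.

```python
def mean_absolute_difference(a, b):
    """Average absolute gap between two equally long prediction lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def class_agreement(a, b, threshold=-4.0):
    """Fraction of molecules placed on the same side of the log S threshold."""
    same = sum((x > threshold) == (y > threshold) for x, y in zip(a, b))
    return same / len(a)


# Hypothetical predictions for the same four molecules (illustration only).
eos6oli_preds = [-5.50, -3.20, -4.10, -1.00]
soltrannet_preds = [-5.48, -3.25, -3.90, -1.02]

gap = mean_absolute_difference(eos6oli_preds, soltrannet_preds)
agreement = class_agreement(eos6oli_preds, soltrannet_preds)

print(f"Mean absolute difference: {gap:.4f}")
print(f"Class agreement at log S > -4: {agreement:.2%}")
```

A small numeric gap can still flip the class of a molecule that sits near the threshold, which is why both measures are worth looking at.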

In this figure, the soluble and insoluble predictions are similar; only 0.10% differ.

histogram-2-SolTranNet

Similarly:

histogram-3-SolTranNet

Moreover, this figure shows the majority of predictions at around -5.5:

histogram-all-SolTranNet

Overall distribution is also similar

piechart - SolTranNet

The main difference I found is the time! Predictions from Ersilia eos6oli in Colab took 60.33 seconds, while SolTranNet took just 3 seconds.

Testing of TEST DATASET of SolTranNet

Before searching for a large dataset consisting of new compounds, I ran the predictions on the TEST data of SolTranNet! These are the results, where I compared the real experimental values with the predictions of SolTranNet and Ersilia eos6oli.

| Metric (vs. experimental Y) | SolTranNet | eos6oli |
| --- | --- | --- |
| Accuracy | 0.894800120228434 | 0.894800120228434 |
| Precision | 0.9035404624277457 | 0.9035404624277457 |
| Recall | 0.9678792569659442 | 0.9678792569659442 |
| F1 Score | 0.9346038863976084 | 0.9346038863976084 |
| AUROC for Solubility | 0.8775305174392753 | 0.8775305174392753 |
| R2 Score for Solubility | 0.24743684531626464 | 0.24743684531626464 |

As we can see here, SolTranNet and Ersilia's eos6oli provide similar results: the performance of both on the test data is similar!

Hi @Talikamuhib awesome work so far! I apologize I missed your comment earlier:

PS. I want to get predictions on something that is totally new to the model! As it is trained on loads of data, I feel it has already seen this data. Can you help me find a new dataset of compounds from 2022-23-24, with SMILES and their labels?

Can you confirm if you still need this and I can look into this. :)

Hi @DhanshreeA, thank you for hyping my motivation 100X! It would be great if you could help me find them! At the moment I am running predictions on the test dataset that was used to test the model. If I get new data, it will be a novel experiment, and it will show whether the model is still applicable to newly generated compounds or not! Moreover, kindly share your comments on the work I did so that I can work more to make it a worthy contribution to Ersilia.

Week 3 - Validate a Model in the Wild

Recap:

Interestingly (I WAS SO EXCITED TO WORK), I started working on week 3 during week 2, thinking it was part of week 2. I used the SolTranNet testing dataset here, specifically to keep DATA LEAKAGE as low as possible. I ran the test data just to see whether both models behave similarly and to produce some interesting visualizations. Moreover, to make the AUROC curves I used a solubility threshold of -4, as given in the paper! It is not possible to build AUROC curves on a regression task, so I converted it to binary classification and gathered the results.
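The conversion from regression to binary classification can be sketched as follows: experimental log S values are turned into true labels with the -4 threshold, and the predicted log S is used as the ranking score. This is a plain-Python illustration with invented numbers. The `auroc` function is a simple pairwise implementation written for this example (fine for small lists, slow for large ones), not the routine used in the notebook.

```python
def auroc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    positives = [s for flag, s in zip(labels, scores) if flag]
    negatives = [s for flag, s in zip(labels, scores) if not flag]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


THRESHOLD = -4.0  # log S > -4 is treated as soluble

# Hypothetical experimental and predicted log S values (illustration only).
experimental_log_s = [-6.0, -5.0, -4.5, -3.5, -2.0, -1.0]
predicted_log_s = [-5.5, -3.8, -4.8, -4.2, -2.5, -0.5]

# Binarize the experimental values; keep predictions continuous as scores.
true_labels = [value > THRESHOLD for value in experimental_log_s]
score = auroc(true_labels, predicted_log_s)

print(f"AUROC: {score:.3f}")
```

Only the experimental values are binarized here. Keeping the predictions continuous lets the AUROC measure how well the model ranks soluble compounds above insoluble ones.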

Find a suitable dataset with sufficient experimental results

Finding a dataset was a huge journey for me, but this made my life a piece of cake. Moreover, I found the data that I am using AS THE WILD DATASET: a suitable dataset with sufficient experimental results that is clean and standardized.

After analysing this table, I decided to take the ChEMBL dataset, which has less redundancy and smaller differences in log S values.

Results

The best metrics for evaluating the predictions are RMSE and Mean Absolute Error, so the results were:

Mean Absolute Error (MAE): 1.3373231591422308
Root Mean Squared Error (RMSE): 1.7280959712008543

I really didn't want to share this, but the R2 came out negative. R-squared value: -0.8145716984892839
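For context, the three metrics above can be computed as in the sketch below, which also shows how R-squared can go negative. The numbers are invented and `regression_metrics` is a hypothetical helper. In this toy case the predictions follow the trend perfectly but are shifted by a constant offset, which alone is enough to push R-squared below zero, meaning the predictions do worse than always guessing the mean of the experimental values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R-squared) for two equally long lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical values: every prediction is exactly 2 log units too low.
y_true = [-3.0, -2.0, -1.0]
y_pred = [-5.0, -4.0, -3.0]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE: {mae}, RMSE: {rmse}, R2: {r2}")
```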

The ChEMBL dataset was 3 times bigger than the training data! If we trained on more data, we could get better results!

Hi @Talikamuhib good work, however, I think there has been some confusion. I see that you have utilized the testing dataset from SolTranNet for the model validation task (i.e. Week 3 T3). Can you confirm if that is indeed the case?

I also have some comments around clean up:

Can you rename the data files and add a README within the data directory that mentions exactly which file is what? I am looking for something like: reference_library.csv or 1000_mols.csv (for task 1), test_data.csv (for task 2), and external_validation.csv (this is for task 3).
Regarding obtaining a dataset, I would suggest you visit external databanks such as ChEMBL or PubChem, and look for a few compounds (identified by InChIKey) that are neither in the training set of SolTranNet nor in the test set used for task 2. Make sure these have an experimentally determined value. You don't need to bother with which year they are from; they should just not be repeated in the datasets you have used for tasks 1 and 2. Hope this helps!

Remember, you do not need to finish all the tasks. Creating a final application is more important!

Hi @DhanshreeA, thank you for the feedback.

Your concern about using the test dataset for T3 is right! I was preparing my notebook for the wild dataset so that, when I get it, I can just run the commands and add the visualizations and comments.

As per your suggestions, I have added the README.md.

I found a very amazing page that had the data. According to the article I found, which includes a redundancy matrix, AqSolDB and ChEMBL have just 1.76% redundancy, so ChEMBL is the better choice. So I used it!
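A rough sketch of how redundancy between datasets can be checked by InChIKey, and how already-seen compounds can be dropped from an external validation set, is shown below. The keys are short placeholders rather than real InChIKeys, and `redundancy_percent` and `drop_seen` are hypothetical helper names. This is not how the 1.76% figure in the article was computed.

```python
def redundancy_percent(candidate_keys, reference_keys):
    """Percentage of unique candidate keys that also appear in the reference set."""
    candidate = set(candidate_keys)
    shared = candidate & set(reference_keys)
    return 100.0 * len(shared) / len(candidate)


def drop_seen(candidate_keys, *seen_collections):
    """Keep only candidate keys that are absent from every seen collection."""
    seen = set().union(*seen_collections)
    return [key for key in candidate_keys if key not in seen]


# Placeholder identifiers standing in for InChIKeys (illustration only).
training_keys = ["KEY-A", "KEY-B", "KEY-C"]
test_keys = ["KEY-D"]
external_keys = ["KEY-B", "KEY-D", "KEY-E", "KEY-F"]

overlap = redundancy_percent(external_keys, training_keys)
unseen = drop_seen(external_keys, training_keys, test_keys)

print(f"Redundancy with training set: {overlap:.2f}%")
print(f"Compounds left for external validation: {unseen}")
```

Matching on InChIKey rather than on the raw SMILES string avoids treating two different SMILES spellings of the same molecule as different compounds.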

Looks good @Talikamuhib please create your final application!

✍️ Contribution period: Taliqa Muhib #993