ersilia-os / aws-utils

Utility scripts to interact with AWS
GNU General Public License v3.0

EC2 Instance for ZairaChem Model training #2

Open GemmaTuron opened 1 month ago

GemmaTuron commented 1 month ago

EC2 instance to train ZairaChem models in the cloud and save resources & avoid loadshedding :)

sucksido commented 1 month ago

Hi @GemmaTuron, I need the following permissions on EC2 to be able to configure SSH: ec2-instance-connect:SendSSHPublicKey and ec2:DescribeInstances

sucksido commented 1 month ago

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2-instance-connect:SendSSHPublicKey",
      "Resource": "arn:aws:ec2:region:account-id:instance/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/tag-key": "tag-value" }
      }
    },
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
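The policy above still contains the template placeholders `region`, `account-id`, `tag-key` and `tag-value`. As a minimal sketch of how they could be filled in, the helper below builds the same two statements from concrete values, using the standard EC2 instance ARN layout `arn:aws:ec2:<region>:<account-id>:instance/<instance-id>`. The function name and the example region, account ID and tag are hypothetical and only for illustration.

```python
import json


def build_ssh_policy(region, account_id, tag_key, tag_value):
    """Return the policy from the comment above with its placeholders filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow pushing a temporary SSH key to tagged instances only.
                "Effect": "Allow",
                "Action": "ec2-instance-connect:SendSSHPublicKey",
                "Resource": f"arn:aws:ec2:{region}:{account_id}:instance/*",
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            },
            {
                # DescribeInstances is granted on all resources, as in the original.
                "Effect": "Allow",
                "Action": "ec2:DescribeInstances",
                "Resource": "*",
            },
        ],
    }


# Hypothetical values; replace with the real region, account ID and tag.
example = build_ssh_policy("eu-central-1", "123456789012", "project", "zairachem")
print(json.dumps(example, indent=2))
```

The printed JSON can be pasted into the IAM policy editor once the example values are replaced.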

sucksido commented 1 month ago

image

GemmaTuron commented 1 month ago

Hey @sucksido I get this error: The service ec2 does not support specifying a Region in the resource ARN

sucksido commented 1 month ago

There seems to be an issue with the region; we can try this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2-instance-connect:SendSSHPublicKey",
      "Resource": "arn:aws:ec2:instance/*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/tag-key": "tag-value" }
      }
    },
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}

sucksido commented 1 month ago

I have successfully launched an EC2 instance to run ZairaChem. I am currently working on configurations and dependency installations, and will begin testing after all is working.

GemmaTuron commented 1 month ago

Should I still try the above permissions?

GemmaTuron commented 1 week ago

Hi @sucksido, when you can, please post an update on the status of this.

sucksido commented 1 week ago

Update: ZairaChem has been set up on an EC2 instance. I will share the login details and the instructions privately. I am now training models and encountered a metadata issue that Jason previously raised; I am going to fix this manually today and continue testing.

sucksido commented 1 week ago

Successfully trained a model with ZairaChem on our AWS EC2 instance. I ran the following commands:

`conda activate zairachem`

`cd zaira-chem`

`zairachem fit -i /home/ec2-user/amr_small_train.csv -c 0.1 -d low -m /home/ec2-user/zairachem_models`

`zairachem predict -i /home/ec2-user/amr_small_test.csv -m /home/ec2-user/zairachem_models -o /home/ec2-user/zairachem_test_output`
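To rerun the same fit and predict steps on other datasets, the two commands can be assembled from paths instead of typed by hand. The sketch below is one way to do that. The flags are copied verbatim from the commands above, while the function names and the parameter names `c` and `d` are hypothetical labels, not ZairaChem's own. It only prints the commands unless `dry_run=False` is passed, and it assumes the `zairachem` conda environment is already active.

```python
import shlex
import subprocess


def fit_command(train_csv, model_dir, c="0.1", d="low"):
    """Build the `zairachem fit` call with the flags used in the comment above."""
    return ["zairachem", "fit", "-i", train_csv, "-c", str(c), "-d", d, "-m", model_dir]


def predict_command(test_csv, model_dir, output_dir):
    """Build the `zairachem predict` call with the flags used in the comment above."""
    return ["zairachem", "predict", "-i", test_csv, "-m", model_dir, "-o", output_dir]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


home = "/home/ec2-user"
run(fit_command(f"{home}/amr_small_train.csv", f"{home}/zairachem_models"))
run(predict_command(f"{home}/amr_small_test.csv", f"{home}/zairachem_models",
                    f"{home}/zairachem_test_output"))
```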

sucksido commented 1 week ago

To log into the EC2 instance:

For Linux Instances:

sucksido commented 1 week ago

@JHlozek please see the comments above. I have managed to train the models and run predictions, and I have shared the login details with you privately so you can test. Please let me know how it goes.

GemmaTuron commented 6 days ago

Let's see how long a fit command takes with 2000 molecules. It would also be good to know the space needed for different model sizes. How do we scale the container size automatically? @JHlozek please pass @sucksido some datasets for testing.

JHlozek commented 6 days ago

Hi @sucksido, here are two expanded train/test sets from the same original Novartis_3D7 set:

Novartis_3D7_2k_train.csv Novartis_3D7_2k_test.csv

sucksido commented 5 days ago

Thanks @JHlozek, I will train on these and give feedback.

sucksido commented 4 days ago

The model training has been running since ~12:00 midday today. When it's done I will run the predict command.

GemmaTuron commented 4 days ago

Hi @sucksido do you have the logs of the run? I find it surprising that it takes so long to train a model on 2000 molecules

sucksido commented 4 days ago

Hi @GemmaTuron

Fit command ran from 12:05 to 17:50
Predict command ran from 20:30 to 00:51

This is all for 2000 molecules
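The elapsed times follow directly from the clock times above. Here is a minimal sketch of the arithmetic, assuming the predict run crossed midnight exactly once (the thread gives no dates); the helper name `elapsed` is hypothetical.

```python
from datetime import datetime, timedelta


def elapsed(start, end):
    """Duration between two HH:MM clock times, rolling over midnight if needed."""
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        # End time is earlier on the clock, so assume it falls on the next day.
        t1 += timedelta(days=1)
    return t1 - t0


fit = elapsed("12:05", "17:50")
predict = elapsed("20:30", "00:51")
print("fit:    ", fit)
print("predict:", predict)
print("total:  ", fit + predict)
```

Under that assumption the fit took 5 h 45 min and the predict 4 h 21 min, just over 10 hours of compute for the 2000-molecule set.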

GemmaTuron commented 4 days ago

Mm, trying to understand the costing. @JHlozek or @sucksido, did you work on the EC2 instance early in the week?

From Monday to Wednesday: 4 USD
Thursday: 0.7 USD

Does this mean 1 model costs around 1 USD?
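A rough way to read these figures is as a flat hourly rate, since an on-demand EC2 instance is generally billed for the time it is running, not for how busy it is. The sketch below does that arithmetic. It assumes the 4 USD covers three full days at a constant rate and reuses the fit and predict times reported earlier in the thread (12:05 to 17:50 and 20:30 to 00:51). The function names are hypothetical, and the Thursday figure may cover only part of a day.

```python
def hourly_rate(total_usd, days):
    """Average USD per hour for a bill that covers `days` full days."""
    return total_usd / (days * 24)


def run_cost(rate_usd_per_hour, hours):
    """Cost of keeping the instance on for `hours` at a flat hourly rate."""
    return rate_usd_per_hour * hours


# Monday to Wednesday: 4 USD over 3 days (assumed to be full days).
baseline = hourly_rate(4.0, 3)

# Fit (5 h 45 min) plus predict (4 h 21 min) from the earlier comment.
job_hours = (5 + 45 / 60) + (4 + 21 / 60)

print(f"baseline: {baseline:.3f} USD/h, {baseline * 24:.2f} USD/day")
print(f"fit + predict ({job_hours:.1f} h): {run_cost(baseline, job_hours):.2f} USD")
```

On these assumptions the instance costs about 0.056 USD per hour whether or not a model is training, so one fit plus predict on 2000 molecules comes to roughly 0.56 USD of instance time.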

sucksido commented 4 days ago

@GemmaTuron I didn't do much work on it early this week, but it's always running; we don't switch it off.

GemmaTuron commented 4 days ago

But in principle, it should show a spike yesterday because you were using it? I am trying to understand the baseline cost of having it on vs. using it. Also, as we do not train models that often, we need a way of switching this on and off. How can we go about it?

sucksido commented 4 days ago

Agreed, we can simply turn off the instance when we are not training models and only switch it on when we need it. Happy to do that.
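Switching the instance on and off can also be scripted. The sketch below is one possible approach using boto3's EC2 client calls `start_instances` and `stop_instances` together with the `instance_running` and `instance_stopped` waiters. The function names and the instance ID in the usage note are hypothetical. It also assumes the caller's IAM user has the `ec2:StartInstances`, `ec2:StopInstances` and `ec2:DescribeInstances` permissions.

```python
def start_instance(ec2, instance_id):
    """Start the instance and block until EC2 reports it as running."""
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


def stop_instance(ec2, instance_id):
    """Stop the instance and block until EC2 reports it as stopped."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
```

For example, with boto3 installed and credentials configured, `stop_instance(boto3.client("ec2"), "i-0123456789abcdef0")` would stop the instance after a training run. The client is passed in as an argument so the functions can be tried against a stub before touching a real account. A stopped instance no longer accrues compute charges, but its attached EBS volumes are still billed.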