Closed by miquelduranfrigola 1 month ago
I've gone through the repository and have a question: are the issues within the repo part of a future roadmap, or are they still work in progress?
Hi @DhanshreeA , the current issues are for a future roadmap.
The only realistic way to tackle this will be via a meeting. Let's discuss it in the upcoming group meeting.
Hello @DhanshreeA,
Quick summary of our conversation earlier today. Please correct me if I am missing something:
eos3b5e and eos4e40, for example.

Hi @miquelduranfrigola
Sharing a few quick updates here:
I ran the pipeline with eos3b5e, and the run was quite fast given that eos3b5e is a very simple model. eos4e40, however, ran for roughly ~6 hours, after which it timed out (6 hours is apparently the limit for any single workflow job in GitHub Actions), so no predictions were saved to S3 as a consequence. I suspect this is because chemprop is inefficient to run on CPU for 39,988 SMILES (the partition size of each reference-library partition). I'm going to benchmark it locally.

Thanks @DhanshreeA - this is useful.
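As a rough back-of-envelope (a sketch only; the 6-hour job limit and the 39,988-SMILES partition size come from the comment above, while the observed throughput figure is a placeholder assumption to be replaced by the local benchmark):

```python
# Back-of-envelope time budget for one partition on GitHub Actions.
JOB_LIMIT_S = 6 * 60 * 60   # 21,600 s: per-job limit on GitHub Actions
PARTITION_SIZE = 39_988     # SMILES per reference-library partition

budget = JOB_LIMIT_S / PARTITION_SIZE
print(f"Budget to finish in time: {budget:.2f} s/molecule")  # ~0.54 s/molecule

# Placeholder throughput (assumption, pending the local benchmark):
observed_s_per_mol = 1.0
est_hours = PARTITION_SIZE * observed_s_per_mol / 3600
print(f"Estimated runtime at {observed_s_per_mol} s/molecule: {est_hours:.1f} h")
```

In other words, anything slower than roughly half a second per molecule on CPU will blow past the job limit for a partition of this size.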
We will always find cases where a model is too slow, so in my opinion the best we can do, at least, is save what has already been produced and allow a second run to complete the remainder. This is not built in yet, but it is certainly within reach. Moving to Spark or similar frameworks is not an option right now, unfortunately.
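For illustration, a minimal sketch of what resuming could look like, assuming one result object per partition in S3; the bucket name and key layout below are hypothetical, not the pipeline's actual ones:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "ersilia-inference-results"  # hypothetical bucket name
PREFIX = "eos4e40/"                   # hypothetical layout: <model-id>/<partition>.csv

def completed_partitions():
    """Return the partition ids that already have a result object in S3."""
    done = set()
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
    for page in pages:
        for obj in page.get("Contents", []):
            done.add(obj["Key"].rsplit("/", 1)[-1].removesuffix(".csv"))
    return done

def pending_partitions(all_partitions):
    """The second run only schedules what the first run did not finish."""
    done = completed_partitions()
    return [p for p in all_partitions if p not in done]
```

The idea is simply that a follow-up run lists what the first run already wrote and only processes the remaining partitions.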
We definitely need to record in our AirTable how long it takes to make (for example) 1, 10, 100, and 1000 predictions for each model in a standard runner. This is very valuable information for scheduling, and most likely we want to collect it at model submission time. What do you think?
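For concreteness, a minimal timing harness along these lines, assuming a `predict(smiles_list)` callable that wraps a served model (the function is hypothetical; the actual model invocation would go inside it):

```python
import time

def benchmark(predict, smiles_pool, sizes=(1, 10, 100, 1000)):
    """Time `predict` on growing input sizes; returns {size: seconds}."""
    timings = {}
    for n in sizes:
        batch = smiles_pool[:n]
        t0 = time.perf_counter()
        predict(batch)
        timings[n] = time.perf_counter() - t0
    return timings
```

The resulting `{size: seconds}` pairs are exactly the kind of record that could go into each model's AirTable row at submission time.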
Hi @miquelduranfrigola and @DhanshreeA
This is very related to work done in the last internship and the current GDI engagement. Could we update the issue with the current state of affairs and suggest a plan for moving this forward/finishing it?
Agreed. The best option will be to close this issue and refer to everything that is happening in https://github.com/ersilia-os/model-inference-pipeline, including the major PRs and issues.
OK, I will close this issue.
Summary
The Good Data Institute team has developed a solution for running model inference at scale using GitHub Actions. By default, we calculate predictions for the ~2M molecules available in ChEMBL. Results are stored in S3 buckets and eventually written to an AWS DynamoDB table.
Their work is nicely documented in the model-inference-pipeline repository.
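For reference, writing batches of predictions into DynamoDB can be done with boto3's batch writer; a minimal sketch, where the table name and item schema are assumptions for illustration, not the pipeline's actual ones:

```python
import boto3
from decimal import Decimal

# Hypothetical table name and item schema for illustration only.
table = boto3.resource("dynamodb").Table("precalculated-predictions")

def write_predictions(items):
    """items: iterable of dicts, e.g. {"model_id", "smiles", "prediction"}.

    Note: DynamoDB rejects Python floats; numeric values must be Decimal.
    """
    with table.batch_writer() as batch:
        for item in items:
            item["prediction"] = Decimal(str(item["prediction"]))
            batch.put_item(Item=item)
```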
Objectives
We need to make a plan of adoption as soon as possible, including:
Let's please use this thread to start a discussion around this.
Documentation