ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🐕 Batch: Ersilia CLI Performance Improvement #1299

Open DhanshreeA opened 1 week ago

DhanshreeA commented 1 week ago

Summary

@Abellegese is looking into improving the design architecture of core components within the Ersilia CLI, as well as replacing sequential execution with asynchronous approaches. On top of this, this batch will be used to report results from benchmarking and profiling the current implementation, with updates as we improve the architecture and implement asynchronous execution within flows.

Objective(s)

No response

Documentation

No response

Abellegese commented 6 days ago

The first set of objectives to improve the performance of the Ersilia CLI is summarized below:

1) Improving performance of the fetch command.
   - Looking for unnecessary checkups, high-latency interfaces, and IO that can be made asynchronous.
2) Improving performance of the serve command.
   - At the preliminary level this does not seem to have issues, but I will look into ways to make it lighter (function stack calling is crazy everywhere).
3) Improving performance of the run command.
   - This is probably the most important command in need of performance improvement (speed mainly).
   - This command would benefit from concurrency, async ops, high-speed module binding, and simplified preprocessing steps.

The second objective is to improve the performance of IO operations at the global level.

1) Defining extensive IOs as async and concurrent ops.
2) Registering common IOs as singletons.
3) Allowing template-based async ops creation using the method specified in 2) (see the sketch below).
4) Module binding on every ersilia command call appears to add some performance overhead, so making this fast at the global level.
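A minimal sketch of how 2) and 3) could look; the names singleton, SharedIO, and run_io_ops are placeholders for illustration, not existing Ersilia code.

import asyncio
import functools

def singleton(cls):
    """Return the same instance for every instantiation of the class."""
    instances = {}

    @functools.wraps(cls)
    def get_instance(*args, **kwargs):
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]

    return get_instance

@singleton
class SharedIO:
    """Holds common IO clients (HTTP sessions, S3 handles, ...) reused across commands."""
    def __init__(self):
        self.clients = {}

async def run_io_ops(*coros):
    """Template for async ops: run independent IO coroutines concurrently."""
    return await asyncio.gather(*coros)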

Will continue adding to this as I move on.

aderemi1224 commented 6 days ago

/approve

Abellegese commented 6 days ago

Preliminary Report

Fetch CLI

I analyzed this command using py-spy and speedscope.app. The results show that this command has a ton of unnecessary (costly) checkups.

1) As @DhanshreeA mentioned in issue #1259, the decision flow for model fetching appears to be a significant performance bottleneck.

These are the profiling results for the original decision flow during model fetching (CPU time): (figure: original_decision_flows)

And this is the simplified decision flow I tried, to see how it goes: (figure: simplified_decision_flows)

Opinions

Based on the second (slightly improved) flow, I would use prior knowledge about the model repository (where each model is located and which model repos have high bandwidth or low latency, which was Docker in my case), store that information in a dictionary data structure (fast in this case: the key is the model id and the value is a list of info about the model), and use it instead of the heavy checkups and loops. Both the Docker check and the hosted check can be replaced with this.
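A minimal sketch of that idea; the model ids, field names, and the resolve_fetch_source helper are purely illustrative, not Ersilia's actual schema.

MODEL_SOURCES = {
    "eos3b5e": {"source": "dockerhub", "low_latency": True},
    "eos4e40": {"source": "hosted", "url": "https://example.org/eos4e40"},
}

def resolve_fetch_source(model_id, default="dockerhub"):
    # One O(1) dictionary lookup instead of repeated docker/hosted checkups and loops
    info = MODEL_SOURCES.get(model_id)
    return info["source"] if info else default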

DhanshreeA commented 6 days ago

/approve

@aderemi1224 only maintainers have the permissions to dispatch approve workflows. 😅 And generally we want to run them on Model Request types of issues.

DhanshreeA commented 6 days ago

@Abellegese and @miquelduranfrigola honestly I am completely in favor of removing Airtable reads to fetch metadata. It is fine that we update Airtable when a model is submitted, but then we should continue to read only from S3, especially since Airtable gets dumped into S3 anyway. To make sure the data stays consistent across both, we could use webhooks instead of the cron job we are using now.
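As a rough illustration of the webhook idea (the endpoint path, bucket, and key are assumptions, and none of this is existing Ersilia infrastructure), the sync could look roughly like this:

import json
import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")

@app.route("/airtable-webhook", methods=["POST"])
def airtable_webhook():
    # Mirror the Airtable change into S3 immediately instead of waiting for a cron run
    payload = request.get_json(force=True)
    s3.put_object(
        Bucket="ersilia-models",  # assumed bucket name
        Key="models.json",        # assumed object key
        Body=json.dumps(payload),
    )
    return {"status": "ok"}, 200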

aderemi1224 commented 6 days ago

/approve

@aderemi1224 only maintainers have the permissions to dispatch approve workflows. 😅 And generally we want to run them on Model Request types of issues.

okay @DhanshreeA. Thank you so much

Abellegese commented 5 days ago

Further Report on Fetch

I did further experiments and found the following results:

1) The standard example can be skipped because the user can check it for themselves with the run command.
2) The DockerRequirement class performs several docker commands, one of them is is_active() / is_working(), plus the class itself. The output response from the subprocess can be used as the exception if Docker is not active when pulling (see the sketch after the fetch snippet below).
3) In ModelPuller -> pull -> is_available_locally() -> SimpleDocker: this itself performs a docker command that has a performance cost.

async def fetch(self, model_id):
    # The explicit Docker activity pre-check is skipped; a failed pull
    # surfaces the same information.
    # if not DockerRequirement().is_active():
    #     raise DockerNotActiveError()

    mp = ModelPuller(model_id=model_id, config_json=self.config_json)
    self.logger.debug("Pulling model image from DockerHub")
    await mp.async_pull()
    mr = ModelRegisterer(model_id=model_id, config_json=self.config_json)

    # The post-pull steps are independent of each other, so they run concurrently
    await asyncio.gather(
        mr.register(is_from_dockerhub=True),
        self.write_apis(model_id),
        self.copy_information(model_id),
        self.modify_information(model_id),
        self.copy_metadata(model_id),
        self.copy_status(model_id),
        self.copy_example_if_available(model_id),
    )

Pulling is also done asynchronously now.
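A minimal sketch of point 2) from the list above (pull_image is a placeholder name, not the actual ModelPuller implementation): rather than paying for a separate is_active() round trip, the docker pull subprocess output itself can be turned into the exception.

import asyncio

async def pull_image(image):
    proc = await asyncio.create_subprocess_exec(
        "docker", "pull", image,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        # An inactive Docker daemon shows up here, so no pre-check is needed
        raise RuntimeError(f"docker pull failed: {stderr.decode().strip()}")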

Overall

Measuring speed is quite tricky since it depends on several factors, but I performed repeated tests. Applying the improvements specified above, I got the results below:

On my internet speed and hardware, the docker pull takes about ~45 sec on average. The rest of the code after pulling takes about 35 sec, for a total of ~80 sec. After the improvements the total is ~54 sec. The remaining overhead is now on the internet speed (the pulling operation).

The reduction in CPU time across the three improvement stages looks like below:

Original code: (profile figure: original)

Improvement one: (profile figure: improvement-one)

Improvement two: (profile figure: improvement-two)

Improvement three (this one has both concurrency and async ops): (profile figure: improvement-three)

Conclusion

As we can see from the figures above (region marked by the black box), the code execution time is reduced. With a better internet connection, which affects the pulling, the overall fetch CLI can be improved even further.

Abellegese commented 5 days ago

Preliminary Report

Serve CLI

The serve CLI doesn't have significant overhead 😐.

Abellegese commented 5 days ago

Preliminary Report

Run CLI

For the run command, I did some basic reverse engineering to get the API key, URL, and a few other things. I set the benchmark on top of the main objective of the Ersilia run command, that is: given an input, get the computed output efficiently.

1) Assuming we have a correct SMILES input and a predefined free port, we only need the code below to connect to the running container and get the result.


import requests
import time

st = time.time()
url = "http://0.0.0.0:40937/run"
input = [
    {
        "key": "NQQBNZBOOHHVQP-UHFFFAOYSA-N",
        "input": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
        "text": "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]",
    },
]
response = requests.post(url, json=input)
print(response.json())
et = time.time()
print(f"Normal API time: {et-st:.6f} second")
# Normal API time: 0.316354 second

This takes about 0.3 sec for a single SMILES, which is fast.

2) Using the Ersilia run command:

 ersilia run -i "C1=C(SC(=N1)SC2=NN=C(S2)N)[N+](=O)[O-]"

This takes about ~40 sec, which is about 126x slower than the minimum requirement the CLI should achieve, as specified in 1).

Conclusion on Preliminary Report

Fortunately the API seems very nice and fast. This indicates that the data preprocessing, stacked function calls, IOs, and many unnecessary application layers cause this severe performance reduction.

Abellegese commented 4 days ago

Function Improvement Comparison

I reimplemented the function below, which searches the model JSON data and returns the URL. The original implementation has a time complexity of O(n). Using a dictionary data structure coupled with memoization brings the lookup down to O(1) (after a one-time O(n) cache build). The original and modified versions of the code are presented below. The methods are implemented in the ModelURLResolver class.

Original code

def _find_url_using_s3_models_json(self, model_id):
    self.logger.debug(
        "Trying to find an available URL where the model is hosted using S3 Models JSON"
    )
    for mdl in self.ji.items():
        if mdl[IDENTIFIER] == model_id:
            if HOST_URL in mdl:
                url = mdl[HOST_URL]
                return url
            else:
                self.logger.debug(
                    "No hosted URL found for this model in S3 Models JSON"
                )
                return None
    self.logger.debug("Model was not found in S3 Models JSON")

Optimized code

# Initialized once (e.g. in __init__) so the cache is reused across lookups
self.models_cache = None

def _cache_models(self):
    if self.models_cache is None:
        models = self.ji._read_json_file()
        self.models_cache = {mdl["Identifier"]: mdl for mdl in models}

def _find_url_using_s3_models_json(self, model_id):
    """Search with memoization."""
    self.logger.debug(
        "Trying to find an available URL where the model is hosted using S3 Models JSON"
    )
    self._cache_models()
    model = self.models_cache.get(model_id)

    if model:
        if "Host URL" in model:
            return model["Host URL"]
        else:
            self.logger.debug(
                "No hosted URL found for this model in S3 Models JSON"
            )
            return None
    self.logger.debug("Model was not found in S3 Models JSON")

The pytest-benchmark results are presented below.

(benchmark figure: benchmark-find_from_s3)

The new implementation is a gazillion times faster.
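For reference, a benchmark along these lines could be written roughly like this with pytest-benchmark; the resolver fixture and model id are assumptions for illustration.

def test_find_url_using_s3_models_json(benchmark, resolver):
    # benchmark() calls the function repeatedly and records timing statistics
    url = benchmark(resolver._find_url_using_s3_models_json, "eos3b5e")
    assert url is None or url.startswith("http")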