Non-equilibrium cycling results are too large to use

jthorton commented 2 weeks ago

In broad terms, what are you trying to do?

We have been testing the new non-equilibrium cycling protocol on alchemiscale and have run the tyk2 system with network key AlchemicalNetwork-f1ed4805ceeb843cd7a587f41bfdfade-asap-public-tyk2_neq_testing_2. The calculations ran fine, but we are unable to pull down the results without running out of memory.

After searching through the S3 bucket I found a result JSON which is almost 800MB; it can be pulled down locally using the code below. Once pulled down, the final result is only around 200KB and contains only the forward and reverse work. It seems to be missing the input components, which are needed to work out in which phase the transformation was run (possibly an issue with the gather method for this protocol?). If I manually pull the JSON file from S3 and load it, I can see that over 100 copies of the protein have been saved, one for each protocol_unit. This might indicate that the keyed_chain representation has not been used for this result, or that it has failed to deduplicate the protein objects.

I think there are two main issues: 1) we need to shrink the results and deduplicate the protein and small molecule data, and 2) we need to ensure that the input component information is not lost when gathering the result.

import json
import os

from alchemiscale import AlchemiscaleClient
from gufe.tokenization import JSON_HANDLER

transform = "Transformation-228318ad34503037d7b8e683c60b3682-asap-public-tyk2_neq_testing_2"
client = AlchemiscaleClient(api_url="https://api.alchemiscale.org", identifier=os.environ["ALCHEMISCALE_ID"], key=os.environ["ALCHEMISCALE_KEY"])

# this takes around 5 mins for a single result
raw_result = client.get_transformation_results(transform)
# check the length of the JSON serialization of the gathered result
len(json.dumps(raw_result.to_dict(), cls=JSON_HANDLER.encoder))

191437 ~ 200KB

jthorton commented 2 weeks ago

It looks like KeyedChain simply wasn't used when saving this result: if I save it locally using the KeyedChain method, the JSON is only 2.3MB.
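To illustrate why a keyed-chain style representation shrinks a result like this so dramatically, here is a minimal, stdlib-only sketch of the general idea: every nested object is stored once under a content hash and referenced by key wherever it recurs. This is a toy stand-in, not gufe's actual KeyedChain implementation; the names `to_chain`, `from_chain`, and the `":ref:"` marker are hypothetical, and the "protein" is just a placeholder list of integers.

```python
import hashlib
import json

REF = ":ref:"  # hypothetical marker for "look this object up by key"


def _flatten(obj, store):
    """Replace every nested dict with a reference, storing each unique dict once."""
    if isinstance(obj, dict):
        body = {k: _flatten(v, store) for k, v in obj.items()}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        key = digest[:16]
        store.setdefault(key, body)  # identical content is stored only once
        return {REF: key}
    if isinstance(obj, list):
        return [_flatten(v, store) for v in obj]
    return obj


def to_chain(obj):
    """Return a deduplicated representation: a root reference plus an object table."""
    store = {}
    root = _flatten(obj, store)
    return {"root": root, "objects": store}


def from_chain(chain):
    """Rebuild the original nested structure from the deduplicated representation."""
    def expand(node):
        if isinstance(node, dict):
            if set(node) == {REF}:
                return expand(chain["objects"][node[REF]])
            return {k: expand(v) for k, v in node.items()}
        if isinstance(node, list):
            return [expand(v) for v in node]
        return node

    return expand(chain["root"])


# Placeholder data: 100 unit results that each embed the same "protein".
protein = {"name": "tyk2", "atoms": list(range(5000))}
result = {"units": [{"unit": i, "protein": protein} for i in range(100)]}

naive_size = len(json.dumps(result))
chain_size = len(json.dumps(to_chain(result)))
print(f"naive: {naive_size} chars, deduplicated: {chain_size} chars")
```

In this toy case the protein is serialized once instead of 100 times, which is the same kind of saving as the drop from roughly 800MB to 2.3MB reported above.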

dotsdl commented 2 weeks ago

Thanks for this @jthorton! We plan to extend our use of KeyedChains to ProtocolDAGResults (along with compression) in #220, which is currently slated for the next major release. This should address these crazy size issues.
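As a rough illustration of how compression could help on top of deduplication, the sketch below gzips a JSON payload with heavy repetition and unpacks it again. The thread does not say which compression scheme #220 will use, so gzip here is purely an assumption, and `pack` / `unpack` are hypothetical helper names.

```python
import gzip
import json


def pack(obj) -> bytes:
    """Serialize to JSON and gzip-compress the resulting bytes."""
    return gzip.compress(json.dumps(obj).encode("utf-8"))


def unpack(blob: bytes):
    """Inverse of pack: decompress and parse the JSON."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Placeholder payload with the same text block repeated in every unit result.
payload = {"units": [{"unit": i, "protein": "ATOM CA ALA " * 1000} for i in range(100)]}

raw_size = len(json.dumps(payload).encode("utf-8"))
packed_size = len(pack(payload))
print(f"raw: {raw_size} bytes, gzipped: {packed_size} bytes")
```

Compression alone reduces what is stored and transferred, but the fully expanded object still has to fit in memory after decompression, so it complements deduplication rather than replacing it.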

As for the input components, if you are pulling these out of ProtocolResult.data, the structure of that property is not guaranteed to be similar across different Protocols. Protocol authors are allowed to choose its contents and how it is structured, and they will make different choices.

The full inputs for any ProtocolDAGResult/ProtocolResult are found in their corresponding Transformation. We recommend using that object to introspect the content and form of the input components.
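A minimal sketch of that approach for the phase question, assuming the Transformation exposes `stateA` and `stateB` end states whose `components` attribute maps labels to component objects. The helper `infer_phase`, the phase labels it returns, and the stand-in classes are hypothetical; the stand-ins exist only so the example runs without gufe installed.

```python
from types import SimpleNamespace


# Stand-ins named after the gufe component classes, used only for this demo.
class ProteinComponent: ...
class SolventComponent: ...
class SmallMoleculeComponent: ...


def infer_phase(transformation) -> str:
    """Guess the phase from the component types present in either end state."""
    kinds = {
        type(component).__name__
        for state in (transformation.stateA, transformation.stateB)
        for component in state.components.values()
    }
    if "ProteinComponent" in kinds:
        return "complex"
    if "SolventComponent" in kinds:
        return "solvent"
    return "vacuum"


def make_leg(**components):
    """Build a fake transformation whose two end states share the same components."""
    state = SimpleNamespace(components=components)
    return SimpleNamespace(stateA=state, stateB=state)


complex_leg = make_leg(
    ligand=SmallMoleculeComponent(), protein=ProteinComponent(), solvent=SolventComponent()
)
solvent_leg = make_leg(ligand=SmallMoleculeComponent(), solvent=SolventComponent())
vacuum_leg = make_leg(ligand=SmallMoleculeComponent())

for leg in (complex_leg, solvent_leg, vacuum_leg):
    print(infer_phase(leg))
```

With a real client, the same function could presumably be applied to the object returned for the transformation key (for example via the client's `get_transformation` method), rather than relying on the layout of ProtocolResult.data.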

jthorton commented 2 weeks ago

Thanks very much for the help, @dotsdl! I'll close this as it's a duplicate of #220.

OpenFreeEnergy / alchemiscale

Non-equilibrium cycling results are too large to use #326