Closed: jthorton closed this issue 2 weeks ago
It looks like `KeyedChain` just wasn't used when saving this result: if I save it locally using the `KeyedChain` method, the JSON is only 2.3 MB.
Thanks for this @jthorton! We plan to extend our use of `KeyedChain`s to `ProtocolDAGResult`s (along with compression) in #220, which is currently slated for the next major release. This should address these crazy size issues.
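The savings from a keyed-chain representation come from storing each unique object once and replacing repeated occurrences with key references. A minimal, stdlib-only sketch of that deduplication idea (illustrative only: the helper and key names here are hypothetical, not gufe's actual `KeyedChain` API):

```python
import json

def to_keyed_chain(units):
    """Serialize a list of result units, storing each unique payload once.

    Payloads are deduplicated by their canonical JSON form; repeated
    occurrences become {"__key__": ...} references. (A sketch of the
    general technique, not gufe's actual KeyedChain implementation.)
    """
    registry = {}  # canonical JSON of a payload -> its key
    objects = []   # [(key, payload), ...] in first-seen order
    chain = []
    for unit in units:
        canon = json.dumps(unit["protein"], sort_keys=True)
        if canon not in registry:
            key = f"obj-{len(objects)}"
            registry[canon] = key
            objects.append((key, unit["protein"]))
        chain.append({"work": unit["work"],
                      "protein": {"__key__": registry[canon]}})
    return {"objects": objects, "chain": chain}

# 100 protocol units that all embed the same large protein payload
protein = {"atoms": list(range(5000))}
units = [{"work": float(i), "protein": protein} for i in range(100)]

flat = json.dumps(units)
deduped = json.dumps(to_keyed_chain(units))
print(len(flat), len(deduped))  # the keyed form is dramatically smaller
```

The same reasoning explains the numbers above: a result that embeds one copy of the protein per protocol unit balloons to hundreds of megabytes, while a deduplicated form stays in the low megabytes.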
As for the input components: if you are pulling these out of `ProtocolResult.data`, the structure of that property is not guaranteed to be similar across different `Protocol`s. `Protocol` authors are allowed to choose its contents and how it is structured, and they will make different choices.
The full inputs for any `ProtocolDAGResult`/`ProtocolResult` are found in their corresponding `Transformation`. We recommend using that object to introspect the content and form of the input components.
Thanks very much for the help @dotsdl! I'll close this as it's a duplicate of #220.
**In broad terms, what are you trying to do?**

We have been testing the new non-equilibrium cycling protocol on alchemiscale and have run the tyk2 system with network key
AlchemicalNetwork-f1ed4805ceeb843cd7a587f41bfdfade-asap-public-tyk2_neq_testing_2
which ran fine, but we are unable to pull down the results without running out of memory. After searching through the S3 bucket I found a result JSON which is almost 800 MB and can be pulled down locally using the following. After pulling down the JSON, the final result is around 200 KB and only contains information on the forward and reverse work; it seems to be missing the input components, which are needed to work out in what phase the transformation was run (possibly an issue with the gather method for this protocol?).

If I manually pull the JSON file from S3 and load it up, I can see that over 100 copies of the protein have been saved, one for each `protocol_unit`, which might indicate that the `keyed_chain` representation has not been used for this result or has failed to deduplicate the protein objects.

I think there are two main issues:

1. we need to shrink the results and deduplicate the protein and small molecule data
2. we need to ensure that the input component information is not lost when gathering the result
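The duplication described above can be confirmed directly on a downloaded result JSON by counting identical large sub-objects. A stdlib-only sketch (the document layout and key names here are simulated, not the actual result schema):

```python
import json
from collections import Counter

def count_duplicate_payloads(obj, min_size=500):
    """Count identical large sub-objects in a parsed JSON document.

    Returns a Counter mapping the canonical JSON text of each dict
    sub-object (at least min_size characters when serialized) to the
    number of times it appears in the document.
    """
    counts = Counter()

    def walk(node):
        if isinstance(node, dict):
            canon = json.dumps(node, sort_keys=True)
            if len(canon) >= min_size:
                counts[canon] += 1
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(obj)
    return counts

# Simulated result: 100 protocol units embedding the same protein payload
protein = {"atoms": list(range(500))}
doc = {"units": [{"work": i, "protein": protein} for i in range(100)]}

counts = count_duplicate_payloads(doc)
duplicated = {c: n for c, n in counts.items() if n > 1}
print(max(counts.values()))  # the protein payload appears 100 times
```

Running a scan like this against the real 800 MB file would show whether each `protocol_unit` really carries its own full copy of the protein, which is the first issue listed above.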