ersilia-os / model-inference-pipeline

Ersilia's batch inference pipeline on the AWS cloud
MIT License

TypeError: Float types are not supported. Use Decimal types instead. #19

Open nicholas-kondal opened 3 months ago

nicholas-kondal commented 3 months ago

Using model eos4zfy in this run, writing to DynamoDB fails on some workers.

Upon closer inspection, it appears that for the workers that failed, the output files in S3 contained no scores (see the attached screenshot).

This is something we may need to take into account before writing to DynamoDB, i.e. some models may not produce a score for a particular input.
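For illustration, here is a minimal sketch of how a worker-side write could guard against both issues at once (the float-vs-Decimal error and missing scores). It assumes a boto3 DynamoDB table and a hypothetical `predictions` list of `(input_key, score)` pairs; the table name, field names and skip-and-log behaviour are assumptions, not the pipeline's current code.

```python
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("precalculations")  # hypothetical table name


def to_dynamo_number(value):
    """DynamoDB rejects Python floats; convert via str to keep the written value exact."""
    return Decimal(str(value))


def write_predictions(predictions):
    """predictions: iterable of (input_key, score); score may be None/empty for some inputs."""
    with table.batch_writer() as batch:
        for input_key, score in predictions:
            if score is None or score == "":
                # Missing score (timeout or the model could not compute it):
                # skip this item and log, instead of failing the whole worker.
                print(f"No score for {input_key}; skipping write")
                continue
            batch.put_item(Item={"input_key": input_key, "score": to_dynamo_number(score)})
```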

nicholas-kondal commented 3 months ago

@DhanshreeA This is the issue I shared with you today for your reference :)

Context for others: the predictions are successfully generated for all workers, but some output files are blank (as shown above). According to DA, this may be due to a request timeout: this particular model runs in an environment not managed by Ersilia, so all workers sending requests to that same external environment may cause model scores not to be generated for some of them. This is handled gracefully at the CLI level, but not yet at the inference pipeline level.

miquelduranfrigola commented 3 months ago

Thanks @nicholas-kondal

Indeed, for some inputs, it is possible that a model does not produce an output. It may be due to timeouts or any other environment issue, but sometimes it is actually expected. For example, for some molecules, it is not possible to make some calculations.

I agree that we should log this if possible. I still don't know whether the precalculation needs to be written. In theory, for timeout errors and related issues, in my opinion we should not write the precalculation. However, for cases where a null result is expected (e.g. a molecule for which, due to its chemical composition, it is not possible to get a value), it would be best to keep the result. If a default value needs to be assigned, it should be agreed by consensus (e.g. -9999999), but certainly not an average or any value derived from imputation methods (median, mode...).
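As a purely illustrative sketch of that rule in the writer: the `NULL_SENTINEL` value, the `reason` flag and the function name below are assumptions for discussion, not part of the current pipeline.

```python
from decimal import Decimal

# Agreed-upon sentinel for "a null is expected for this input" (exact value to be decided by consensus).
NULL_SENTINEL = Decimal("-9999999")


def value_to_store(score, reason=None):
    """Decide what (if anything) to write for a given model output.

    reason is a hypothetical flag describing why the score is missing:
      'timeout' / 'environment' -> do not write a precalculation at all;
      'expected_null'           -> store the agreed sentinel instead of an imputed value.
    Returning None means the caller should skip the write entirely.
    """
    if score is not None:
        return Decimal(str(score))
    if reason in ("timeout", "environment"):
        return None
    if reason == "expected_null":
        return NULL_SENTINEL
    return None  # unknown cause: safest to skip and log
```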

Do you agree, @nicholas-kondal ?

nicholas-kondal commented 3 months ago

Thanks @miquelduranfrigola, I agree that this should be logged when it occurs. I think we could use the existing tracking/logging functionality to gather any relevant information and send it (via the API I'm working on to integrate the CLI and inference store) to be stored as metadata in the cloud for further analysis via Splunk or otherwise.

But I'm also thinking about whether we would need to modify the output files before sending anything to the API, depending on the source of any null outputs, i.e. whether they come from timeouts/environment issues or are expected for the given input.

Do you know if there is any pre-existing information that could help us distinguish between models that do or don't produce null outputs, or how we can tell which inputs are compatible with which model? If a null output is generated for a model that is expected to produce a non-null output (by either the pipeline or, in the future, any individual user wanting to write their precalculations to the inference store), it may overwrite legitimate calculations in the inference store.

miquelduranfrigola commented 3 months ago

These are very relevant points @nicholas-kondal. @DhanshreeA is currently dealing with the tracking command (together with our interns), so let's coordinate so we don't duplicate efforts!

At the moment, I cannot think of a way to anticipate with 100% certainty whether a model is capable of producing null outputs or not. @DhanshreeA - any ideas?

As for overwriting legitimate calculations: in principle, if results are already stored for a given input in the inference store, we should not calculate them again and we should not overwrite them (unless we force it for some reason). As a rule, for any non-null value stored in the inference store, no re-calculation should be done, which mitigates the risk of overwriting it with a null value.
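One way that rule could be enforced on the DynamoDB side, sketched under the assumption of a table keyed by `input_key`: a conditional put that only succeeds if no item exists yet (batch writes do not support condition expressions, so this uses a plain `put_item`). Table and attribute names are assumptions.

```python
from decimal import Decimal

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("precalculations")  # hypothetical table name


def put_if_absent(input_key, score):
    """Write a precalculation only if nothing is stored yet for this input.

    Existing (possibly legitimate, non-null) values are never overwritten;
    a separate 'force' path would be needed to update them deliberately.
    """
    try:
        table.put_item(
            Item={"input_key": input_key, "score": Decimal(str(score))},
            ConditionExpression="attribute_not_exists(input_key)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # A value already exists for this input: leave it untouched.
            return False
        raise
    return True
```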

Let me know what you think!

nicholas-kondal commented 3 months ago

Thanks @miquelduranfrigola, I'm not touching the tracking functionality for now though I've been keeping tabs on that work via the PRs in case we need it.

> if results are stored for a given input in the inference store, then we should not calculate it again and we should not overwrite (unless we force it for some reason). In that case, I would say that, as a rule, for any non-null value stored in the inference store, no re-calculations should be done (therefore mitigating the risk for overwriting with a null value)

Just to clarify, would this extend to model hub users i.e. you would only like Ersilia staff to be able to write/update the inference store (manually or via automated triggers)? Or do you see model hub users having write access to the inference store but only for new predictions (while Ersilia staff has write+update access to all predictions)?

miquelduranfrigola commented 3 months ago

Thanks @nicholas-kondal - we were discussing this with @DhanshreeA today. For now, we will not allow users to write to the inference store (although, in the future, we certainly will).

Broadly speaking, this is our roadmap:

  1. Write to inference store based on a reference library of compounds.
  2. Write to inference store based on internal usage of the models, beyond the reference library.
  3. Allow users to write to inference store using https://github.com/ersilia-os/ersilia-self-service (still under development). In this case, credentials are not an issue since we can use GitHub Secrets.
  4. Allow users to write to inference store based on their local runs. In this case, we need to figure out how to deal with authentication.

I hope this makes sense.

nicholas-kondal commented 2 months ago

Makes sense, thanks!