aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
411 stars 184 forks source link

The transport plugin should be responsible for determining the maximum number of times a transport task can be tried, instead of `aiida` profile. #6487

Closed khsrali closed 22 hours ago

khsrali commented 2 days ago

Trial attempts may be ineffective in certain transport plugins, such as aiida-firecrest. If I understood correctly this option can be set to zero for the profile in src/aiida/manage/configuration/config.

    transport__task_maximum_attempts: int = Field(
        5, description='Maximum number of transport task attempts before a Process is Paused.'
    )

However, it would be preferable to leave it in the hands of the transport plugin, as multiple transport plugins (including SSH) may coexist in the same profile.

sphuber commented 1 day ago

Trial attempts may be ineffective in certain transport plugins, such as aiida-firecrest.

Why is this? Isn't it possible that your internet connection temporarily drops, or that the FireCREST service is (temporarily) unavailable? These would be transient problems that would be helped from the expontentiall backoff mechanism just as it does for the SSH transport, doesn't it?

khsrali commented 1 day ago

I thought the main idea of exponential backup was to avoid being banned from HPC by sending too many requests.

In FirecREST they have a similar concept, rate-limit, but it's handled internally by pyfirecrest which aiida-firecrest depends on.

I understand and agree that still there are other things could fail, like internet connection, FirecREST server being done temporarily. However, this doesn't need to be exponential.

So I think it make sense if the backoff mechanism (series of seconds 1, 2, 4, 8, 16 or 1,1,1,1) would be set from the transport plugin / computer settings.

sphuber commented 1 day ago

I thought the main idea of exponential backup was to avoid being banned from HPC by sending too many requests.

No, that is the safe_interval setting which is set on the transport. The EBM is done for all CalcJobs and is to ensure that an exception in a transport operation that is due to a transient problem (such as connection problems) doesn't except the entire process. So this has nothing to do with rate limiting

khsrali commented 22 hours ago

Alright, I think we already set safe_interval=0 in aiida-firecrest, So I think we can close here. Thanks a lot @sphuber for clearing that up,