I updated nextflow on the AWS workload machine to the latest version, 24.04.3, and triggered a manual run for workflow version v0.1.1 (only the real data, as the simulations had gone fine). That is currently running at https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/4Rrf1pyhMkOBw6 and seems to be working as expected, pulling images without issue.
I'm hoping there was a bug in the previous nextflow version, though I can't rule out that the time delay means we had simply gotten past any throttle limits before I began the run.
I did check the previous logs and confirmed that the token was being sent as expected.
The previous run is on track to complete, with only merge tasks remaining, despite a series of apparent container pull failures (tasks with zero execution time) during doublet detection. I am still not sure what is going on exactly, but I will keep this open for monitoring.
If we are running into pull limits, one solution may be to batch samples together to reduce the number of jobs.
Alternatively, if we can find a way to use the Fusion file system without relying so heavily on the Wave system, that might be ideal.
I take it back. The job seems to have stalled, and we are still seeing container pull errors. I'm still not sure what is going on.
Looking at the actual error on AWS Batch, I see:
CannotPullContainerError: Error response from daemon: unknown: repository 'public.ecr.aws/openscpca/doublet-detection:v0.1.0' bad request (400)
So now I wonder if it might be a problem with pulling the base layer that the wave container is built on.
Are we maybe hitting a limit for the AWS public ECR? If so, I wonder if we can pass any additional permissions to the nextflow user. I will test adding ECR read permissions and see if that gets us anywhere.
I can confirm that all the doublet_detection jobs in https://cloud.seqera.io/orgs/CCDL/workspaces/OpenScPCA/watch/VcxEXOaURslUi have now actually finished, and without any docker pull errors.
So things seem to be good, for now.
This error seems to be related to throttling of the ECR pull requests from the public channel. I'm somewhat surprised they are throttling pulls to batch instances, but that is my best guess. I did add a few more permissions to the role that is used by the instances (SSM-core), but I am not sure whether that made the difference.
If this problem recurs/persists, I think the best option may be to try to reduce the total number of jobs that we send. This would mean updating/modifying processes to allow them to take sets of samples, rather than only one at a time.
We could revert to expecting modules to take a project at a time, though this might create more unbalanced loads, or we could add some functionality to create lists of values as inputs, using `collate()` or `buffer()` to create the sets of inputs, probably followed by a `transpose()` operation. I haven't fully thought this through, though!
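As a rough illustration of that second option, here is a minimal Nextflow sketch. The channel contents, process name, script, and batch size are all hypothetical placeholders rather than our actual module definitions; the point is just the `collate()`/`transpose()` plumbing.

```groovy
// Hypothetical sketch: batch per-sample inputs so each AWS Batch job pulls the
// container once for many samples, then split results back out per sample.
workflow {
    // one [sample_id, file] tuple per sample (paths and names are made up)
    sample_ch = Channel.fromPath('data/samples/*.rds')
        .map{ f -> [f.baseName, f] }

    batched_ch = sample_ch
        .collate(20)                        // or buffer(size: 20, remainder: true)
        .map{ batch -> batch.transpose() }  // [[id, file], ...] -> [[ids], [files]]

    detect_doublets(batched_ch)

    // transpose() the grouped output back into one [id, result] tuple per sample
    // (a real module would need to keep ids and result files in matching order)
    per_sample_ch = detect_doublets.out.transpose()
}

process detect_doublets {
    input:
    tuple val(sample_ids), path(sample_files)   // receives a whole batch of files

    output:
    tuple val(sample_ids), path('*.doublets.tsv')

    script:
    // detect_doublets.R is a stand-in for whatever the module actually runs
    """
    for f in ${sample_files}; do
        detect_doublets.R --input \$f --output \$(basename \$f .rds).doublets.tsv
    done
    """
}
```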
We can reevaluate if/when we next see throttling errors, I expect.
> Looking at the actual error on AWS Batch, I see:
> CannotPullContainerError: Error response from daemon: unknown: repository 'public.ecr.aws/openscpca/doublet-detection:v0.1.0' bad request (400)
Sadly, we are again hitting the same errors with the latest (staging) run. I think the next move is probably to adjust the workflows to try to batch jobs into larger work units. I expect to be able to limp along and rerun the workflow to completion with the current settings by running the simulated and real data separately, but this should be addressed for the future.
I have been testing this in https://github.com/jashapiro/nf-wave-test and was able to reproduce the behavior yesterday with the same CannotPullContainerError. At the time I was (I think) at the throttles-somteimes tag. I submitted 1000 jobs cleanly in an initial run, but when I submitted more I immediately started to get the pull error.
I continued onward to do a bit more testing, which included some consolidation in preparation for submitting a bug report, as well as checking whether we would still see the same error using our previous AWS batch stack. When I started to test that today, I was excited to see that the error was no longer coming up, which suggested the problem was likely with our new batch stack. But when I went back to our new batch stack, I was no longer getting errors there either, despite submitting thousands of jobs.
I reverted to the tagged version when I was previously getting errors, and I have not been able to recreate errors.
At some point I had also updated Nextflow to 24.04.4 (from 24.04.3), and while I thought that this could be where the fix happened, when I reverted to the previous version I was still unable to recreate the error with either version of the profiles.
So I am now back to thinking this was a bug at the Seqera end, and perhaps they saw it somehow in logs or other usage. It is also possible that the fix was in a plugin which was not redownloaded when I downgraded, as I don't know how plugins are handled exactly. To account for that eventuality, I upgraded nextflow on the workload server, and will monitor to see if the issue remains resolved.
In the meantime, I found a few places where we seem to have settings that are no longer required, which I will be submitting separately.
Okay, I got it failing again, at the following commit: 41345f8, though it did initially fail with 4f6ec98 when I was wondering if module-specific binaries might be required.
It just seems that it may take a lot of jobs (or many runs) to induce failure. I have saved a number of log files now, so it should be possible to start to send out some inquiries with tests.
Okay, after discussion with the Seqera team on Slack, it does seem that the issue is likely to be ECR API rate limiting. Luckily, this should be fairly straightforward to solve? According to the AWS docs that I can see, we will need an account with `ecr-public:GetAuthorizationToken` and `sts:GetServiceBearerToken` permissions to allow it to log in to the public ECR for authenticated pulls, which vastly increases the rate limit. (I had kind of assumed that other `ecr-public:Get/Describe` privileges might be required, but maybe not, since it is public? We can test this.) As far as I can tell, this does not need to be an account with any particular access to other resources, which may be useful, as I believe we do need to be able to provide an access key and secret key, since SSO is not supported (though we can use a role, if desired).
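For reference, my reading of the docs is that the policy attached to that user or role could be as minimal as the following untested sketch (the statement ID is my own, and we can add other `ecr-public` read actions if they turn out to be needed):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAuthenticatedPublicEcrPulls",
      "Effect": "Allow",
      "Action": [
        "ecr-public:GetAuthorizationToken",
        "sts:GetServiceBearerToken"
      ],
      "Resource": "*"
    }
  ]
}
```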
Tagging @davidsmejia and @jaclyn-taroni for advice/thoughts on next steps.
We have added credentials to the Seqera account used by the batch workflow to allow login to the public ECR, and that seems to have solved the wave container issue! 🎉
When trying to deploy the v0.1.0 and v0.1.1 releases, we ran into errors where containers were not being pulled as expected, leading to repeated failures. Test runs do just fine, but the later runs started to fail, which seems to suggest that there might still be an issue with rate limits related to wave containers.
While we are populating the `TOWER_ACCESS_TOKEN` environment variable on launch, and we know this is being accepted because we can monitor the runs on Seqera Cloud, I wonder if we might actually need to populate the `tower.accessToken` configuration variable as well/instead for wave containers specifically. While this would surprise me, I can't think of another reason at the moment. There is also a possibility that there is an error somehow related to wave pulling containers from ECR, though again this seems unlikely to be a root cause.
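If the config value does turn out to be needed, the change would be small; something like the following in the pipeline config (an untested sketch that just mirrors the existing environment variable, which may or may not be what Wave actually wants):

```groovy
// nextflow.config sketch: set tower.accessToken explicitly, in case Wave reads
// the config scope rather than the TOWER_ACCESS_TOKEN environment variable
tower {
    enabled     = true
    accessToken = System.getenv('TOWER_ACCESS_TOKEN')
}

wave {
    enabled = true
}
```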
My plan is to wait and try to deploy the 0.1.1 release one more time, but to first have a more careful look at the log files to try to determine if there is something more specific that I can find.
If this cannot be resolved, we may need to abandon the use of wave containers, but I am hoping it does not come to that, and I will try to ask Seqera for assistance before we get to that stage.