Could you try now with: https://csc-tesk-cineca.c03.k8s-popup.csc.fi
Some things have changed since April
@shukapoo @lvarin Actually, this issue still persists, even when I run a workflow from inside the Kubernetes cluster. Oddly, tutorial.nf runs fine, but main.nf from this repository hangs forever, even though they are both very simple.
I did some experiments and discovered that the culprit is the memory requirement line in main.nf:
memory '100KB'
If you comment out this line, everything runs in a few minutes. If you leave it in, the task hangs seemingly forever.
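For context, the directive sits inside an ordinary process block; a minimal sketch (not the actual main.nf, the process name and script are made up) looks like this:

```groovy
process sayHello {
    // a sub-gigabyte request like this one is what triggers the hang
    memory '100KB'

    """
    echo 'Hello world!'
    """
}
```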
I wonder if this problem is in TESK itself, in the TESK configuration for this specific deployment, or in Nextflow's TES executor implementation. I guess we'll need to do more debugging to find out.
I will check.
Could you paste the command to run Nextflow with the repository? (faster than me searching for it)
Found it:
nextflow run main.nf -with-docker centos -w /mnt
Yes that's right. The exact protocol I'm following is as described in #15.
(Although, well, in #15 the memory requirement is already disabled, so you'll have to re-enable it in order to reproduce)
Found the problem:
Warning FailedCreate 1m job-controller Error creating: pods "task-df1fb0a7-ex-00-fhrwh" is forbidden: [minimum memory usage per Pod is 6Mi, but request is 0., minimum memory usage per Container is 4Mi, but request is 0.]
In this case the pod never gets created, and the task never ends.
And this shows two problems. One is that TESK is rounding 100K down to 0G; I think this is easily solvable. Problem number two is how to then make the request with a proper, higher value. I can see two options: either TESK reports this error back to Nextflow and the user has to raise the limit, or TESK is aware of the limit and raises the request automatically. I prefer option one, as that way more corner cases can be covered.
Anyway, this is a strange error: "You ask for too little".
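For the first option, the user-side fix would simply be a bigger request. For instance, something like this in nextflow.config should clear the cluster's 6Mi per-pod minimum (a sketch; the process name sayHello is made up, not taken from the repository):

```groovy
// nextflow.config
process {
    // request something above the cluster's 6Mi per-pod minimum
    withName: 'sayHello' {
        memory = '1 GB'
    }
}
```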
I am doing some random tests. Do you know of any way to see the call that Nextflow makes to the TESK API? In principle, TESK takes the number of GB as a double.
From what I can see on the TESK side, the memory is sent as 0.0 by Nextflow:
"resources":{"cpu_cores":1,"ram_gb":0.0}
The bug is documented in the code: https://github.com/nextflow-io/nextflow/blob/master/modules/nf-ga4gh/src/main/nextflow/ga4gh/tes/executor/TesTaskHandler.groovy#L206
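For reference, the truncation can be reproduced in isolation; a small sketch, assuming Nextflow's nextflow.util.MemoryUnit class is on the classpath:

```groovy
import nextflow.util.MemoryUnit

def mem = new MemoryUnit('100KB')

// toGiga() returns a long, so anything below 1 GB collapses to 0,
// which then goes out to TESK as "ram_gb":0.0
assert mem.toGiga() == 0

// converting from the raw byte count keeps the fractional gigabytes
assert mem.toBytes() / (1024 * 1024 * 1024) > 0
```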
A proposed fix is available here: https://github.com/shukapoo/nextflow/compare/091cca2346c65078f3089b17da0189d23bb7964f..master. @shukapoo will submit the PR to the Nextflow repository.
The PR by @shukapoo has been submitted here: https://github.com/nextflow-io/nextflow/pull/1689, but the maintainer of Nextflow pointed out that the changes might break other (non-GA4GH) components, so it will have to be rewritten to affect only the GA4GH case.
So if I understood correctly, he wants us to change the getResources method in the TesTaskHandler class so that it does not use toGiga from MemoryUnit, but instead uses our own conversion written in TesTaskHandler.
I think his suggestion makes sense, but wouldn't it be better to have this new method in MemoryUnit instead and call it toGigaDouble or something like that?
Yes, it looks like Paolo favours changes which are strictly local to the GA4GH package at this point.
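A change that stays entirely inside TesTaskHandler could then look roughly like this; a sketch under that constraint, not the actual patch, and the helper name toGigaDouble and the BYTES_PER_GB constant are made up here:

```groovy
// Hypothetical helper local to TesTaskHandler, so MemoryUnit itself stays untouched
private static final long BYTES_PER_GB = 1024L * 1024L * 1024L

private static Double toGigaDouble(nextflow.util.MemoryUnit mem) {
    // keep the fractional part instead of truncating like MemoryUnit.toGiga() does
    mem == null ? null : mem.toBytes() / (double) BYTES_PER_GB
}
```

getResources would then fill the ram_gb field of the request from toGigaDouble(task.config.getMemory()) instead of calling toGiga().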
Compiled the most recent version of Nextflow (21.01.0-edge) on the pod and verified that it now handles the memory requirements properly. Closing this ticket.
I submitted a test "Hello world" job through Nextflow yesterday. After more than 20 hours, it is still running: https://csc-tesk-test.rahtiapp.fi/v1/tasks?view=FULL (look for task ID task-805fbf57). Of course, that job did not have a chance to run successfully, because it included input/output URLs from my local filesystem (/home/ktsukanov/...). However, it seems to me that jobs shouldn't get stuck like this anyway; they should fail with an error or something.