CINECA-project / wp4-federated-joint-cohort-analysis

6 stars 0 forks source link

Specifying memory requirements causes Nextflow TESK tasks to hang #14

Closed tskir closed 3 years ago

tskir commented 4 years ago

I submitted a test "Hello world" job through Nextflow yesterday. After more than 20 hours, it is still running: https://csc-tesk-test.rahtiapp.fi/v1/tasks?view=FULL (look for task ID task-805fbf57).

Of course, that job did not have a chance to run successfully, because it included input/output URLs from my local filesystem (/home/ktsukanov/...). However, it seems to me that the jobs shouldn't get stuck like this anyway, they should fail with an error or something.

lvarin commented 4 years ago

Could you try now with: https://csc-tesk-cineca.c03.k8s-popup.csc.fi

Some things have changed since April

tskir commented 4 years ago

@shukapoo @lvarin Actually this issue still persists, even when I run a workflow from inside the Kubernetes. Oddly, the tutorial.nf runs, but the main.nf from this repository hangs forever, even though they are both very simple.

I did some experiments and discovered that the culprit is the memory requirement line in main.nf:

    memory '100KB'

If you comment out this line, everything runs in a few minutes. If you leave that in, the task will hang for seemingly forever.

I wonder if this problem is in TESK itself, in TESK configuration for this specific, or in Nextflow's TESK implementation. I guess we'll need to do more debugging to find out.

lvarin commented 4 years ago

I will check.

Could you paste it the command to run Nextflow with the repository? (faster than me searching for it)

lvarin commented 4 years ago

Found it:

nextflow run main.nf -with-docker centos -w /mnt
tskir commented 4 years ago

Yes that's right. The exact protocol I'm following is as described in #15.

tskir commented 4 years ago

(Although, well, in #15 the memory requirement is already disabled, so you'll have to re-enable it in order to reproduce)

lvarin commented 4 years ago

Found the problem:

  Warning  FailedCreate  1m    job-controller  Error creating: pods "task-df1fb0a7-ex-00-fhrwh" is forbidden: [minimum memory usage per Pod is 6Mi, but request is 0., minimum memory usage per Container is 4Mi, but request is 0.]

In this case the pod never gets created, and the task never ends.

And this shows two problems, one that TESK is rounding 100K to 0G, I think this is easy solvable. But problem number two is how to then make the request with a proper higher value. I can see two options, either TESK reports this error back to Nextflow and the user will have to rise the limit, either TESK is aware of the limit and raises automatically. I prefer option one, as this way more corner cases can be covered.

Anyways, this is a strange error: "You ask for too little".

lvarin commented 4 years ago

I am doing some random tests, and:

Do you know anyway to see the call that nextflow does to TESK API? In principle TESK takes the numbe of GB in a double.

lvarin commented 4 years ago

For what I can see from TESK side, memory is sent as 0.0 by Nextflow:

"resources":{"cpu_cores":1,"ram_gb":0.0}
lvarin commented 4 years ago

The bug is documented in the code: https://github.com/nextflow-io/nextflow/blob/master/modules/nf-ga4gh/src/main/nextflow/ga4gh/tes/executor/TesTaskHandler.groovy#L206

tskir commented 4 years ago

Proposed fix available here: https://github.com/shukapoo/nextflow/compare/091cca2346c65078f3089b17da0189d23bb7964f..master, @shukapoo to submit the PR into the Nextflow repository

tskir commented 4 years ago

PR by @shukapoo submitted here: https://github.com/nextflow-io/nextflow/pull/1689, but the maintainer of Nextflow pointed out that the changes might break other (non-GA4GH) components, so this will have to be rewritten to only affect the GA4GH case

lvarin commented 4 years ago

So if I understood, he want us to change the method getResources in class TesTaskHandler to not use toGiga from MemoryUnit but write our own in TesTaskHandler

I think his suggestion makes sense, but wouldn't it be better to have this new method in MemoryUnit itstead and call it toGigaDouble or something like that?

tskir commented 4 years ago

Yes, looks like Paolo favours changes which are strictly local to the GA4GH package at this point

tskir commented 3 years ago

Compiled the most recent version of Nextflow (21.01.0-edge) on the pod and verified that it now handles the memory requirements properly. Closing this ticket.