HumanCellAtlas / secondary-analysis

Secondary Analysis Service of the Human Cell Atlas Data Coordination Platform
https://pipelines.data.humancellatlas.org/ui/
BSD 3-Clause "New" or "Revised" License
3 stars 2 forks source link

Look into long running analysis workflows #302

Closed kbergin closed 4 years ago

kbergin commented 5 years ago

At the scale testing F2F we observed some very long running workflows that were A) getting stuck running a particular task, and then getting B) pre-empted and taking even longer to run. We noticed this particularly often with Hisat2, but once with RSEM as well. As an example, most workflows ran for 40min - 1.5 hours and some workflows that got stuck like this did eventually complete on their own but took over 3 hours to run. We did up the memory on a branch to try and see if that fixed it. We suspect it helped with the tasks that got stuck on collectmultiplemetrics as we realized that task was setting memory in the java command hardcoded and that was hitting that memory limit.

We also tried updating the memory for hisat2, but didn't actually observe it using all of the original memory and don't suspect that helped. We should spend more time digging into these long running workflows.

AC:

┆Issue is synchronized with this Jira Bug

dshiga commented 5 years ago

Seeing exit status 137 (indicating out of memory) in HISAT2PairedEnd.log for some jobs, for example for attempt 2 of workflow be393d27-f2b9-492b-a007-e44f7cb8b32a.

This job ran with 30 GB of memory (we've been experimenting with different memory amounts).

2018/09/13 19:35:13 I: Sending SIGINT to 4368 2018/09/13 19:35:13 I: Calling SetOperationStatus(start-shutdown) 2018/09/13 19:35:13 I: SetOperationStatus(start-shutdown) succeeded 2018/09/13 19:35:22 E: command failed: Time loading forward index: 00:00:05 Time loading reference: 00:00:01 Multiseed full-index search: 00:01:57 HISAT2 summary stats: Total pairs: 1493128 Aligned concordantly or discordantly 0 time: 311528 (20.86%) Aligned concordantly 1 time: 436907 (29.26%) Aligned concordantly >1 times: 740922 (49.62%) Aligned discordantly 1 time: 3771 (0.25%) Total unpaired reads: 623056 Aligned 0 time: 540520 (86.75%) Aligned 1 time: 27422 (4.40%) Aligned >1 times: 55114 (8.85%) Overall alignment rate: 81.90% Time searching: 00:01:58 Overall time: 00:02:11 (exit status 137)

This is for bundle 88399594-4c45-4e8a-86ab-7796638badbb in integration. Here is the cell suspension metadata:

{ "describedBy": "http://schema.integration.data.humancellatlas.org/type/biomaterial/8.6.1/cell_suspension", "schema_type": "biomaterial", "biomaterial_core": { "biomaterial_id": "GSM2243625", "biomaterial_name": "Single cell from Tumor,1001000175.C5", "ncbi_taxon_id": [ 9606 ] }, "plate_based_sequencing": { "plate_id": "1001000175", "well_id": "C5" }, "genus_species": [ { "text": "Homo sapiens", "ontology": "NCBITaxon:9606" } ], "selected_cell_type": [ { "text": "astrocyte", "ontology": "CL:0000127" } ], "total_estimated_cells": 1, "provenance": { "document_id": "5268f190-89de-4db6-a5d1-1a61f04fa914", "submission_date": "2018-09-13T17:03:50.564Z", "update_date": "2018-09-13T17:11:43.171Z" } }

dshiga commented 5 years ago

RSEMExpression had two attempts in this workflow - the first seems to have taken more than an hour: 1ed1af7f-7796-4ba0-b603-95c8915e83d8. This could be a memory thing.

Long running HiSAT2PairedEnd, but log shows memory usage was low: 99d7c127-fbc4-45d1-97b4-ad12a64c0a45

CollectMultipleMetrics job that ran out of memory: 6feb4fb9-35ad-4755-bfa1-dee876ee215b

INFO 2018-09-13 18:58:58 SinglePassSamProgram Processed 7,000,000 records. Elapsed time: 00:01:47s. Time for last 1,000,000: 2s. Last read position: chr21:8,438,305 /cromwell_root/monitoring.sh: line 23: free: command not found /cromwell_root/monitoring.sh: line 27: free: command not found INFO 2018-09-13 18:59:08 SinglePassSamProgram Processed 8,000,000 records. Elapsed time: 00:01:58s. Time for last 1,000,000: 10s. Last read position: GL000220.1:109,379 (exit status 137)

Long running but successful on first attempt CollectMultipleMetrics, ran more than an hour: 62781bce-80f6-41b4-a944-923c4a53114a

Create submission took 45 minutes here: 5da0bd65-e255-4b35-82d5-8f7e3dd68a1a

Hisat2 job that ran out of memory: b17e6aea-77ec-48ec-a215-4ecb596038fe

2018/09/13 19:27:45 I: Sending SIGINT to 4373 2018/09/13 19:27:45 I: Calling SetOperationStatus(start-shutdown) 2018/09/13 19:27:45 I: SetOperationStatus(start-shutdown) succeeded 2018/09/13 19:27:55 E: command failed: Time loading forward index: 00:00:05 Time loading reference: 00:00:01 (exit status 137)