Open FedericaBrando opened 10 months ago
Pipeline gets stuck as if resources were shared and not fully allocated to a single process.
Oncodrive3D is running on several nodes. It has a step that uses multiprocessing to split a for loop across processes: as many as the CPUs allocated (in this case 14 cores).
However, some processes have been stuck in this loop for 2 days, and in some instances the logs look very odd:
2023-10-27 15:07:18,541 - INFO | oncodrive3d - ######################################################################
2023-10-27 15:07:18,541 - INFO | oncodrive3d - # #
2023-10-27 15:07:18,542 - INFO | oncodrive3d - # Welcome to Oncodrive3D! #
2023-10-27 15:07:18,542 - INFO | oncodrive3d - # #
2023-10-27 15:07:18,543 - INFO | oncodrive3d - # Initializing analysis... #
2023-10-27 15:07:18,543 - INFO | oncodrive3d - # Version: 2023.08.23 #
2023-10-27 15:07:18,543 - INFO | oncodrive3d - # Author: Biomedical Genomics Lab - IRB Barcelona #
2023-10-27 15:07:18,544 - INFO | oncodrive3d - # Support: stefano.pellegrini@irbbarcelona.org #
2023-10-27 15:07:18,544 - INFO | oncodrive3d - # #
2023-10-27 15:07:18,545 - INFO | oncodrive3d - ######################################################################
2023-10-27 15:07:18,545 - INFO | oncodrive3d -
2023-10-27 15:07:18,546 - INFO | oncodrive3d - Input MAF: TCGA_WXS_HNSC.in.tsv.gz
2023-10-27 15:07:18,546 - INFO | oncodrive3d - Input mut profile: TCGA_WXS_HNSC.sig.json
2023-10-27 15:07:18,547 - INFO | oncodrive3d - Build directory: /workspace/projects/intogen_plus/fixdatasets-20230223/intogen-plus-dev-o3d/datasets/oncodrive3d
2023-10-27 15:07:18,547 - INFO | oncodrive3d - Output directory: .
2023-10-27 15:07:18,548 - DEBUG | oncodrive3d - Path to CMAPs: /workspace/projects/intogen_plus/fixdatasets-20230223/intogen-plus-dev-o3d/datasets/oncodrive3d/prob_cmaps
2023-10-27 15:07:18,548 - DEBUG | oncodrive3d - Path to DNA sequences: /workspace/projects/intogen_plus/fixdatasets-20230223/intogen-plus-dev-o3d/datasets/oncodrive3d/seq_for_mut_prob.csv
2023-10-27 15:07:18,549 - DEBUG | oncodrive3d - Path to PAE: /workspace/projects/intogen_plus/fixdatasets-20230223/intogen-plus-dev-o3d/datasets/oncodrive3d/pae
2023-10-27 15:07:18,549 - DEBUG | oncodrive3d - Path to pLDDT scores: /workspace/projects/intogen_plus/fixdatasets-20230223/intogen-plus-dev-o3d/datasets/oncodrive3d/confidence.csv
2023-10-27 15:07:18,550 - INFO | oncodrive3d - CPU cores: 14
2023-10-27 15:07:18,550 - INFO | oncodrive3d - Iterations: 10000
2023-10-27 15:07:18,551 - INFO | oncodrive3d - Significant level: 0.01
2023-10-27 15:07:18,551 - INFO | oncodrive3d - Probability threshold for CMAPs: 0.5
2023-10-27 15:07:18,552 - INFO | oncodrive3d - Disable fragments: False
2023-10-27 15:07:18,552 - INFO | oncodrive3d - Output only processed genes: True
2023-10-27 15:07:18,553 - INFO | oncodrive3d - Cohort: TCGA_WXS_HNSC
2023-10-27 15:07:18,553 - INFO | oncodrive3d - Cancer type: HNSC
2023-10-27 15:07:18,554 - INFO | oncodrive3d - Verbose: True
2023-10-27 15:07:18,554 - INFO | oncodrive3d - Seed: 123
2023-10-27 15:07:18,555 - INFO | oncodrive3d - Log path: ./log
2023-10-27 15:07:18,555 - INFO | oncodrive3d -
2023-10-27 15:07:18,556 - DEBUG | oncodrive3d.utils.utils - Reading input MAF...
2023-10-27 15:07:18,846 - DEBUG | oncodrive3d.utils.utils - Processing [82100] total mutations...
2023-10-27 15:07:18,917 - DEBUG | oncodrive3d.utils.utils - Processing [54212] missense mutations...
2023-10-27 15:07:24,944 - DEBUG | oncodrive3d - Detected [4063] genes without enough mutations: Skipping...
2023-10-27 15:07:36,076 - DEBUG | oncodrive3d - Detected [13] genes without IDs mapping: Skipping...
2023-10-27 15:07:36,079 - INFO | oncodrive3d - Computing missense mut probabilities...
2023-10-27 15:08:45,148 - INFO | oncodrive3d - Performing 3D-clustering on [10561] proteins...
2023-10-27 15:08:45,351 - DEBUG | oncodrive3d.utils.clustering - Starting [14] processes...
2023-10-27 15:09:20,552 - DEBUG | oncodrive3d.utils.clustering - Process [1] starting...
[...]
2023-10-27 15:13:04,903 - DEBUG | oncodrive3d.utils.clustering - Process [7] starting...
[...]
2023-10-27 15:21:18,970 - DEBUG | oncodrive3d.utils.clustering - Process [8] completed [141/736] structures...
2023-10-27 15:21:21,080 - DEBUG | oncodrive3d.utils.clustering - Process [3] completed [261/736] structures...
It has been stuck there for two days. The same happens with another process:
2023-10-27 17:44:53,693 - INFO | oncodrive3d - ######################################################################
2023-10-27 17:44:53,693 - INFO | oncodrive3d - # #
2023-10-27 17:44:53,694 - INFO | oncodrive3d - # Welcome to Oncodrive3D! #
2023-10-27 17:44:53,694 - INFO | oncodrive3d - # #
2023-10-27 17:44:53,695 - INFO | oncodrive3d - # Initializing analysis... #
2023-10-27 17:44:53,695 - INFO | oncodrive3d - # Version: 2023.08.23 #
2023-10-27 17:44:53,696 - INFO | oncodrive3d - # Author: Biomedical Genomics Lab - IRB Barcelona #
2023-10-27 17:44:53,696 - INFO | oncodrive3d - # Support: stefano.pellegrini@irbbarcelona.org #
2023-10-27 17:44:53,697 - INFO | oncodrive3d - # #
2023-10-27 17:44:53,697 - INFO | oncodrive3d - ######################################################################
2023-10-27 17:44:53,697 - INFO | oncodrive3d -
2023-10-27 17:44:53,698 - INFO | oncodrive3d - Input MAF: TCGA_WXS_CCRCC.in.tsv.gz
2023-10-27 17:44:53,699 - INFO | oncodrive3d - Input mut profile: TCGA_WXS_CCRCC.sig.json
[...]
2023-10-27 17:47:52,672 - INFO | oncodrive3d - Computing missense mut probabilities...
2023-10-27 18:13:06,252 - INFO | oncodrive3d - Performing 3D-clustering on [3660] proteins...
2023-10-27 18:13:12,476 - DEBUG | oncodrive3d.utils.clustering - Starting [14] processes...
[...]
2023-10-27 22:42:57,957 - DEBUG | oncodrive3d.utils.clustering - Process [5] completed [111/255] structures...
Slurm node information for bbgn019:
NodeName=bbgn019 Arch=x86_64 CoresPerSocket=14
CPUAlloc=48 CPUErr=0 CPUTot=56 CPULoad=35.18
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
GresDrain=N/A
NodeAddr=bbgn019 NodeHostName=bbgn019 Version=16.05
OS=Linux RealMemory=512000 AllocMem=360448 FreeMem=429695 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2023-06-09T13:23:10 SlurmdStartTime=2016-01-01T01:05:30
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Pinging @migrau.
Ferriol told me you had a similar issue with the deepUMI pipeline. How did you solve it?
Here is the loop that uses multiprocessing: https://github.com/bbglab/clustering_3d/blob/b70857d1f88b215f097d6bf0351a8ce4f8ac5191/scripts/utils/clustering.py#L290C3-L306C9
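For anyone who does not want to open the link: the snippet below is not the Oncodrive3D code, just a minimal sketch of the general pattern described above (one chunk of the loop per allocated core). The names `split_into_chunks`, `process_chunk` and `run_parallel` are hypothetical, and the use of `multiprocessing.Pool` is an assumption; the linked code may organize its processes differently.

```python
import multiprocessing as mp


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous chunks of nearly equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Hypothetical stand-in for the per-structure work done in the loop."""
    return [len(name) for name in chunk]


def run_parallel(items, n_processes):
    """Run process_chunk on one chunk per process and flatten the results."""
    chunks = [c for c in split_into_chunks(items, n_processes) if c]
    if not chunks:
        return []
    with mp.Pool(processes=len(chunks)) as pool:
        per_chunk = pool.map(process_chunk, chunks)
    return [result for chunk in per_chunk for result in chunk]


if __name__ == "__main__":
    # Illustrative input only; the real loop iterates over protein structures.
    structures = [f"structure_{i}" for i in range(100)]
    results = run_parallel(structures, n_processes=4)
    print(len(results))
```

With this layout every process holds its own chunk and its own working data in memory at the same time, so the memory requested for the job has to cover all processes together, not just one.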
Never mind, I restarted the pipeline and the processes failed because they exceeded the memory limit. I increased the memory for the Oncodrive3D process to 32 and it did not happen again.
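Since the fix was a larger memory allocation, logging each worker's peak memory next to the progress messages may help when choosing a limit. This is a minimal sketch, assuming Linux or macOS (the `resource` module is not available on Windows). `peak_rss_mb` and `format_progress` are hypothetical helpers, not part of Oncodrive3D; the message only imitates the log format shown above.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def format_progress(process_id, done, total):
    """Build a progress message that also reports peak memory (hypothetical)."""
    return (
        f"Process [{process_id}] completed [{done}/{total}] structures, "
        f"peak RSS {peak_rss_mb():.0f} MiB"
    )


if __name__ == "__main__":
    print(format_progress(3, 261, 736))
```

Called from inside a worker, `peak_rss_mb` reports that worker only, so the per-process values need to be added up to estimate what the whole job requires.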