hirokomatsui closed this issue 6 years ago
unable to contact qmaster using port 6444 on host "fl-hn2"
I saw two alerts from fl-hn2 a moment ago which appear to be related to the CPU load on that system. There are several user processes running on fl-hn2. Take a look at top to see what I mean.
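The "unable to contact qmaster" symptom can be reproduced outside of SGE with a plain TCP connect to the qmaster port. A minimal sketch (the host and port come from the error in the title; the timeout value is my own choice):

```python
import socket

def qmaster_reachable(host: str, port: int = 6444, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the qmaster port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On a submit host, `qmaster_reachable("fl-hn2")` would confirm whether the port is even reachable before digging into SGE itself; here the port answered but the qmaster was too loaded to respond in time, which a raw connect check would not catch.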
While I can move the sge_qmaster, it might be worth understanding why those are not submitted as jobs....
I am not aware of long filename restrictions in SGE.
The fl-hn2 system is quite slow. Here are the users running processes on fl-hn2, from a quick top grab. If you cannot contact them, I will renice and, if necessary, kill these processes.
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29282 bill      20   0 2177972 428008    416 D  55.6  0.2   3:01.24 python
29281 bill      20   0 2216164 466512    728 R  50.0  0.2   3:02.41 python
 1683 root       0 -20      0       0      0 S  44.4  0.0   0:07.64 kworker/20:1H
28061 tsolomon  20   0  332696 317192    372 D  11.1  0.1 197:40.13 vcftools
29602 root      20   0   43240   2520   1300 R  11.1  0.0   0:01.83 top
  248 root      20   0       0      0      0 D   5.6  0.0   4:17.01 kswapd1
 6107 bill      20   0   43600  20228    380 D   5.6  0.0  90:26.94 samtools
 7565 sgeadmin  20   0 2161880 617152   1660 S   5.6  0.2 1414:53   sge_qmaster
22274 djakubo+  20   0  496640  21280    680 S   5.6  0.0  40:28.37 jupyter-noteboo
29280 bill      20   0 2012152 305948    424 D   5.6  0.1   2:32.99 python
29283 bill      20   0 2041848 345972    572 D   5.6  0.1   2:41.12 python
29284 bill      20   0 2042872 346716    572 D   5.6  0.1   2:49.42 python
29285 bill      20   0 2058488 367804    572 D   5.6  0.1   2:53.67 python
29286 bill      20   0 2056440 364960    340 D   5.6  0.1   2:52.64 python
29287 bill      20   0 2033912 335204    572 D   5.6  0.1   2:39.58 python
29288 root       0 -20   28820  10420   3664 D   5.6  0.0   0:18.67 atop
29292 djakubo+  20   0 2542440 959896   1056 D   5.6  0.4   0:44.61 python
```
Load is > 90 on fl-hn2.
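To see at a glance which users dominate a snapshot like the one above, the %CPU column can be totaled per user. A small sketch (column positions are assumed from the header shown in the top grab):

```python
from collections import defaultdict

def cpu_by_user(top_lines):
    """Sum the %CPU column per user from `top -b`-style process lines.

    Assumes the column order shown above:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    totals = defaultdict(float)
    for line in top_lines:
        fields = line.split()
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip header or blank lines
        user, cpu = fields[1], float(fields[8])
        totals[user] += cpu
    return dict(totals)

# Sample lines taken from the snapshot above
sample = [
    "29282 bill     20   0 2177972 428008  416 D 55.6 0.2   3:01.24 python",
    "29281 bill     20   0 2216164 466512  728 R 50.0 0.2   3:02.41 python",
    "28061 tsolomon 20   0  332696 317192  372 D 11.1 0.1 197:40.13 vcftools",
]
```

Run against the full grab, this makes it obvious that the bill python processes account for the bulk of the user CPU on fl-hn2.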
OK, so maybe the qstat errors are not related to the submitted jobs.
It seems that way at the moment. Looking for the highest-impact item on fl-hn2. If I have to move the qmaster I will, or I will implement limits.
The node is also swapping a lot due to a bedtools process consuming a large amount of memory:
```
18350 - 0 0 1572K 4012K 240.4G 140K 240.4G 234.8G 0K 0K -204K 4.8G djakubos 93% bedtools
```
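The atop columns mix unit suffixes (K, M, G), so comparing them takes a moment; a small helper to normalize them into bytes (binary units assumed, which is what atop reports):

```python
# Binary (1024-based) unit multipliers, as used by atop
_UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def parse_size(text: str) -> int:
    """Convert an atop-style size like '240.4G', '140K', or '-204K' to bytes."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(float(text))
```

With this, the 240.4G virtual size of the bedtools process works out to roughly 258 GB, against 4.8G resident, i.e. almost all of it paged out.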
@djakubosky are you aware you have a bedtools process on fl-hn2 consuming that much RAM?
@billgreenwald just curious what the python processes on fl-hn2 are doing.
CPU usage is high. Just wanting to understand the impact. No change needed yet.
Sorry, I'm running some stuff on fl-hn2. Feel free to kill it if necessary and I'll restart it when I'm at a computer, on the cluster instead. It was impossible to get onto the cluster for a while due to a ton of jobs running, so I've been using the head node instead.
On Jan 29, 2018 6:20 PM, "tatarsky" notifications@github.com wrote:
I'm trying a renice first.
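For the record, the renice can also be done programmatically against a list of PIDs. A sketch (note that an unprivileged user can only raise niceness, and the PID list passed in would come from something like the top grab above):

```python
import os

def renice(pids, niceness=19):
    """Set the nice value for each PID; return the PIDs that failed
    (e.g. nonexistent process, or EPERM on another user's process)."""
    failed = []
    for pid in pids:
        try:
            os.setpriority(os.PRIO_PROCESS, pid, niceness)
        except OSError:
            failed.append(pid)
    return failed
```

As root (which is how it was done here), this works across users; niceness 19 keeps the processes running but lets the qmaster win scheduling contention.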
Seems to be helping a bit. Holding for now. Thanks for the evening response, @billgreenwald.
Hi Paul, yes I'm aware. I wasn't able to kill this because I couldn't do anything in the terminal (everything locked up); feel free to kill it.
On Mon, Jan 29, 2018 at 6:23 PM tatarsky notifications@github.com wrote:
--
David Jakubosky
Biomedical Sciences Graduate Program
Laboratory of Kelly A. Frazer, PhD
Institute for Genomic Medicine, University of California at San Diego
I don't need to kill it at the moment as the CPU load is better....
I think we're ok for now @djakubosky. I'll advise if memory becomes an issue compared to CPU load. Thanks also for the evening response.
@hirokomatsui advise if problems continue with the reniced jobs on fl-hn2 and I'll move the qmaster over to fl-hn1.
I have decided for the evening to failover the qmaster to fl-hn1. We can revisit in the morning.
The python processes are a multithreaded ATAC-seq analysis program I'm running.
On Jan 29, 2018 7:04 PM, "tatarsky" notifications@github.com wrote:
That was one impact; the other is a series of "sort" jobs being run on the old nodes, which is inefficient because it's bottlenecked by the 1 Gbit/sec maximum from the old nodes. That raises CPU load as well, since those nodes have only NFS access to the /frazer01 filesystem via fl-hn2.
The combined CPU load delayed the qmaster on fl-hn2 enough to produce those errors (timeouts). Renicing was helping, but for now failing over to the mostly idle fl-hn1 is fine. (Either head node can act as the qmaster.)
I'm wondering if SGE has any limit on string lengths, e.g. for job script paths. qstat returned:

```
3> qstat -j 5167063
failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Following jobs do not exist:
5167063
```
One of the scripts with a long name is:

```
/home/bill/CARDIPS-caQTL-manuscript/InitialWork/100bp-150bp-batch-effects/100bp-downsample-fastas/output/29f6f3bb-de7e-4204-adf3-19149bf7c143/sh/29f6f3bb-de7e-4204-adf3-19149bf7c143_submit_2018_01_29_17_26_55_160728.sh
```
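I don't know of a documented SGE string-length limit offhand, but that path is well within the usual Linux filesystem limits, which argues against a filename problem. A quick check against the typical limits (255 bytes per path component, 4096 bytes total, the common Linux NAME_MAX/PATH_MAX values):

```python
def check_path_limits(path, name_max=255, path_max=4096):
    """Report a path's length against typical Linux NAME_MAX/PATH_MAX limits."""
    components = [c for c in path.split("/") if c]
    return {
        "total_length": len(path),
        "longest_component": max(len(c) for c in components),
        "within_path_max": len(path) < path_max,
        "within_name_max": all(len(c) <= name_max for c in components),
    }

# The long submit-script path from above
script = ("/home/bill/CARDIPS-caQTL-manuscript/InitialWork/"
          "100bp-150bp-batch-effects/100bp-downsample-fastas/output/"
          "29f6f3bb-de7e-4204-adf3-19149bf7c143/sh/"
          "29f6f3bb-de7e-4204-adf3-19149bf7c143_submit_2018_01_29_17_26_55_160728.sh")
```

Since the path clears both limits comfortably, the timeout explanation (qmaster too loaded to answer the GDI request) fits the evidence better than a length restriction.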
I'll test some more tomorrow; just in case you know something quick.