frazer-lab / cluster

Repo for cluster issues.

SGE any limit of string length #234

Closed hirokomatsui closed 6 years ago

hirokomatsui commented 6 years ago

I'm wondering if SGE has any limit on string length, for example for script paths.

qstat returned:

3> qstat -j 5167063
failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Following jobs do not exist:
5167063

qstat -j 5167061
error: commlib error: got select error (Connection reset by peer)
unable to contact qmaster using port 6444 on host "fl-hn2"
failed receiving gdi request response for mid=1 (can't find connection).

One of the scripts with a long name is: /home/bill/CARDIPS-caQTL-manuscript/InitialWork/100bp-150bp-batch-effects/100bp-downsample-fastas/output/29f6f3bb-de7e-4204-adf3-19149bf7c143/sh/29f6f3bb-de7e-4204-adf3-19149bf7c143_submit_2018_01_29_17_26_55_160728.sh

I'll test some more tomorrow. Just asking in case you know something offhand.

tatarsky commented 6 years ago

unable to contact qmaster using port 6444 on host "fl-hn2"

I saw two alerts from fl-hn2 a moment ago which appear to be related to the CPU load on that system. There are several user items running on fl-hn2. Take a look at top to see what I mean.

While I can move the sge_qmaster, it might be worth understanding why those are not running as jobs.

I am not aware of long filename restrictions in SGE.
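One quick way to rule out the operating system side is to compare the script path against the usual Linux limits. This is only a minimal sketch: it assumes the common Linux values PATH_MAX = 4096 and NAME_MAX = 255, the function name is hypothetical, and it says nothing about any limit inside SGE itself.

```python
import os

# Assumed typical Linux limits; any SGE-internal limit is NOT checked here.
PATH_MAX = 4096
NAME_MAX = 255


def path_within_os_limits(path, path_max=PATH_MAX, name_max=NAME_MAX):
    """Return True if the full path and each component fit the given limits."""
    encoded = os.fsencode(path)
    # PATH_MAX counts the terminating NUL byte, so the path must be shorter.
    if len(encoded) >= path_max:
        return False
    # Every directory or file name between slashes must fit NAME_MAX.
    return all(len(part) <= name_max for part in encoded.split(b"/"))


# The script path quoted in the first comment of this thread.
script = (
    "/home/bill/CARDIPS-caQTL-manuscript/InitialWork/"
    "100bp-150bp-batch-effects/100bp-downsample-fastas/output/"
    "29f6f3bb-de7e-4204-adf3-19149bf7c143/sh/"
    "29f6f3bb-de7e-4204-adf3-19149bf7c143_submit_2018_01_29_17_26_55_160728.sh"
)
print(path_within_os_limits(script))
```

If this prints True, the path is at least not hitting a filesystem limit, which is consistent with the qstat errors coming from somewhere else.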

tatarsky commented 6 years ago

The fl-hn2 system is quite slow. Here are the users running items on fl-hn2 from a quick top grab. If you cannot contact them I will renice and if necessary kill these processes.


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                    
29282 bill      20   0 2177972 428008    416 D  55.6  0.2   3:01.24 python                                                     
29281 bill      20   0 2216164 466512    728 R  50.0  0.2   3:02.41 python                                                     
 1683 root       0 -20       0      0      0 S  44.4  0.0   0:07.64 kworker/20:1H                                              
28061 tsolomon  20   0  332696 317192    372 D  11.1  0.1 197:40.13 vcftools                                                   
29602 root      20   0   43240   2520   1300 R  11.1  0.0   0:01.83 top                                                        
  248 root      20   0       0      0      0 D   5.6  0.0   4:17.01 kswapd1                                                    
 6107 bill      20   0   43600  20228    380 D   5.6  0.0  90:26.94 samtools                                                   
 7565 sgeadmin  20   0 2161880 617152   1660 S   5.6  0.2   1414:53 sge_qmaster                                                
22274 djakubo+  20   0  496640  21280    680 S   5.6  0.0  40:28.37 jupyter-noteboo                                            
29280 bill      20   0 2012152 305948    424 D   5.6  0.1   2:32.99 python                                                     
29283 bill      20   0 2041848 345972    572 D   5.6  0.1   2:41.12 python                                                     
29284 bill      20   0 2042872 346716    572 D   5.6  0.1   2:49.42 python                                                     
29285 bill      20   0 2058488 367804    572 D   5.6  0.1   2:53.67 python                                                     
29286 bill      20   0 2056440 364960    340 D   5.6  0.1   2:52.64 python                                                     
29287 bill      20   0 2033912 335204    572 D   5.6  0.1   2:39.58 python                                                     
29288 root       0 -20   28820  10420   3664 D   5.6  0.0   0:18.67 atop                                                       
29292 djakubo+  20   0 2542440 959896   1056 D   5.6  0.4   0:44.61 python

tatarsky commented 6 years ago

Load is > 90 on fl-hn2.
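For context, a load figure like this is usually judged against the number of cores. The following is a rough sketch of that comparison; the function names and the threshold of 1.0 per core are assumptions for illustration, not anything configured on fl-hn2.

```python
import os


def load_per_core(load_1min, cores):
    """Return the 1-minute load average divided by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_1min / cores


def looks_overloaded(load_1min, cores, threshold=1.0):
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_1min, cores) > threshold


if __name__ == "__main__":
    # Read the live values from the machine this runs on (Unix only).
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load={load_1min:.1f} cores={cores} "
          f"overloaded={looks_overloaded(load_1min, cores)}")
```

Note that on Linux, processes in uninterruptible I/O wait (state D in top) also count toward the load average, so a high load does not always mean the CPUs alone are saturated.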

hirokomatsui commented 6 years ago

OK, so maybe the qstat errors are not related to the submitted jobs.

tatarsky commented 6 years ago

It seems that way at the moment. Looking for the highest impact item on fl-hn2. If I have to move the qmaster I will or I will implement limits.

tatarsky commented 6 years ago

The unit is also swapping a lot due to a bedtools process consuming a large amount of memory:

18350     -      0      0   1572K  4012K 240.4G   140K  240.4G 234.8G     0K     0K   -204K   4.8G djakubos  93%  bedtools

tatarsky commented 6 years ago

@djakubosky are you aware you have a bedtools on fl-hn2 consuming that much ram?

tatarsky commented 6 years ago

@billgreenwald just curious what the python items on fl-hn2 are doing.

CPU usage is high. Just wanting to understand the impact. No change needed yet.

billgreenwald commented 6 years ago

Sorry, I'm running some stuff on fl-hn2. Feel free to kill it if necessary and I'll restart it when I'm at a computer on the cluster instead. It was impossible to get onto the cluster for a while because of a ton of jobs running, so I've been using the head node instead.

On Jan 29, 2018 6:20 PM, "tatarsky" notifications@github.com wrote:

It seems that way at the moment. Looking for the highest impact item on fl-hn2. If I have to move the qmaster I will or I will implement limits.


tatarsky commented 6 years ago

I'm trying a renice first.
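As an illustration of what a renice does (not necessarily the exact command used here), a minimal Python sketch follows. The function name and the target value of 10 are assumptions; changing another user's process requires root, while raising the nice value of your own process does not.

```python
import os


def raise_niceness(pid, target=10):
    """Raise a process's nice value to at least `target`; never lower it.

    A higher nice value means a lower scheduling priority, so the process
    yields CPU time to others such as sge_qmaster.
    """
    current = os.getpriority(os.PRIO_PROCESS, pid)
    new_value = max(current, target)
    if new_value != current:
        os.setpriority(os.PRIO_PROCESS, pid, new_value)
    # Read the value back so the caller sees what is actually in effect.
    return os.getpriority(os.PRIO_PROCESS, pid)


if __name__ == "__main__":
    # Demonstrate on the current process only.
    print(raise_niceness(os.getpid(), target=10))
```

Renicing only affects CPU scheduling. Processes that are stuck waiting on disk or NFS I/O will still contribute to the load.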

tatarsky commented 6 years ago

Seems to be helping a bit. Holding for now. Thanks for the evening response @billgreenwald

djakubosky commented 6 years ago

Hi Paul, yes, I’m aware. I wasn’t able to kill this because I couldn’t do anything in the terminal (everything locked up). Feel free to kill it.

On Mon, Jan 29, 2018 at 6:23 PM tatarsky notifications@github.com wrote:

@djakubosky are you aware you have a bedtools on fl-hn2 consuming that much ram?


-- ____

David Jakubosky Biomedical Sciences Graduate Program Laboratory of Kelly A. Frazer, PhD Institute for Genomic Medicine, University of California at San Diego


tatarsky commented 6 years ago

I don't need to kill it at the moment as the CPU load is better....

tatarsky commented 6 years ago

I think we're ok for now @djakubosky. I'll advise if memory becomes an issue compared to CPU load. Thanks also for the evening response.

tatarsky commented 6 years ago

@hirokomatsui advise if matters continue with reniced jobs on fl-hn2 and I'll move the qmaster over to fl-hn1.

tatarsky commented 6 years ago

I have decided to fail over the qmaster to fl-hn1 for the evening. We can revisit in the morning.

billgreenwald commented 6 years ago

The python processes are a multithreaded ATAC-seq analysis program I'm running.

On Jan 29, 2018 7:04 PM, "tatarsky" notifications@github.com wrote:

I have decided for the evening to failover the qmaster to fl-hn1. We can revisit in the morning.


tatarsky commented 6 years ago

That was one impact. The other is a series of "sort" jobs being run on the old nodes, which is inefficient because they are bottlenecked by the 1 Gbit/sec maximum from the old nodes. That raises CPU load as well, since those nodes have only NFS access to the /frazer01 filesystem via fl-hn2.

The combined CPU load was enough to delay the qmaster running on fl-hn2 to the point that it produced those errors (timeouts). Renicing was helping, but for now failing over to the mostly idle fl-hn1 is fine. (Both head nodes can be the qmaster.)
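For anyone who wants to see who is contributing to the load on a head node, the top snapshot earlier in this thread can be summed per user. This is a rough sketch under the assumption that the input uses the default top column layout shown above; the function name is hypothetical and the sample lines are copied from that snapshot.

```python
from collections import defaultdict


def cpu_by_user(top_lines):
    """Sum the %CPU column per user from top process lines."""
    totals = defaultdict(float)
    for line in top_lines:
        fields = line.split()
        # Expected columns:
        # PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip the header and any blank or malformed lines
        totals[fields[1]] += float(fields[8])
    return dict(totals)


sample = [
    "  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND",
    "29282 bill      20   0 2177972 428008    416 D  55.6  0.2   3:01.24 python",
    "29281 bill      20   0 2216164 466512    728 R  50.0  0.2   3:02.41 python",
    "28061 tsolomon  20   0  332696 317192    372 D  11.1  0.1 197:40.13 vcftools",
    " 7565 sgeadmin  20   0 2161880 617152   1660 S   5.6  0.2   1414:53 sge_qmaster",
]

# Print users from highest to lowest total %CPU.
for user, cpu in sorted(cpu_by_user(sample).items(), key=lambda kv: -kv[1]):
    print(f"{user:10s} {cpu:6.1f}")
```

A single top snapshot is noisy, so this is best treated as a hint about where to look rather than a measurement.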