ddavisqa / google-refine

Automatically exported from code.google.com/p/google-refine

Feature : Google Refine to use all processors #343

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

It would be nice if Google Refine could use all the processors / cores available on a server. At the moment, only one processor / core seems to be used.
Parallelization may speed things up (I'm using Refine on big files, from 4 GB to 12 GB).

PS: very nice software. Congrats and keep up the good work. ;)

[root@myserver google-refine-2.0]# mpstat 1 -P ALL
Linux 2.6.21.7-2.ec2.v1.2.fc8xen (myserver)     02/28/11

12:50:31     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:50:32     all   50.00    0.00    0.00    0.00    0.00    0.00    0.00   50.00   1053.00
12:50:32       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     53.00
12:50:32       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1001.00

12:50:32     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:50:33     all   49.75    0.00    0.00    0.00    0.00    0.00    0.00   50.25   1029.00
12:50:33       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     28.00
12:50:33       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1000.00

12:50:33     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:50:34     all   50.00    0.00    0.00    0.00    0.00    0.00    0.00   50.00   1035.00
12:50:34       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     35.00
12:50:34       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1000.00

12:50:34     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:50:35     all   50.00    0.00    0.00    0.00    0.00    0.00    0.00   50.00   1028.00
12:50:35       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     28.00
12:50:35       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1001.00

12:50:35     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:50:36     all   50.00    0.00    0.00    0.00    0.00    0.00    0.00   50.00   1030.00
12:50:36       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     29.00
12:50:36       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   1000.00

Original issue reported on code.google.com by vincent....@gmail.com on 28 Feb 2011 at 5:54

GoogleCodeExporter commented 9 years ago
What operations would it be most valuable to you to have parallelized?

Original comment by tfmorris on 28 Feb 2011 at 6:02

GoogleCodeExporter commented 9 years ago
BTW, clustering already uses all the cores it can find.

Original comment by stefa...@google.com on 28 Feb 2011 at 7:18

GoogleCodeExporter commented 9 years ago
QUOTE: What operations would it be most valuable to you to have parallelized?

ANSWER: Creating a project from a big flat file (for us, big starts at 4 GB). So I mean importing big files.

Today we create projects from files stored on AWS S3 / CloudFront. This works fine, except for really big files. In that case, Refine "works" (from what we can see using iostat, mpstat, top, etc., the data is loaded), but nothing is updated on the browser side.

Anyway, multi-core / multi-processor support would be good for "extreme" users like us (data miners), and more interactivity / information on the browser side would also be nice, so that we know what's going on (RAM loading, Freebase inserts, etc.).

Many thanks

Original comment by vincent....@gmail.com on 1 Mar 2011 at 9:32
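
For readers wondering what a parallel import of a big flat file could look like, here is a minimal sketch in Python. It is not how Refine's importer works and not a proposed patch. It only illustrates the general idea requested above: split the file into byte ranges that end on line boundaries, then let one worker process per core parse each range. The function names (`chunk_ranges`, `count_rows_in_range`, `parallel_row_count`) are hypothetical, and the "parsing" is reduced to counting non-empty rows. It assumes one record per line, so quoted fields with embedded newlines would need a smarter splitter.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def chunk_ranges(path, n_chunks):
    """Split a file into roughly n_chunks byte ranges, each ending on a newline."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = max(1, size // n_chunks)
    ranges, start = [], 0
    with open(path, "rb") as f:
        while start < size:
            f.seek(min(start + step, size))
            f.readline()  # move forward to the end of the current line
            end = min(f.tell(), size)
            ranges.append((start, end))
            start = end
    return ranges


def count_rows_in_range(task):
    """Worker: read one byte range and return the number of non-empty rows."""
    path, start, end = task
    rows = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if line.strip():
                rows += 1
    return rows


def parallel_row_count(path, workers=None):
    """Fan the byte ranges out to one process per core and sum the results."""
    workers = workers or os.cpu_count() or 1
    tasks = [(path, s, e) for s, e in chunk_ranges(path, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_rows_in_range, tasks))
```

Calling `parallel_row_count("big-file.tsv")` (the file name is a placeholder) should keep every core busy instead of only one, as in the mpstat output above. On platforms where worker processes are spawned rather than forked, the call needs to sit under an `if __name__ == "__main__":` guard.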