Too many send and receive thread in timely processor

FangYongs commented 6 years ago

I use a few hundreds processors in my job, and I found that each processor has too many send and receive thread as follows. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 16969 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.11 send thread 0 16970 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.04 recv thread 0 16971 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.12 send thread 1 16972 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.05 recv thread 1 16973 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.12 send thread 2 16974 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.07 recv thread 2 16975 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.12 send thread 3 16976 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.04 recv thread 3 16977 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.09 send thread 4 16978 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.01 recv thread 4 16979 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.12 send thread 5 16980 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.06 recv thread 5 16981 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.12 send thread 6 16982 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.03 recv thread 6 16983 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.11 send thread 7 16984 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.01 recv thread 7 16985 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.12 send thread 8 16986 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.07 recv thread 8 16987 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.11 send thread 9 16988 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.00 recv thread 9 16989 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.11 send thread 10 16990 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.05 recv thread 10 16991 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.10 send thread 11 16992 admin 20 0 6855696 2.694g 46596 S 0.0 0.7 0:00.04 recv thread 11

When I use 400 processors in my job, there will be 800 send/recv threads in each processor. I think it should be optimized for there are too many threads in one processor. What do you think? Thanks @frankmcsherry

frankmcsherry commented 6 years ago

Hi!

This is a good question. You are definitely outside the normal operating parameters of the system, and 400 processes would be more than we've ever run even Naiad on.

I'm not certain how many worker threads you have in each process, but as you start to get up in the thousands of workers the point-to-point communication design needs to be replaced by a hierarchical design. There is a hierarchy at the moment, in that within a process all workers multiplex their communication destined for the same remote process on one thread pair, but it seems like you would need something like rack-/pod-local hierarchy.

Adding further layers of hierarchy isn't fundamentally complicated, but it is not present at the moment and would require further work. I don't have access to that sort of hardware, or workloads that require it, but I could talk you through the design if you were interested in trying it out. Alternately, I'm also happy to try and give some advice on how best to accomplish your computational goals, if not on timely, if you can say a bit more about what you are doing that needs this volume of workers.

If you have an acyclic dataflow, you can fake out hierarchy by using the capture/replay machinery to link up independent timely dataflows (e.g. a pod-local dataflow for each pod that ends in a capture, which is then replayed in independent inter-pod dataflows, and perhaps exchanged back as appropriate). This is clearly a bandage, but if you need this to happen it should work (modulo bugs).

Cheers, Frank

FangYongs commented 6 years ago

Thank you for your suggestions. I want to compute tens of billions of edges and hundreds of millions of vertexes in my timely job, and I can only use about 20G memory in each processor in my cluster. So I want to increase the number of processors in my job and do all the computation in the memory without any flow control. I may optimize the thread model of the network in timely later, and I hope that you will help to review the design when you're free. Thank you very much

frankmcsherry commented 6 years ago

Hi,

It sounds like the graph you want to process is quite large. My understanding is that there aren't a lot of main-memory systems that are comfortable with such large graphs. We've done it in timely, but on 16 machines each with 256GB of RAM. You can totally rent machines like that on EC2, and it would probably be more likely to work than hundreds of 20GB instances, if you have the option.

Other options include disk-based systems, which with a few SSDs can be quite fast. One example is FlashGraph which has reported good numbers using SSD on one machine, rather than a distributed collection of small computers.

Hope this helps!

FangYongs commented 6 years ago

Thank you for your suggestion. I run my job on a large cluster with thousands machines, and there're many other jobs run on the cluster. so I can't apply too much memory for each timely processor, instead I can apply much more processors with fewer memory for my timely job, such as 400 processors and 30G RAM for each processor, or even much more processors.

I use a large cluster to run my timely job, and there's no SSD in the cluster machine. Include disk-based systems sounds good, but it doesn't apply to my case.

Thanks

TimelyDataflow / timely-dataflow

Too many send and receive thread in timely processor #153