HigherOrderCO / Kind

A modern proof language
https://higherorderco.com
MIT License

Distributed computation - parallelism & concurrency #167

Closed dumblob closed 3 years ago

dumblob commented 3 years ago

Computation does not scale serially. Massively parallel systems are on a rapid rise (including extreme designs like fabric CPUs and MPPAs with thousands of mini CPU cores packaged in one physical CPU). Any new language has to provide first-class support for running code on such systems efficiently.

See the table with challenges and coarse-grained classification of such systems in https://github.com/vlang/v/issues/1868 .

Kind should support parallel (concurrent) processing. Note also that I don't think Kind should focus on instruction-level parallelism, as that is a "solved issue" (emitting C code is enough; C compilers - despite being limited by the language - do the best job of all existing mature tools nowadays, incl. offloading to GPUs, FPGAs, etc.).

Instead, Kind should focus on supporting parallelism at a human-guided level, for things that require knowing the semantics of the whole algorithm/system to be efficiently parallelizable (e.g. a compiler will probably never figure out that it should parallelize at least the dispatch loop for incoming HTTP requests so they are handled in parallel, because to the compiler this particular piece of the algorithm is a needle in a haystack of many more "interesting" places to parallelize, each with lower final impact than the needle).

For the (tiny and rarely needed) rest there are intrinsics (though I'm not sure Kind should adopt intrinsics, as they feel a bit alien in Kind).

dumblob commented 3 years ago

@mratsim Kind might be of interest to you. Btw. your findings regarding CPS and parallelism/concurrency from Nim could also be very useful for Kind.

dumblob commented 3 years ago

off-topic: @rosstate Kind might be of interest to you and your research group

VictorTaelin commented 3 years ago

The current plan is to implement a massively parallel compiler based on interaction nets. Have you seen my posts about it? Are you familiar with that? If so, do you still think "human guided" parallelism is necessary even in such a system?

dumblob commented 3 years ago

The current plan is to implement a massively parallel compiler based on interaction nets.

Oh, I'm dumb - I didn't realize "inet" means interaction net. Yes, this is another route to "instruction-level parallelism" (not exactly, as inets are inherently broader, but read further).

Have you seen my posts about it?

I read one of them and I was delighted Kind takes this approach!

Are you familiar with that?

I know interaction nets exist and how they relate to other computational paradigms. I wouldn't say I'm familiar with them, though :wink: (in the sense of being able to write down a given algorithm as an interaction net on paper with the book closed).

If so, do you still think "human guided" parallelism is necessary even in such a system?

Yes, "human guided" parallelism is definitely needed. Most of the information the inet compiler lacks is human intuition about usage - especially minimum (to save power, bandwidth, etc.) and maximum (max load, max number of users, etc.). Note this can't even be derived from measured data of a running system. Because there is (usually significant) difference between "what the system is designed for" and "how the system during the measurement phase behaved". E.g. I know that my chat & picture distribution platform will grow slowly, but the system is designed (by a human) to handle 10000x higher load right now just in case e.g. some low probable but high-stake event happens.

Imagine M. Zuckerberg dies, Facebook runs into management issues leading to a massive drain of users, and the uptake of my tiny "chat & picture distribution platform" happens in such a short time that our programmers are unable to (re)guide the compiler on how to parallelize it. The system therefore has to be designed for such load beforehand, and a compiler has no way to do that even if profiled with artificial inputs (such "automated profiling" would most probably end up with the same properties as neural networks - i.e. effectiveness directly proportional to the training data - but that would mean gathering enormous amounts of data about each such concurrent/distributed system, which is much harder to do well and much more costly than "human-guided" parallelism).

In other words, human intuition about the holistic behavior of a system is a strict superset of what Kind can infer (assuming the practical limits of our world). Kind's ability to parallelize (based on inets) goes strongly in the direction of instruction-level parallelism or "CPU fabric" massive parallelism, but not in the direction of holistic parallelism and scaling.

With that in mind, let me propose what one portion of the parallel/concurrent/distributed API could offer. I think the following could also become one of Kind's interesting selling points (you built a company around Kind, so this might be worth exploring :wink:).

Namely, causal convergence of distributed data in the same vein as CRDTs, but in the modern form of an "ORDT" as described in http://archagon.net/blog/2018/03/24/data-laced-with-history/ . At least the CT /causal tree/ described in the article would be a good beginning - maybe in a slightly more generic form which would guide (force) the programmer to specify the dependencies & relations needed to form a functional ORDT (Kind seems to be a perfect fit for this), including a copy-on-write rope as a backend so it scales by default.
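
To make the shape concrete, here is a tiny Haskell sketch of a causal-tree atom and its weave in the spirit of the article (all names are made up for illustration; nothing here is an existing Kind or ORDT API):

```haskell
-- Made-up illustration of a causal-tree atom and a deterministic weave.
-- Naive O(n^2); siblings are ordered newest-first (descending Lamport clock,
-- ties broken by site id) so that every replica linearizes identically.
import Data.List (sortOn)
import Data.Ord (Down (..))

data AtomId = AtomId { site :: Int, lamport :: Int }
  deriving (Eq, Ord, Show)

data Atom a = Atom
  { atomId :: AtomId
  , cause  :: Maybe AtomId  -- the atom this one causally follows; Nothing = root
  , value  :: a
  } deriving Show

weave :: [Atom a] -> [Atom a]
weave atoms = go Nothing
  where
    go c = concat
      [ a : go (Just (atomId a))
      | a <- sortOn (Down . key) (filter ((== c) . cause) atoms) ]
    key a = (lamport (atomId a), site (atomId a))

-- e.g. weave [ Atom (AtomId 1 1) Nothing 'a'
--            , Atom (AtomId 2 2) (Just (AtomId 1 1)) 'b' ]
```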

Speaking of ORDTs, I'd suggest not using any of the four "baseline determination" strategies proposed in the article by default, and instead using a fifth one: use a wall-clock time period to designate the point in time after which a node can accept (and thus also send out /gossip/) a message determining a baseline. Anything older will not be "thrown away" but rather divided into chunks and distributed among all peers using a simple DHT /distributed hash table/ with some sensible replication factor (i.e. the bigger the number of peers, the less old history is stored on each node).

So if some node connects after a long time offline (e.g. a year), it shall converge nicely. At that moment the network will undergo a peak load, because the very old history that wasn't needed for a year has to be fetched from the DHT, but I think that's acceptable in the general case, which IMHO shouldn't sacrifice history but rather make it ridiculously cheap. Another advantage is that the DHT could be tweaked to prefer some "special" nodes - e.g. an archive cluster bought as a service by the network of users (hint: maybe another potential source of income for your company) - in which case the whole old history would be stored elsewhere and wouldn't take up precious space on e.g. mobile devices.
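
To make that rule concrete, a tiny Haskell sketch of the split (made-up names again; the toy hash merely stands in for a real DHT's peer selection):

```haskell
-- Made-up sketch of the baseline cutoff: ops older than the cutoff are not
-- discarded, they are chunked and assigned to peers with a replication
-- factor; the toy hash stands in for a real DHT's peer selection.
data Op = Op { opId :: Int, opTime :: Int }
  deriving Show

-- Everything at or after `cutoff` stays in the live replica; everything
-- older is grouped into chunks destined for the DHT.
baselineSplit :: Int -> Int -> [Op] -> ([Op], [[Op]])
baselineSplit cutoff chunkSize ops = (live, chunksOf chunkSize old)
  where
    live = filter ((>= cutoff) . opTime) ops
    old  = filter ((<  cutoff) . opTime) ops
    chunksOf _ [] = []
    chunksOf n xs = let (c, rest) = splitAt n xs in c : chunksOf n rest

-- Assign a chunk to `replication` of the `nPeers` peers: the more peers
-- there are, the less old history each one has to keep.
placeChunk :: Int -> Int -> [Op] -> [Int]
placeChunk nPeers replication chunk =
  take replication [ (h + i) `mod` nPeers | i <- [0 ..] ]
  where h = sum (map opId chunk) `mod` nPeers
```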

The idea is to make the API suitable not only for network communication but equally well for inter-process communication on the same physical computer (it inflates the data size, but it removes the need for locking).

Another feature the API should offer is filtering which ORDT messages are accepted for the final reconstruction and which are not, based on different criteria - what comes to mind is e.g. viewing a vector-graphics canvas with the changes from a selected person hidden (or with only that person's changes newer than a specified wall-clock timestamp hidden). This filtering solves the problem outlined in https://news.ycombinator.com/item?id=17227624 .

The API should also support a "grouping identifier", which is very useful for turning a sequence of atomic operations transactionally into one atomic operation (i.e. making the whole ORDT concept composable).
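
Putting the last two points together, a sketch of the filtering and grouping side could look like this (hypothetical names; ordering details glossed over):

```haskell
-- Made-up sketch of the filtering and grouping side of the API; relative
-- ordering of groups is glossed over.
import qualified Data.Map.Strict as Map

data Op = Op
  { opSite :: Int        -- which peer/person produced the op
  , opTime :: Int        -- wall-clock timestamp
  , opTxn  :: Maybe Int  -- grouping identifier: ops sharing it apply atomically
  , opBody :: String     -- the actual edit, opaque here
  } deriving Show

-- Reconstruction-time filter, e.g. "hide everything from site 3 that is
-- newer than t0" while keeping the rest of the history intact.
keepOp :: Int -> Int -> Op -> Bool
keepOp hiddenSite t0 op = not (opSite op == hiddenSite && opTime op > t0)

-- Group ops by transaction id so a whole group is applied (or dropped) as
-- one atomic step; ops without a group id form singleton groups.
groupTxn :: [Op] -> [[Op]]
groupTxn ops =
  [ [op] | op <- ops, opTxn op == Nothing ] ++
  Map.elems (Map.fromListWith (++) [ (t, [op]) | op <- ops, Just t <- [opTxn op] ])

reconstruct :: Int -> Int -> [Op] -> [[Op]]
reconstruct hiddenSite t0 = groupTxn . filter (keepOp hiddenSite t0)
```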

VictorTaelin commented 3 years ago

I strongly dispute that Kind can't "infer" the needed parallelism in any case. Think of it as: anything that can, logically, be parallel will be parallel in such an architecture. It makes no sense to let humans "choose" what will be parallel if everything is. I'm sorry for the short answer, but it makes no sense to dispute the arguments if I don't agree with the assumption itself. I think I'd need a concrete example of a situation where a human could somehow parallelize better than what Kind can "infer" (which isn't the right word because, again, everything that doesn't have dependencies will be executed in parallel, so there is nothing to infer). Perhaps that will help me understand what you're saying.

dumblob commented 3 years ago

I strongly dispute that Kind can't "infer" the needed parallelism in any case. Think of it as: anything that can, logically, be parallel will be parallel in such an architecture. It makes no sense to let humans "choose" what will be parallel if everything is.

Now I think I understand what you mean. The premise is that running multiple concurrent inet agents on one physical CPU core is less efficient than serializing their work into one big inet agent.

Now imagine: with Kind, even very small programs (hundreds of SLOC) with "average" dependencies will have hundreds or thousands of concurrent inet agents. But despite the "big numbers of separate processing units" I outlined above (tens on modern CPUs, hundreds to small thousands on CPU fabrics, and many thousands on supercomputers), the number of inet agents is >> (much bigger than) the number of physical units one could schedule them onto. And the problem is that there is no way a compiler could reliably find out what has priority (low latency) and what does not (throughput). This trade-off has to be made by a person. It's actually worse - this changes dynamically at runtime (for finer granularity imagine e.g. big.LITTLE or more complex architectures on modern devices, which turn most of the CPUs on/off; for coarser granularity imagine e.g. network-distributed computation - the network changes quickly and the system has to react, e.g. by moving scheduled tasks to other nodes, etc.).
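
Purely as an illustration of what such human guidance could look like (nothing like this exists in Kind; the names and the policy are invented):

```haskell
-- Purely hypothetical sketch of "human guidance": the programmer tags work
-- with a hint and a toy scheduler places it on big or small cores
-- accordingly. None of these names exist in Kind; the policy is invented
-- and assumes both core lists are non-empty.
data SchedHint
  = LowLatency   -- e.g. the HTTP dispatch loop: run as soon as possible
  | Throughput   -- bulk work: batch it, keep caches warm
  | Background   -- run only when the cheap cores are otherwise idle
  deriving (Eq, Show)

data Task = Task { taskName :: String, hint :: SchedHint }
  deriving Show

-- Toy policy: low-latency tasks go to the fast cores; everything else is
-- spread over the small cores first, spilling onto the fast ones.
assign :: [Int] -> [Int] -> [Task] -> [(String, Int)]
assign fastCores smallCores tasks =
  zip (map taskName lat)  (cycle fastCores) ++
  zip (map taskName rest) (cycle (smallCores ++ fastCores))
  where
    lat  = filter ((== LowLatency) . hint) tasks
    rest = filter ((/= LowLatency) . hint) tasks

-- e.g. assign [0,1] [2,3] [Task "http-dispatch" LowLatency, Task "reindex" Throughput]
```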

I'm sorry for the short answer, but it makes no sense to dispute the arguments if I don't agree with the assumption itself. I think I'd need a concrete example of a situation where a human could somehow parallelize better than what Kind can "infer" (which isn't the right word because, again, everything that doesn't have dependencies will be executed in parallel, so there is nothing to infer). Perhaps that will help me understand what you're saying.

Does the explanation above make it more transparent?

VictorTaelin commented 3 years ago

I think it does, but I don't think we're talking about the same thing, and I don't think I have the required knowledge to help you.

For the sake of communication, this is what I meant: An interaction net is a graph with a set of graph-rewrite rules that, when applied repeatedly, evaluate a functional program to its normal form. The amount of parallelism is measured by the number of reducible nodes, or redexes, present in the graph at any time. It is usual for graphs to have thousands of redexes at some point, but even if the number of cores is smaller than the number of redexes, it is trivial to split the work among cores. For example, if there are 1600 redexes and 16 cores, you can simply assign 100 redexes to each core. My point was that I believe one can evaluate Kind terms in parallel optimally without any sort of human guidance or annotation.
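
As a toy illustration of that even split (using GHC sparks; `Redex` and `reduce` are placeholders, and this is not how Kind's runtime is actually implemented):

```haskell
-- Toy illustration of the even split using GHC sparks (needs the `parallel`
-- package; compile with -O2 -threaded -rtsopts, run with +RTS -N16).
-- `Redex` and `reduce` are placeholders; this is not Kind's actual runtime.
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

type Redex = Int          -- stand-in for a real interaction-net redex
reduce :: Redex -> Redex  -- stand-in for one graph-rewrite step
reduce = (* 2)

-- 1600 redexes on 16 cores: hand each core a chunk of 100.
step :: Int -> [Redex] -> [Redex]
step cores redexes = map reduce redexes `using` parListChunk chunkSize rdeepseq
  where chunkSize = max 1 (length redexes `div` cores)

main :: IO ()
main = print (sum (step 16 [1 .. 1600]))
```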

Regardless, I believe you are talking about something else. You mention a tradeoff between latency and throughput, but I don't know what latency or throughput specifically you're referring to. I still can't visualize the kind of situation where the "instruction level" parallelism I described wouldn't suffice, but I understand that may be the case. I'm sadly not the person to help you meaningfully with that. But I'm completely open to improvement proposals that would make Kind better and expand its use-cases.

dumblob commented 3 years ago

My point was that I believe one can evaluate Kind terms in parallel optimally without any sort of human guidance or annotation.

This is definitely true, and that's what I like about Kind. But that's what the theory says - for practice, read below :wink:. And sorry for yet another lengthy post; I'd like to bring clarity to everything here.

For the sake of communication, this is what I meant: An interaction net is a graph with a set of graph-rewrite rules that, when applied repeatedly, evaluate a functional program to its normal form. The amount of parallelism is measured by the number of reducible nodes, or redexes, present in the graph at any time. It is usual for graphs to have thousands of redexes at some point, but even if the number of cores is smaller than the number of redexes, it is trivial to split the work among cores. For example, if there are 1600 redexes and 16 cores, you can simply assign 100 redexes to each core.

This is unfortunately not true, in the sense of "not aligned with existing HW". The point is that the task of "scheduling" the parallel work onto "physical" processing units has to be guided by a human, simply because any interaction (for Kind it's basically just 2 things: the work needed to distribute redexes among processing units, and the work to scrape the results back from those processing units) is very costly on any decently modern HW I know of (not limited to smartphone/desktop CPUs, but including CPU-fabric products and supercomputers). So it makes a huge difference whether a "map over 1 million integers with a cheap operation producing a new 1 million integers" is computed on a 16-vcore CPU serially (using only one core), partially in parallel (e.g. distributed among 4 cores), or fully in parallel (e.g. distributed among 16 cores). I'd guess that on a modern CPU (imagine an AMD EPYC) the serial computation would probably be the fastest and the 16-core variant probably the slowest.
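
If you want to check that guess on your own machine, here is a rough Haskell sketch that times the same cheap map serially and with different chunkings (not a rigorous benchmark; use criterion for that):

```haskell
-- Rough (non-rigorous) timing sketch; use criterion for real measurements.
-- Needs the `parallel` package; compile with -O2 -threaded -rtsopts and run
-- with +RTS -N16 to see how chunk granularity changes the picture.
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

cheap :: Int -> Int
cheap x = x * 2 + 1

timeIt :: String -> [Int] -> IO ()
timeIt label xs = do
  t0 <- getCurrentTime
  print (sum xs)  -- force the whole list
  t1 <- getCurrentTime
  putStrLn (label ++ ": " ++ show (diffUTCTime t1 t0))

main :: IO ()
main = do
  let input = [1 .. 1000000] :: [Int]
  timeIt "serial       " (map cheap input)
  timeIt "16 big chunks" (map cheap input `using` parListChunk 62500 rdeepseq)
  timeIt "10k tiny ones" (map cheap input `using` parListChunk 100 rdeepseq)
```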

There, many levels of caching and speculative execution come into play and totally shuffle the results. In the end one ends up with bubble sort being twice as fast as insertion sort, with Lomuto partitioning (suboptimal) in quicksort being about 1.2-1.3x faster than Hoare's (optimal) partitioning scheme, etc.

Currently the most efficient method I know of (and probably the closest to what you refer to as "...simply assign 100 redexes to each core") seems to be work stealing. It has a nearly optimal scaling factor but comes with one serious drawback: the state-of-the-art research in that area still doesn't support any runtime reconfiguration, so it's not suitable for wearables/smartphones/tablets/notebooks and just about anything running on a battery. Which means, again, human guidance is necessary.
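
For reference, the work-stealing idea itself fits in a few lines (a deliberately naive sketch with list-backed deques; real implementations use lock-free Chase-Lev deques and are far more careful):

```haskell
-- Deliberately naive work-stealing sketch: list-backed deques in IORefs, one
-- deque per worker, idle workers steal from the other end of a victim's deque.
-- Compile with -threaded -rtsopts and run with +RTS -N.
import Control.Concurrent
import Control.Monad
import Data.IORef

type Task = IO ()

newtype Deque = Deque (IORef [Task])

pushTask :: Deque -> Task -> IO ()
pushTask (Deque ref) t = atomicModifyIORef' ref (\ts -> (t : ts, ()))

-- The owner pops from the front...
popTask :: Deque -> IO (Maybe Task)
popTask (Deque ref) = atomicModifyIORef' ref $ \ts -> case ts of
  []       -> ([], Nothing)
  (t:rest) -> (rest, Just t)

-- ...while thieves steal from the back.
stealTask :: Deque -> IO (Maybe Task)
stealTask (Deque ref) = atomicModifyIORef' ref $ \ts -> case reverse ts of
  []       -> ([], Nothing)
  (t:rest) -> (reverse rest, Just t)

-- A worker runs its own tasks; when its deque is empty it scans the victims
-- once and stops if nothing is left anywhere (fine here, since all work is
-- seeded up front).
worker :: Int -> [Deque] -> IO ()
worker me deques = loop
  where
    mine    = deques !! me
    victims = [d | (i, d) <- zip [0 ..] deques, i /= me]
    loop = popTask mine >>= maybe (trySteal victims) (>> loop)
    trySteal []       = return ()
    trySteal (v : vs) = stealTask v >>= maybe (trySteal vs) (>> loop)

main :: IO ()
main = do
  let nWorkers = 4
  deques <- replicateM nWorkers (Deque <$> newIORef [])
  -- Seed all 40 tasks on worker 0; workers 1..3 only get work by stealing.
  forM_ [1 .. 40 :: Int] $ \i ->
    pushTask (head deques) (putStrLn ("task " ++ show i))
  dones <- forM [0 .. nWorkers - 1] $ \i -> do
    done <- newEmptyMVar
    _ <- forkIO (worker i deques >> putMVar done ())
    return done
  mapM_ takeMVar dones
```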

Regardless, I believe you are talking about something else. You mention a tradeoff between latency and throughput, but I don't know what latency or throughput specifically you're referring to. I still can't visualize the kind of situation where the "instruction level" parallelism I described wouldn't suffice, but I understand that may be the case. I'm sadly not the person to help you meaningfully with that. But I'm completely open to improvement proposals that would make Kind better and expand its use-cases.

The latency/throughput trade-off is just one of the most significant side effects of the "scheduling issue" outlined above. Permitting the scheduler to "randomly" schedule/move redexes to/among processing units is highly inefficient - imagine a supercomputer with multiple levels of "interlinks" wildly differing in latency & throughput - so not taking this into account would significantly degrade performance (both latency and throughput). Scheduling work first onto CPUs connected by the fastest interlinks (e.g. CPU cores on one die) would instead likely yield a 1000x (or more) performance gain while at the same time decreasing latency.

Then the next thing comes into play: in case the cores have different performance characteristics (some are big CPUs, some are GPUs, some are FPGAs, some are small CPUs, etc.), the difference in computational difficulty between redexes (which varies wildly) becomes a nightmare for random/round-robin scheduling. Scheduling difficult redexes to weak processing units will take longer. Scheduling easy redexes to big CPUs will increase power consumption and will close the door for difficult redexes to be scheduled on those big CPUs, because sending difficult redexes (difficulty usually correlates with size) back and forth is likely even costlier than computing them on the weak processing units. And all of this changes at runtime - the main driver being power savings from turning CPUs/GPUs/FPGAs/... off and on.

So despite my talking about internet-scale networks, I'd rather skip that for now and focus only on the power-saving aspect: turning on and off variously powerful/weak processing units with different computing characteristics (CPU versus GPU versus FPGA, etc.). That's low-level enough, and any other distributed computation (incl. ORDTs) should be possible to model on top of this lower-level concept.

Are we on the same page now?

VictorTaelin commented 3 years ago

This is unfortunately not true, in the sense of "not aligned with existing HW". The point is that the task of "scheduling" the parallel work onto "physical" processing units has to be guided by a human, simply because any interaction (for Kind it's basically just 2 things: the work needed to distribute redexes among processing units, and the work to scrape the results back from those processing units) is very costly on any decently modern HW I know of (not limited to smartphone/desktop CPUs, but including CPU-fabric products and supercomputers).

I think I get what you're talking about now. It seems like the problem you're trying to solve is "how to get the most performance from existing hardware, which has wildly different types of cores, performance characteristics, costs, etc.". But doesn't that say more about how CPUs are currently designed than about the merits of the algorithm itself? I think I'd be happy enough to have a "general purpose" parallel reducer that does its best to achieve reasonable results. For example, if a computer has 8 cores, and if your Kind program is inherently parallel (such as recursing over a tree), then a speedup a few %s short of the theoretical maximum of 800% would be great to me, because it would still be massively superior to running everything on a single core. But if you're talking about how to squeeze out those extra %s to get as close as possible to the optimum on existing wild hardware, then I think that is a much more complex problem, but I don't think it is too important for now. I mean, I'm not aware of any functional language which automatically parallelizes recursive algorithms, so just doing that, even with some loss, would be a breakthrough to me, and it seems like a reasonable initial goal.

dumblob commented 3 years ago

I mean, I'm not aware of any functional language which automatically parallelizes recursive algorithms, so just doing that, even with some loss, would be a breakthrough to me, and it seems like a reasonable initial goal.

Yes, this is really the cool thing and I absolutely agree.

On the other hand, I'm afraid that those "few %s behind the theoretical maximum" are actually e.g. 40% of the theoretical max performance, due to highly inefficient use of the resources (and this is actually quite an optimistic guess based on what I know of inets - I've seen even worse numbers in practice, e.g. with work stealing, especially when the work gets divided into very small chunks; everybody knows that e.g. tight loops can be made many times faster just by unrolling them; everybody knows that scheduling small tasks across different cores forces a core to invalidate its cache and thus wait for memory, which slows it down by an order of magnitude or more; etc.).

But I definitely agree that we should first measure it and only then get back to the drawing board. I'm 100% certain, though, that some full-featured API for "human guidance of parallelism" will be inevitable at some point.

dumblob commented 3 years ago

FYI, I can see some similarity (from a bird's-eye perspective) between Asynchronous Everything and parallelization down to the smallest redex.

I highly recommend reading the article to gain insight into where a "naive" full parallelization leads (not to mention that with real parallelism across independent CPU cores the losses and inefficiency will be amplified significantly compared to async, since async runs on one core while inets will use multiple).