Closed Enderlook closed 2 years ago
Hi, thank's for you investigation and of course, I am open to suggestions, ideas and criticism. I need it to make things better. Looks like you really understand the system.
The AsyncLocal<CPU?> approach is definitely new to me. I have to go deeper in this.
So long the idea has been to only get and use the task_cpu within a thread and, if necessary, to deliver results as BigRational. And as you said, it would slow down everithing if the cpu had to ensure threadsafty. Another approach is to keep the core as small and simple as possible, only expanding when it's good for performance or necessary for external solutions.
What is possible with the CPU in terms of threading, but also as an approach to use in iterators, IEnumerable and so on:
Return BigRational numbers from the task_cpu:
var a = new BigRational[1000];
Parallel.For(0, 1000, i => a[i] = MathR.Sin(i * 0.123m, digits: 100)); // 11 ms in debug
or the same using the cpu:
var b = new BigRational[1000];
Parallel.For(0, 1000, i =>
{
var cpu = rat.task_cpu;
cpu.push(i);
cpu.push(0.123m);
cpu.mul();
cpu.sin(80, cos:false); //80: ILog10(digits);
cpu.rnd(24); //digits
b[i] = cpu.popr();
});`
Another approach is to create a manage own cpu's:
var mycpu = new BigRational.CPU();
var task = Task.Run(() =>
{
mycpu.push(1);
mycpu.push(2);
mycpu.push(3);
});
task.Wait();
var x = mycpu.popr();
var y = mycpu.popr();
var z = mycpu.popr();
or for (threadsafe) enumerators too:
static IEnumerable<BigRational> getpows10forever()
{
var mycpu = new rat.CPU();
mycpu.push(1);
for (int i = 0; ; i++)
{
if (i != 0) { mycpu.push(10); mycpu.mul(); }
mycpu.get(0,out rat r); yield return r;
}
}
var c = getpows10forever().Take(100).ToArray();
With the current interface it is also possible to transfere numbers from one cpu to another without memory allocations for an BigRationl number;
var cpu = rat.task_cpu;
mycpu.get(0, out ReadOnlySpan<uint> sp);
cpu.push(sp);
That way it sould be possible to construct special solutions from outside and an AsyncLocal<CPU?> cpu too.
There are many examples in the Test project where the cpu is excessively used in and between threads.
for example:
In a Tesselator where a own cpu acts in parallel as vertexbuffer too
In a polyhedron CSG function parallel over all planes
I will think about your solution and especially if it is maybe possible to automate something like this with out any performance lost or much effort. In the moment however I have no idea how to do.
Regards from Germany, co
So long the idea has been to only get and use the task_cpu within a thread and, if necessary, to deliver results as BigRational. And as you said, it would slow down everithing if the cpu had to ensure threadsafty.
The slow down won't affect your implemented methods if you have both [ThreadLocal] CPU? perThread
and a AsyncLocal<CPU?> perTask
. The perThread
must be only used for methods of the library itself (i.e: be internal
and only used by MathR
and BigRational
), since none of these methods has interruptions inside (i.e: await someTask
) zero-cost overhead is imposed on them.
Only for the public CPU perTask { get; }
property you must use the AsyncLocal<CPU?>
instead. In this way, the overhead is only applied to external code, which -by the way- won't be very costly since consumers can cache this value safety due to the virtue of being async local. Anyway, the implementation of AsyncLocal<T>.Value
don't look really expensive.
Alternatively, you can not provide a public CPU perTask { get; }
and instead let each consumer cache their own CPU
by their own specifically crafted means that suit their need. (However the aproach explained above would fit quite well).
Hi, thanks. I made a quick check:
var t1 = Environment.TickCount;
for(int i = 0; i < 10_000_000;i++) { var t = rat.task_cpu; }
var t2 = Environment.TickCount; // t2 - t1 -> 78 ms
for (int i = 0; i < 10_000_000; i++) { var t = async_cpu; }
var t3 = Environment.TickCount; // t3 - t2 -> 172 ms
78 : 172 ms, that's not bad, much better than I expected, since with async there is always a dictionary access behind:
// in AsyncLocal<>.Value, the critical path
ExecutionContext? current = Thread.CurrentThread._executionContext;
// ...
current.m_localValues.TryGetValue(local, out object? value);
async_cpu I implemented as:
private static AsyncLocal<rat.CPU>? _async_cpu;
public static rat.CPU async_cpu
{
get
{
var p = _async_cpu ??= new AsyncLocal<rat.CPU>();
return p.Value ?? (p.Value = new rat.CPU());
}
}
It would be perfect to find a way to have only one task_cpu that has the properties of async_cpu but is 3 times faster. In other words, finding a fast method to check when task_cpu should check an private AsyncLocal instance - not always. This would be a perfect and safe solution.
Hi, many thanks again for pointing out this issue. After long time, lot of trials and headache... AsyncLocal also not work for all cases... now the simple solution with minimal overhead: For the safe end user API there is now a:
public readonly ref struct SafeCPU
{
readonly CPU cpu;
}
This ref struct encapsulates simply the shared ThreadStatic CPU that is private now. As ref type, it cannot be used in asynchronous methods, asynchronous lambda expressions, query expressions, iterator blocks or inside non-static nested functions. The compiler does not allow this. But for such cases it is safe to use a local new created instance of a CPU object.
Not sure if you accept suggestions, but found a possible problem when reading the documentation of how works this code.
The exposed property
static CPU task_cpu
would not work property on asynchronous code. Imagine the following:The problem also persists when caching the access:
Instead, the
[ThreadStatic] static CPU? cpu;
field should bestatic AsyncLocal<CPU?> cpu = new();
. This wrapper handles all issues that may raise when suspending or resuming asynchronous code.However, this would also slow down implemented operators in
BigRational
, so I would suggest having two properties:BTW, amazing job.