Possible issues may arise when using this code in asynchronous methods

Enderlook commented 2 years ago

Not sure if you accept suggestions, but found a possible problem when reading the documentation of how works this code.

The exposed property static CPU task_cpu would not work property on asynchronous code. Imagine the following:

// Thread "1".
BigRational.task_cpu.push(value1);
await SomeOperation();
// Resume on thread "2".
BigRational.task_cpu.push(value2); // Pushing on a different cpu.

The problem also persists when caching the access:

// Thread "1".
CPU cpu= BigRational.task_cpu;
cpu.task_cpu.push(value1);
await SomeOperation();
// Resume on thread "2".
cpu.task_cpu.push(value2); // Pushing on cpu of thread "1" from thread "2", which is not thread safe.

Instead, the [ThreadStatic] static CPU? cpu; field should be static AsyncLocal<CPU?> cpu = new();. This wrapper handles all issues that may raise when suspending or resuming asynchronous code.

However, this would also slow down implemented operators in BigRational, so I would suggest having two properties:

// Only used inside the library for the operators.
private static CPU thread_cpu_ => thread_cpu??= new();
[ThreadStatic]
private static CPU? thread_cpu_;

public static CPU task_cpu => cpu.Value ??= new();
private static AsyncLocal<CPU?> cpu;

BTW, amazing job.

c-ohle commented 2 years ago

Hi, thank's for you investigation and of course, I am open to suggestions, ideas and criticism. I need it to make things better. Looks like you really understand the system.

The AsyncLocal<CPU?> approach is definitely new to me. I have to go deeper in this.

So long the idea has been to only get and use the task_cpu within a thread and, if necessary, to deliver results as BigRational. And as you said, it would slow down everithing if the cpu had to ensure threadsafty. Another approach is to keep the core as small and simple as possible, only expanding when it's good for performance or necessary for external solutions.

What is possible with the CPU in terms of threading, but also as an approach to use in iterators, IEnumerable and so on:

Return BigRational numbers from the task_cpu:

    var a = new BigRational[1000];
    Parallel.For(0, 1000, i => a[i] = MathR.Sin(i * 0.123m, digits: 100)); // 11 ms in debug

or the same using the cpu:

    var b = new BigRational[1000];
    Parallel.For(0, 1000, i =>
    {
      var cpu = rat.task_cpu;
      cpu.push(i);
      cpu.push(0.123m);
      cpu.mul();
      cpu.sin(80, cos:false); //80: ILog10(digits);  
      cpu.rnd(24); //digits
      b[i] = cpu.popr();
    });`

Another approach is to create a manage own cpu's:

  var mycpu = new BigRational.CPU();
  var task = Task.Run(() =>
  {
    mycpu.push(1);
    mycpu.push(2);
    mycpu.push(3);
  });
  task.Wait();
  var x = mycpu.popr();
  var y = mycpu.popr();
  var z = mycpu.popr();

or for (threadsafe) enumerators too:

  static IEnumerable<BigRational> getpows10forever()
  {
    var mycpu = new rat.CPU();
    mycpu.push(1);
    for (int i = 0; ; i++)
    {
      if (i != 0) { mycpu.push(10); mycpu.mul(); }
      mycpu.get(0,out rat r); yield return r;
    }
  }
  var c = getpows10forever().Take(100).ToArray();

With the current interface it is also possible to transfere numbers from one cpu to another without memory allocations for an BigRationl number;

    var cpu = rat.task_cpu;
    mycpu.get(0, out ReadOnlySpan<uint> sp); 
    cpu.push(sp);

That way it sould be possible to construct special solutions from outside and an AsyncLocal<CPU?> cpu too. There are many examples in the Test project where the cpu is excessively used in and between threads. for example: In a Tesselator where a own cpu acts in parallel as vertexbuffer too
In a polyhedron CSG function parallel over all planes

I will think about your solution and especially if it is maybe possible to automate something like this with out any performance lost or much effort. In the moment however I have no idea how to do.

Regards from Germany, co

Enderlook commented 2 years ago

So long the idea has been to only get and use the task_cpu within a thread and, if necessary, to deliver results as BigRational. And as you said, it would slow down everithing if the cpu had to ensure threadsafty.

The slow down won't affect your implemented methods if you have both [ThreadLocal] CPU? perThread and a AsyncLocal<CPU?> perTask. The perThread must be only used for methods of the library itself (i.e: be internal and only used by MathR and BigRational), since none of these methods has interruptions inside (i.e: await someTask) zero-cost overhead is imposed on them. Only for the public CPU perTask { get; } property you must use the AsyncLocal<CPU?> instead. In this way, the overhead is only applied to external code, which -by the way- won't be very costly since consumers can cache this value safety due to the virtue of being async local. Anyway, the implementation of AsyncLocal<T>.Value don't look really expensive.

Alternatively, you can not provide a public CPU perTask { get; } and instead let each consumer cache their own CPU by their own specifically crafted means that suit their need. (However the aproach explained above would fit quite well).

c-ohle commented 2 years ago

Hi, thanks. I made a quick check:

var t1 = Environment.TickCount;
for(int i = 0; i < 10_000_000;i++) { var t = rat.task_cpu; }
var t2 = Environment.TickCount; // t2 - t1 -> 78 ms
for (int i = 0; i < 10_000_000; i++) { var t = async_cpu; }
var t3 = Environment.TickCount; // t3 - t2 -> 172 ms

78 : 172 ms, that's not bad, much better than I expected, since with async there is always a dictionary access behind:

  // in AsyncLocal<>.Value,  the critical path
  ExecutionContext? current = Thread.CurrentThread._executionContext;
  // ...
  current.m_localValues.TryGetValue(local, out object? value);

async_cpu I implemented as:

private static AsyncLocal<rat.CPU>? _async_cpu;
public static rat.CPU async_cpu
{
  get
  {
      var p = _async_cpu ??= new AsyncLocal<rat.CPU>();
      return p.Value ?? (p.Value = new rat.CPU());
  }
}

It would be perfect to find a way to have only one task_cpu that has the properties of async_cpu but is 3 times faster. In other words, finding a fast method to check when task_cpu should check an private AsyncLocal instance - not always. This would be a perfect and safe solution.

c-ohle commented 2 years ago

Hi, many thanks again for pointing out this issue. After long time, lot of trials and headache... AsyncLocal also not work for all cases... now the simple solution with minimal overhead: For the safe end user API there is now a:

public readonly ref struct SafeCPU 
{
  readonly CPU cpu;
}

This ref struct encapsulates simply the shared ThreadStatic CPU that is private now. As ref type, it cannot be used in asynchronous methods, asynchronous lambda expressions, query expressions, iterator blocks or inside non-static nested functions. The compiler does not allow this. But for such cases it is safe to use a local new created instance of a CPU object.

c-ohle / RationalNumerics

Possible issues may arise when using this code in asynchronous methods #2