getgauge / gauge-dotnet

C# runner for gauge + DotNet standard 6.0-8.0
Apache License 2.0
27 stars 20 forks source link

ScenarioDataStore broken in version 0.6.0 - race condition. #204

Open jensakejohansson opened 7 months ago

jensakejohansson commented 7 months ago

@sriv When adding values to data store and then retrieving the value Gauge returns null intermittently. Seems to be race condition.

Related to previous PR to handle async methods: https://github.com/getgauge/gauge-dotnet/issues/199

Example to recreate problem can be found here: https://github.com/jensakejohansson/gauge-async

When I execute the scenario it seems to fail in about 2/3 of executions. It's just 2 steps. One that adds a value to the store and the next step tries to retrieve the stored value again.

sriv commented 7 months ago

thanks for reporting, I will investigate. the async change was big and I was surprised that nothing broke. The randomness explains why the tests didnt catch it.

jensakejohansson commented 7 months ago

I think you should consider rolling back the release 0.6.0 if possible. People who runs test executions in a pipeline might just install the latest plugin version by default and then suddenly their tests starts failing due to a not so clear reason. I just got a mail about that... 😬

sriv commented 7 months ago

Done, 0.6.0 is marked as a pre-release and the release metadata has been reverted from gauge-repository.

Apologies for the mess.

sriv commented 7 months ago

I did some investigation. Thanks to your sample project I could replicate the issue. I believe it is because the datastores use a ThreadLocal<ConcurrentDictionary> to hold the data, and threadlocal+async do not go well together.

I will try changing the Gauge.CSharp.Lib project to use something else instead of a ThreadLocal<ConcurrentDictionary> and check.

sriv commented 6 months ago

I did some analysis - and my suspicion was right. Gauge.CSharp.Lib provides the types for DataStores which internally uses a ThreadLocal<ConcurrentDictionary>

The context here is that the initial releases of gauge supported parallel execution by bringing up as many runner processes as the number of streams. This had some limitation in terms of the footprint used and more importantly not very container friendly.

Subsequently, gauge supported multithreaded parallel execution for languages that have good multi threading support. However there was a bit of backward compatibility to be honoured. In a multi process execution, each stream had its own datastore instantiated. In a multi threaded environment, there was one datastore shared across streams causing contention and race conditions. As a result, to mimic the multi process datastore setup, it was decided to make the datastores threadlocal.

In the async world, the threadlocal does not help because there is no guarantee that tasks will run on the same thread because of the design of async. The reasonable replacement for this would be AsyncLocal which ensures isolation between the async runcontexts. But this may not help here too much.

So the choice is - break backward compatibility and have single datastores per process or not have the async support.

I am investigating if there is a way to do both, but I am not very hopeful given the fundamental difference between async paradigm and a sticky thread execution.

I see 2 options for this:

  1. Introduce a separate datastore API for Async
  2. Ditch the threadlocal and let the runtime control the async read/writes.

Happy to hear suggestions.

/cc @getgauge/core (if anyone has any ideas on this)

jensakejohansson commented 6 months ago

Ouch! I see the problem.

Is it a correct understanding that option 2 above would solve the async problem and parallelization could be still be done, but less efficient by spawning processes instead of threads - potentially breaking some sort of backward compatibility in Gauge?

If so, then to me personally that would be the best option since async support (and a clean/simple to use API) is more important to us than efficient parallelization. We primarily run end-2-end user interface based test scenarios in Gauge, seldom in parallel at all, and if so, only at a very limited scale where some extra overhead would be totally acceptable. The lack of async support however is a major thing since many 3rd-party libraries/tools expose async APIs.

However, this is our specific use case, who am I to say what is best for all users?...

mpekurny commented 5 months ago

I would agree with above. I think option 2, having async work would be beneficial.

jensakejohansson commented 2 months ago

@sriv I have to come back to this issue again. Would you have the possibility to implement a solution for this so async is supported, even if it means compromises and it's not an 'optimal' solution? We really need a way forward here and any help would be much appreciated!

Best regards,

sriv commented 2 months ago

apologies @jensakejohansson - work has kept me very busy. I will try to see if I can spend some time over the weekend. Appreciate your patience.

Necobee commented 1 month ago

Hello @sriv - I came across this issue and I noticed that there have been no updates since Jul 2. Do you know if it is still in the plans to implement option#2?

jensakejohansson commented 1 day ago

@sriv Sorry for nagging, but I guess you haven't have the possibility to address this?

It's hard for me to argue for using Gauge in new projects if it cannot handle async libraries. We'd like provide a solution ourselves, but I think it will be difficult due to our limited knowledge of all the technical details and history / potential backward compatibilities etc. related to this.

However, if you feel you are unable to address this I guess we will give it try and do our best (when we find time). If so, any guidance of what changes needs to be done and other pieces of advice are welcome.

Best regards,

mpekurny commented 1 day ago

@sriv I looked over the PR you added to the library, and I have some concerns about the solution proposed. While it might "work", I think it would have a very negative impact on running gauge multi-threaded. I am not sure if this is a problem, maybe most people do not run Gauge multi-threaded.

I have been working on a different solution to the async/await problem which would work similar to how the .NET gauge solution works now, but is not backwards compatible between the library and plugin (meaning you would have to upgrade both at the same time); however, it would just work without any new settings or changes to client's tests.

sriv commented 1 day ago

@mpekurny There were issues reported with multithreading and I had to rollback the release. So no surprises there :).

I think I am ok to let go of backward compatibility in this particular case for mainly the reason that no one has spoken strongly about it in this scenario and I would imagine that the refactoring to accomodate the new changes would not be drastic.

Would you be open to raise a PR for the components?

@jensakejohansson - I was thinking about a way to handle the backward and raised a PR for Lib. But I am interested in @mpekurny's idea.

mpekurny commented 1 day ago

@sriv @jensakejohansson @chadlwilson

Would you be open to raise a PR for the components?

Yes, when it's ready. With other work commitments, I am probably another week or so before I would be ready to raise PRs for the changes (it would require changes in both the plugin and library). If you could hold off merging in your changes until then?