apache / trafficserver

Apache Traffic Serverâ„¢ is a fast, scalable and extensible HTTP/1.1 and HTTP/2 compliant caching proxy server.
https://trafficserver.apache.org/
Apache License 2.0
1.8k stars 798 forks source link

Fix race conditions in Regex.cc. #11399

Closed ywkaras closed 3 months ago

ywkaras commented 4 months ago

These changes should prevent (low-frequency) crashes that are being seen in Yahoo Prod.

cmcfarlen commented 4 months ago

Should this be added to 10.0.x project as well for cherry-pick?

ywkaras commented 4 months ago

Should this be added to 10.0.x project as well for cherry-pick?

Debatable. Any crashes would happen at shutdown, and seem infrequent (with limited testing in Yahoo Prod).

ywkaras commented 4 months ago

This is where the segment violation is occurring in Yahoo Prod: https://github.com/apache/trafficserver/blob/5f52c374dfcc3dc223a7c23bd885bac64ec2398f/src/tsutil/Regex.cc#L114

cmcfarlen commented 3 months ago

Could also look at how Metrics storage lifetime outlasts the threads and see if this would work for _Data lifetime:

https://github.com/apache/trafficserver/blob/master/src/tsutil/Metrics.cc#L34-L42

ywkaras commented 3 months ago

Yes, it would not be a logic error in creating automatic, function scope instances of RegexContext when one is needed. The issue is whether the performance hit is tolerable.

On the other hand, this change involves multiple threads/cores hardware locking a variable in cache, which is also a potentially significant performance hit.

ywkaras commented 3 months ago

Could also look at how Metrics storage lifetime outlasts the threads and see if this would work for _Data lifetime:

https://github.com/apache/trafficserver/blob/master/src/tsutil/Metrics.cc#L34-L42

Because of https://github.com/apache/trafficserver/blob/0a53513961cbe55a6bfde32be208c0e04a70b0f1/src/tsutil/DbgCtl.cc#L161 , RegexContext::get_instance() cannot cause the construction of a thread_local variable with a non-trivial destructor. The Metrics code is not limited by that restriction.

I guess I should look into changing DbgCtl:::_new_reference() so it doesn't call tag_activated() with its mutex locked.

ywkaras commented 3 months ago

Closing this because I think https://github.com/apache/trafficserver/pull/11416 is better.