EvgSkv / logica

Logica is a logic programming language that compiles to SQL. It runs on DuckDB, Google BigQuery, PostgreSQL and SQLite.
https://logica.dev
Apache License 2.0
1.88k stars 95 forks source link

Occasional race condition during compilation #282

Closed rvalek closed 1 year ago

rvalek commented 1 year ago

Hi @EvgSkv!

Problem statement:

I am noticing a rather infrequent crash during compilation:

  File "/.venv/lib/python3.9/site-packages/logica/compiler/universe.py", line 457, in __init__
    rules = self.UnfoldRecursion(rules)
  File "/.venv/lib/python3.9/site-packages/logica/compiler/universe.py", line 506, in UnfoldRecursion
    annotations = Annotations(rules, {})
  File "/.venv/lib/python3.9/site-packages/logica/compiler/universe.py", line 138, in __init__
    self.annotations = self.ExtractAnnotations(
  File "/.venv/lib/python3.9/site-packages/logica/compiler/universe.py", line 396, in ExtractAnnotations
    ql = expr_translate.QL(Thrower(), ThrowException, ThrowException,
  File "/.venv/lib/python3.9/site-packages/logica/compiler/expr_translate.py", line 146, in __init__
    self.built_in_functions = copy.deepcopy(self.bulk_functions)
  File "/usr/local/lib/python3.9/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/lib/python3.9/copy.py", line 229, in _deepcopy_dict
    for key, value in x.items():
RuntimeError: dictionary changed size during iteration

(the line numbers might be a bit shifted because I caught this on a slightly outdated compiler version -- this shouldn't relevant here).

This is coming from inside the builtin copy.deepcopy, and indicates that the number of keys of the dict in self.bulk_functions was modified while it was being copied. I am getting this in a python process that may concurrently run several compilation (LogicaProgram(parsed_logica, user_flags=flags).FormattedPredicateSql(...)) threads.

So, the guess is that one thread crashes during copy.deepcopy while another is somehow modified the same dict.


Sentiment:

I know that compiler thread-safety is nowhere guaranteed, so I would totally understand if you chose to disregard issue.

That said, I chose to post here because I am a little dumbfounded: I see this error periodically happening on this threaded service, but I can't reproduce it directly. I am also out of ideas for now as to what causes it, because going through the code I reach the following conclusions:

  1. If compilation in thread 2 is modifying the dict in self.bulk_functions that is simultaneously being copied in another compilation in thread 1, then the thread 2 is at a stage of compilation that is before or after making it's own copy.deepcopy.
  2. It can't be AFTER, because once thread 2 compilation has made it's own deepcopy, it will operate on it and will not touch the original.
  3. All operations that compiler does to the bulk_functions dict BEFORE making a deep copy must've already been done, inside the thread 1 compilation (because it has reached copy.deepcopy and crashed), and looking at the code I see that once bulk_functions is populated (prior to copying), it will not be changed.

So, I can't even put a finger or a specific place where thread 2 might be mutating bulk_functions during it's own compilation to cause a crash in thread 1's compilation at copy.deepcopy.

What am I missing? Do you have an idea where I should be looking? Does the reasoning make sense? Is the whole direction wrong?

P.S.

While analysing the above, I was trying to understand the relevance of CleanDictionary, as I see it happens sometimes, but I never actually reach deletion of any keys when I try to go step-by-step with the debugger. Either way, it shouldn't be related to the above issue because CleanDictionary operates on the deepcopy after it has been made, so it can't touch the original dict, right?

Any ideas on this subject would be appreciated! Sorry for a tl;dr...

EvgSkv commented 1 year ago

Hi @rvalek , sorry, just saw the issue! I will dive in and see how complex is it to fix. Thanks for bringing it up!

EvgSkv commented 1 year ago

Hi @rvalek , can you please try ti2023 branch to see if it helps with the issue?

git clone -b ti2023 https://github.com/evgskv/logica

It does look like there might be some kind of race condition in the way main is written, but it's hard to be sure. Please let me know if the fix is works or not.

EvgSkv commented 1 year ago

PS: It should also be easy to manually patch in the change into your fork if needed: https://github.com/EvgSkv/logica/commit/7164c91b96f8d29264d9e6aa049b0937610b4673

rvalek commented 1 year ago

Hey @EvgSkv! Thank you for looking into it! We applied your suggested change to try it out. I'll let you know whether the error is gone by the end of the week (it was occurring pretty infrequently, 5-10 times a week).

rvalek commented 1 year ago

@EvgSkv sorry to report that the errors are still happening. We noticed a few within an hour of applying your change :\ The issue seems to be elsewhere. I shared my meager efforts at analysis it in my original message, in case it makes sense.

EvgSkv commented 1 year ago

Hi @rvalek it does look to me that there was a possibility of the dictionary being modified while copying, even though I can't reproduce it myself. And as I was trying to debug it further I understood that my previous fix had a bug and wasn't really a fix at all. I've pushed a new version to the same branch (https://github.com/EvgSkv/logica/commit/ec506cd7088fec575c8e007f7cc136fd663f08b4)

Can you please try patching that and let me know if it helps?

Apologies for the bug, I am sure it won't take too many iterations to have it eliminated. And thank you for your help! Evgeny.

KantorSerhiy commented 1 year ago

Hey @EvgSkv! Thanks for this solution, we tested for a week and it works for us :)

EvgSkv commented 1 year ago

Great to hear that @KantorSerhiy ! 😄

rvalek commented 1 year ago

thanks again for helping out, @EvgSkv ! closing the issue