caleb531 / automata

A Python library for simulating finite automata, pushdown automata, and Turing machines
https://caleb531.github.io/automata/
MIT License

Make Automaton instances immutable #71

Closed: caleb531 closed this issue 1 year ago

caleb531 commented 1 year ago

@eliotwrobson I want to run something by you that I've been mulling over for the past week.

Part 1: Precomputing lambda closures was a good idea

When you integrated networkx into the library back in ed1cdffcccb90a03f1892147f42ea1a9f62414f6, you added logic to precompute+cache the lambda closures for an NFA (via self._lambda_closure_dict). Previously, these lambda closures were (arguably excessively) computed on the fly when reading input via my NFA._get_lambda_closure method.

Given the performance/efficiency consequences of my previous approach, I very much appreciate your contribution here (enough so that I am planning to expose this dictionary as part of the public API, as you may have noticed in my recent commits to develop).

Part 2: Internal Automaton state

However, your changes do introduce an interesting dilemma. You see, before Automata v6 was released with your precomputed lambda closure logic, NFA instances could be mutated without consequence just like every other Automaton instance in the library.

But now that lambda_closures (a derived attribute, in database terms) is in the picture, the NFA instance could end up in a situation where the user updates states, transitions, and/or input_symbols, leaving lambda_closures stale.
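
To make the hazard concrete, here is a minimal sketch of the staleness scenario (the NFA itself is arbitrary; '' is the library's key for epsilon transitions):

from automata.fa.nfa import NFA

nfa = NFA(
    states={'q0', 'q1', 'q2'},
    input_symbols={'a'},
    transitions={
        'q0': {'': {'q1'}},   # lambda closure of q0 is {q0, q1}
        'q1': {'a': {'q2'}},
        'q2': {},
    },
    initial_state='q0',
    final_states={'q2'},
)

# Nothing currently stops this mutation...
nfa.transitions['q0'][''].add('q2')
# ...but the closures were cached at construction time, so the NFA
# still believes closure(q0) == {q0, q1} when reading input.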

As I see it, there are three solutions to this stale lambda_closures situation if we want NFAs to remain fully mutable:

  1. We watch for changes to states, input_symbols, and transitions (which is a nested structure) and automatically recompute lambda_closures when any of those attributes change; this is the ideal solution, but I'm not sure how feasible it is to implement

  2. We require the user to explicitly recompute the lambda closures (via some public method) when they choose to mutate the NFA. This is not my favorite approach, as the precomputing of lambda closures is really an implementation detail the user should not be concerned with. But it would balance implementation simplicity with performance.

  3. We revert to caching nothing and computing the lambda closures on the fly when reading input; naturally, this would re-introduce a significant performance consequence, especially if you want to read many input strings with the same NFA instance

Of course, if we make NFAs fully immutable, then this problem goes away, and we don't need to worry about recomputing lambda_closures post-instantiation.

Part 3: Mutable or immutable?

So far, my blurb above is focused primarily on NFAs, but I think it opens up a larger question about whether any automaton should be a mutable structure, or whether it should be immutable. My ultimate aim for this library is to have a consistent and friendly API that makes working with these data structures pleasant and practical. And right now, NFAs are inconsistent with the rest of the Automaton sub-types because NFAs have some additional internal state (i.e. lambda_closures) which must be managed.

Questions for you

Therefore, I will sum up my musings with a few questions regarding your own use of this library:

  1. How frequently are you reading input with the same automaton?
  2. How frequently (if ever) do you mutate an automaton after its instantiation?
  3. What are your thoughts about making all automata instances immutable?
  4. Anything else you want to add?

@abhinavsinha-adrino @Tagl If you guys happen to have a moment, I would love any feedback you have as well. This library isn't any good if it doesn't serve real-world use cases for working with these machines.

Much thanks in advance, Caleb

eliotwrobson commented 1 year ago

@caleb531 to give you some insight, I'll discuss our use cases in a little more detail and why I made this particular change (along with some of the other PRs I've been submitting in the last few months).

Use Cases

This package (specifically the NFA and DFA classes) is what's driving the autograder code for problems where students are asked to write DFAs, NFAs, and regular expressions that match a given reference solution. Since we want to check that student answers are equivalent to reference solutions as sets, we convert each of these entities to DFAs and use the __eq__ function to check equivalence (this is why we wrote an optimized equality-checking function, submitted as part of v6).
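
As a minimal sketch of that pipeline (the helper name is made up; DFA.from_nfa is the library's conversion method):

from automata.fa.dfa import DFA
from automata.fa.nfa import NFA

def answer_matches_reference(student_nfa: NFA, reference_dfa: DFA) -> bool:
    # Convert the student's answer to a DFA, then compare as languages;
    # DFA.__eq__ runs the optimized equivalence check mentioned above.
    return DFA.from_nfa(student_nfa) == reference_dfa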

Importance of Precomputing Lambda Closures

Although precomputing the lambda closures makes reading strings more efficient (as you stated), the real benefit comes in the use cases I outlined above. In the NFA to DFA conversion algorithm, the lambda closures were being computed on-the-fly for each state in each subset of states as they were being processed. Since conversion is an exponential time algorithm, not having the lambda closures cached made converting even moderately large NFAs prohibitively slow. This was especially a problem when parsing regular expressions, since those tend to produce large NFAs with a significant number of epsilon transitions.
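
To illustrate where the cache pays off, here's a sketch of a single subset-construction step (standalone names, not the library's internals):

from itertools import chain

def next_subset(current_subset, symbol, transitions, lambda_closures):
    # Move every state in the current subset on `symbol`...
    moved = chain.from_iterable(
        transitions.get(state, {}).get(symbol, set())
        for state in current_subset
    )
    # ...then expand by the lambda closures. With closures cached, the
    # expansion is a dict lookup per state; computed on the fly, it is
    # a fresh graph traversal per state in every subset processed.
    return frozenset(chain.from_iterable(
        lambda_closures[state] for state in moved
    ))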

With respect to the solutions you pose in part 2, this really makes number 3 infeasible (since the new regex parsing code is optimized for large regular expressions).

Mutability

With respect to mutability, I actually converted our fork to use immutable automata a few months ago (disabling the __setattr__ methods). In doing this conversion, I don't think there were any instances of client code mutating an automaton after initial construction (outside of test cases meant to test for invalid inputs). The main effort was in converting the algorithms in the library to not change the automata after initial construction. I think the reason for this is that it's much easier to manipulate automata using the built-in operations (subtraction, union, etc.) than by manually changing the internal data stored. I think this could be made even easier by the inclusion of a few basic DFA creation functions (e.g. make a DFA that accepts a finite language given as input, a DFA that accepts all strings containing a target substring, etc.). An example of another package including some of these functions can be found here.

Summary

I realize that was a bit long, but hopefully it gives some insight into how this code is getting used in the wild. To address each of your questions specifically:

  1. In our grading, we generate and read a bunch of strings as potential counterexamples for automata students provide as input. This is our code specifically:
    
    false_positives: List[str] = []
    false_negatives: List[str] = []

    for bitstring in strings_of_length_at_most_n(
            0, max_length_to_check, alphabet=submitted_dfa.input_symbols):
        accepted_by_reference_DFA = reference_dfa.accepts_input(bitstring)
        accepted_by_submitted_DFA = submitted_dfa.accepts_input(bitstring)

        if not accepted_by_reference_DFA and accepted_by_submitted_DFA:
            false_positives.append(bitstring)
        elif accepted_by_reference_DFA and not accepted_by_submitted_DFA:
            false_negatives.append(bitstring)

2. I couldn't find any instances of mutating an automaton in a significant way in our codebase, outside of using the built-in operators (which produce a new automaton, and can be easily rewritten to respect the immutability).
3. I think making all automaton instances immutable is such a good idea I already did it! I don't think there's any real use case for mutating an automaton where a new one can be created instead (copying the sets used in other automata if needed). We didn't do this, but you can more rigidly enforce this immutability by using a frozendict (https://pypi.org/project/frozendict/), which prevents accidental modification (see the small demonstration after this list). Furthermore, I think it's a bit strange to allow mutable automata while having validation functions designed to prevent the client code from instantiating an invalid automaton (which can then be made invalid by changing things later).
4. Just wanted to add that, since we've made some of these changes, I can provide some of the code we used / changes made to the DFA / NFA methods to comply with the immutability requirement. Though it may be considerably more effort since this requires changing your tests, and we only have code for DFA / NFA (our fork is targeted towards regular languages).
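
To demonstrate the frozendict point above, item assignment is rejected outright:

from frozendict import frozendict

transitions = frozendict({'q0': frozendict({'0': frozenset({'q0'})})})
try:
    transitions['q0'] = frozendict()
except TypeError as err:
    print(err)  # item assignment is not supported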

Thanks for getting this far, and I hope that I've been able to provide some insight on where you'd like to take the package!

caleb531 commented 1 year ago

@eliotwrobson Wow, thank you for the thorough analysis! I am definitely leaning towards your view of making all automaton instances immutable, which would definitely represent the next natural evolution of this library. You also make an excellent point about the validation, as this is something that's always bothered me a little deep down. 🙃

Considerations with immutability

Okay, so if we are on board to making automaton types immutable, there are several considerations that come to mind:

  1. This immutability initiative would not just apply to DFA and NFA, but should apply to GNFA, DPDA, NPDA, DTM, NTM, and MNTM (basically any type that ultimately inherits from Automaton); so all of those classes would need to be refactored as well
  2. As you identified, certain tests would need to be rewritten, mainly those that simply mutate the automaton instance to achieve the desired scenario to test against. This refactor would create a lot more duplication of automaton instantiation code with slight tweaks for each respective test case
  3. I did take a look at frozendict, and I think it would definitely be useful for this initiative, although we would need to apply it to the transitions dict in a way that covers the sub-dicts as well.
  4. Hopefully, all of this new initialization overhead with the type conversions will not impact performance too much 😅

One more thing: dataclasses

I would also like to consider Python's dataclasses module, particularly the @dataclass decorator with frozen=True to enforce the immutability for all of these automaton subclasses, as well as add other features which these automaton types have been lacking for some time (like prettier __str__ output).

It will definitely involve additional effort to properly initialize all the attribute values (e.g. convert given transitions dict to nested frozendicts). I think this can be done with dataclasses.field(default_factory=...). But I think it will definitely be worth it.

I think we can also refactor the already-immutable namedtuple-based classes (like automata.pda.stack) to use @dataclass(frozen=True) instead. That would create more uniformity and would probably make the code a bit cleaner, too.


Where we go from here

Since this represents such a major change (both from an end-user standpoint and also as an implementation effort), I've created a new immutability branch for us to work on these changes (of course, only if you have the bandwidth to contribute what you have). My thought is to use this branch to build out a working implementation of the new immutable API.

Along those lines, I will gladly review any PRs you can send my way for tests, classes, etc.

Okay, those are my thoughts! Thank you for reading this essay of mine 😂.

eliotwrobson commented 1 year ago

@caleb531 Sounds like a plan! I can start copying over pieces of our code that consist of compatibility rewrites, then we can switch over to enforcing the internal immutability later on. A lot of the internal refactors will only apply to operations between objects, so the other classes that don't have as many of these shouldn't require as much work.

With respect to dataclasses, I actually don't know if they're going to save as much boilerplate writing as we might think. It's pretty easy to mimic the behavior of setting frozen=True by overriding some methods, and a bunch of the other default behaviors will need to be disabled (for example, the autogenerated ordering algorithms). But we can cross that bridge when we come to it. The first real step is rewrites to respect immutability.

Tagl commented 1 year ago

I have not been working with the library for quite some time. However, I never had to alter the automata I used after creation; I worked with them as if they were immutable. If I needed to mutate, I would perform all mutations myself; otherwise I used the union/intersect/etc. methods to construct more complicated DFAs. I must note, though, that I was working with permutation patterns, so my performance bottleneck (after improving the time complexity in the library) was the combinatorial explosion of the set of permutations rather than anything to do with the state machines. I think immutability is a good choice.

caleb531 commented 1 year ago

@eliotwrobson

With respect to dataclasses, I actually don't know if they're going to save as much boilerplate writing as we might think. It's pretty easy to mimic the behavior of setting frozen=True by overriding some methods, and a bunch of the other default behaviors will need to be disabled (for example, the autogenerated ordering algorithms). But we can cross that bridge when we come to it. The first real step is rewrites to respect immutability.

Well, I was thinking that @dataclass would already be a nice substitute for the namedtuples used in the PDA stack/configuration classes, so it might be worth adopting for the other library primitives as well. It might also be a way to gradually introduce type annotations into the library.

Regarding the autogenerated ordering, the docs say that the default for the order param is False (which is great), although eq defaults to True (which we will need to override so as to use the proper Hopcroft-Karp equivalence algo). That's not too bad, though.

But besides that, dataclasses offer several other features I find desirable.

To simplify this across automaton classes, we could also create our own ready-made decorator wrapper using functools.partial.

# maybe put this in automata.base.utils?
import functools
from dataclasses import dataclass

automaton_dataclass = functools.partial(dataclass, eq=False, kw_only=True)

Finally, to run the validate method after initialization, we can use the dataclass post-init processing feature. This will allow us to consolidate the calls to validate because we can move it to a higher subclass like FA or Automaton.
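
A minimal sketch of that consolidation (assuming each concrete subclass implements validate()):

from dataclasses import dataclass

@dataclass(frozen=True)
class Automaton:

    def __post_init__(self):
        # Called automatically by the generated __init__ of any
        # dataclass subclass, so every instance is validated once
        self.validate()

    def validate(self):
        raise NotImplementedError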

Is this all ambitious? Definitely, but so is the move towards immutability, and I think it's worth exploring while we're making that transition. Plus, I'm personally fascinated by dataclasses and their goal of reducing boilerplate.

Anyway, that's me thinking out loud, mostly. I agree that we should wait to start any dataclass work until after the core immutability effort is achieved for all automaton types (PDA and TM, included).

caleb531 commented 1 year ago

@eliotwrobson So I've been playing around with dataclasses, and I've reached a dilemma: because we are type-casting the input parameters, the types of the input arguments are not necessarily the same as the types of the attributes that end up on the automaton. This ambiguity seems to confuse type checkers (like mypy) in my testing.

For example, a DFA can currently accept any iterable sequence as the value for the states parameter, which later gets converted to a frozenset. So for the dataclass definition, which of the following do I specify as the type of states?

  1. states: Iterable
  2. states: set
  3. states: frozenset
  4. states: set | frozenset

We could technically remove the type annotations altogether, but I feel that defeats the purpose.

And what if someone passes a tuple as the states parameter? We want to allow that too, since that will get converted to frozenset anyway (and is just more flexible). So I guess the final type signature could look something like:

states: Iterable

Example

Here's an example script I've been playing with to assess the benefits/drawbacks of dataclasses:

import functools
from dataclasses import dataclass
from typing import Hashable
from frozendict import frozendict  # type: ignore

automaton_dataclass = functools.partial(
    dataclass,
    frozen=True,
    eq=False, kw_only=True)

class frozen_transition_dest():
    """Convert a set/frozenset transition destination into a frozenset;
    leave any other destination value untouched."""
    def __new__(cls, dest):
        if isinstance(dest, (set, frozenset)):
            return frozenset(dest)
        else:
            return dest

class frozen_transition_dict():
    """Recursively freeze a transitions dict into nested frozendicts."""
    def __new__(cls, transitions):
        return frozendict({
            state: frozendict({
                symbol: frozen_transition_dest(dest)
                for symbol, dest in paths.items()
            })
            for state, paths in transitions.items()
        })

@dataclass(frozen=True, eq=False, kw_only=True)
class NFA():

    states: set | frozenset
    input_symbols: set | frozenset
    transitions: dict[Hashable, dict] | frozendict[Hashable, frozendict]
    initial_state: Hashable
    final_states: set | frozenset

    def __post_init__(self):
        object.__setattr__(
            self, 'transitions', frozen_transition_dict(self.transitions))

nfa = NFA(
    states={'q0', 'q1', 'q2'},
    input_symbols={'0', '1'},
    transitions={
        'q0': {'0': {'q0'}, '1': {'q1'}},
        'q1': {'0': {'q0'}, '1': {'q2'}},
        'q2': {'0': {'q2'}, '1': {'q1'}}
    },
    initial_state='q0',
    final_states={'q1'}
)
assert type(nfa.transitions) == frozendict
assert type(nfa.transitions['q0']) == frozendict
assert type(nfa.transitions['q0']['0']) == frozenset

Anyway, what are your thoughts on this? Do you have a suggestion for the type signature? Is this just evidence that dataclasses are getting in the way? (I'd like to believe not, but you tell me)

Update

I'm realizing there are other problems with dataclasses, like the ugly use of object.__setattr__ to type-cast automaton attributes to their correct types (e.g. frozenset and frozendict). So maybe I really should avoid integrating them for now and focus on the goal of core immutability as you have been. 😅

Could you please provide the __setattr__ override you have been using to disallow assignment to automaton attrs?

eliotwrobson commented 1 year ago

@caleb531 your discussion above is why I opted against dataclasses for our local implementation: there are just too many things you have to work around for it to be worthwhile (and these are all things that can be done from scratch without much trouble).

As for our local implementation, we have these overrides in the base automaton class:

  # (assumes: from typing import Any, NoReturn)
  def __delattr__(self, name: str) -> NoReturn:
      "Set custom delattr to make class immutable."
      raise AttributeError(f'This {type(self).__name__} is immutable')

  def __setattr__(self, name: str, val: Any) -> NoReturn:
      "Set custom setattr to make class immutable."
      raise AttributeError(f'This {type(self).__name__} is immutable')

Then, unfortunately in the initialization functions we have to use the object.__setattr__ function, for example object.__setattr__(self, "states", states.copy()). However, the alternative is to have a single boolean in the base class that determines whether the class gets frozen, and that ultimately seemed like more work than it was worth. See: https://medium.datadriveninvestor.com/immutability-in-python-d57a3b23f336
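
For reference, here is a sketch of that boolean-flag alternative (attribute names assumed):

from typing import Any

class Automaton:

    def __init__(self, **kwargs: Any) -> None:
        for name, value in kwargs.items():
            setattr(self, name, value)  # allowed: not frozen yet
        self._frozen = True             # the last unguarded assignment

    def __setattr__(self, name: str, value: Any) -> None:
        # Assignments pass through until _frozen flips to True
        if getattr(self, '_frozen', False):
            raise AttributeError(f'This {type(self).__name__} is immutable')
        object.__setattr__(self, name, value)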

EDIT: To give more detail on what I meant above, one of the big issues with dataclasses here is that the types we want to give to the constructors (arbitrary iterables) and the types that the automata hold internally (frozensets) are different, and that's one of the things that dataclasses don't play well with. One of the main features of dataclasses is autogenerating boilerplate constructors, but in this case it's about the same amount of effort to write the constructors by hand (since the input types differ from the internal ones). Not to mention this won't play well with client code if they're using static typing.

caleb531 commented 1 year ago

@eliotwrobson:

your discussion above is why I opted against dataclasses for our local implementation, there's just too many things you have to work around for it to be worthwhile (and these are all things that can be done from scratch without that much trouble).

Ha yes indeed. Thank you for your patience while I learned that the hard way. 😅😜

Then, unfortunately in the initialization functions we have to use the object.__setattr__ function, for example object.__setattr__(self, "states", states.copy()). However, the alternative is to have a single boolean in the base class that determines whether the class gets frozen, and that ultimately seemed like more work than it was worth

Ever since I read this caveat in my email this morning, I've been thinking about a best-of-both-worlds solution, and I think I have it. Effectively, we can have one of the automaton parent classes handle the object.__setattr__ iteratively, and have each subclass pass its specific input parameters to the parent __init__ as keyword args.

Let me pseudocode this for reference:

from frozendict import frozendict

class Automaton():
    def __init__(self, **kwargs):
        for attr_name, attr_value in kwargs.items():
            object.__setattr__(self, attr_name, attr_value)
        self.__post_init__()

    def __post_init__(self):
        self.validate()

class FA(Automaton):
    pass

class NFA(FA):
    def __init__(self, *, states, input_symbols, transitions, initial_state, final_states):
        super().__init__(
            states=frozenset(states),
            input_symbols=frozenset(input_symbols),
            transitions=frozendict(transitions),
            initial_state=initial_state,
            final_states=frozenset(final_states)
        )

    def __post_init__(self):
        self.recompute_lambda_closures()
        super().__post_init__()

I'll implement this now to make sure this actually works, but I'm optimistic that something like this will be the way to go.

caleb531 commented 1 year ago

@eliotwrobson Great news! Not only do I have DFA, NFA, and GNFA fully immutable, but they are also fully hashable. But one small caveat has come up:

The Equality Problem

While reading up on the __hash__ method, I discovered a Python requirement stating that two objects which compare equal must also have the same hash. Currently, this holds for NFAs because I implemented a strict value-based equality check.

User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).

However, the DFA __eq__ method currently uses the Hopcroft-Karp algorithm to check for equivalence, which opens up the possibility of two DFAs with different hashes being "equal" because they are considered equivalent per this algorithm.
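
Here's a toy class (not from the library) showing how violating that contract breaks set/dict behavior:

class Box:

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # "equivalence", analogous to Hopcroft-Karp on DFAs
        return self.value % 10 == other.value % 10

    def __hash__(self):
        # value-based, analogous to hashing an automaton's attributes
        return hash(self.value)

a, b = Box(3), Box(13)
assert a == b
print(len({a, b}))  # 2: "equal" objects occupy two set slots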

Proposed Solution

So it almost seems to me like we should rename the current DFA.__eq__ method (which uses this formal 'equivalence-based' approach) to something like is_equivalent. Then we can reserve __eq__ for strict value-based equality like I described above, which will ensure it's consistent with the output of __hash__.

What do you think?

Tagl commented 1 year ago

Sad that we cannot use the equals operator in a mathematical sense, but if you want them to be hashable, then I think there's no way around it.

eliotwrobson commented 1 year ago

@caleb531 I left a brief note on one of your commits, and I think there are a few ways you could take things. From a theoretical POV, the most natural way to view an automaton (especially DFAs and NFAs, which can be manipulated easily) is as an abstract representation of a set (namely the set of strings it accepts, possibly infinite). Things like the state names, and really the transitions themselves, are more "implementation details" than anything else. Thus, I think it's superfluous (and not really appropriate) to support functionality that checks whether things like the state names are identical (think of two sets in Python, for example {1, 2} == {2, 1}, as their contents are the same).

However, I understand that this viewpoint means there's no good way to implement the __hash__ function (as the hash would have to have some fingerprint for the set the automaton represents), but I'm also not quite sure what the point of having a hash function would be (since using these as something like dict keys is akin to having a set() as a dict key).

Tagl commented 1 year ago

I, however, don't think it is necessary that they be hashable. I have used sets as keys many times, but those were all finite sets. The only use case I have in mind for my work would require the hashes to be equal if the (possibly infinite) sets represented are equal.

Tagl commented 1 year ago

Perhaps we can figure out some decent hashing function, even a slow one, that would allow us to handle that use case. But the only ones I can think of at the moment are bound to have many collisions. So I would vote that the equals operator remain mathematical, and that we instead try to be smart about the hashing if we want that.

caleb531 commented 1 year ago

@eliotwrobson @Tagl You have some good points about practical usage and the underlying theory, which are both important to me. And another important factor for me is creating a library that fits into Python's orthogonal language design, as it makes the API more friendly for Python developers to consume. To me, it's all a balance, and I want to consider your feedback carefully.

Bear with me: I will get to addressing hashing and equality, but I first need to lay the groundwork for my POV:

How immutable?

This whole conversation started with immutability, and making all automaton types immutable. From the start, I knew I wanted this goal to include freezing the individual automaton attributes (e.g. converting states to a frozenset, etc.), beyond simply locking down __setattr__ and __delattr__. @eliotwrobson convinced me of this with regards to validation:

I think it's a bit strange to allow mutable automata while having validation functions designed to prevent the client code from instantiating an invalid automaton (which can then be made invalid by changing things later).

If we disallow dfa.states = {} but still allow dfa.states.add('non_existent_state'), then we still have a mutable primitive that arguably necessitates validating again later.
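
For example, freezing the attribute itself closes that hole:

states = frozenset({'q0', 'q1'})
try:
    states.add('non_existent_state')
except AttributeError as err:
    print(err)  # 'frozenset' object has no attribute 'add'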

Now to the hashing part

So what does all that have to do with hashing and equality? A few things:

  1. Given my reasoning above, we've made DFA/NFA instances fully immutable all the way down, attributes included. And as a side effect, those attributes all happen to be fully hashable as well. But if the automaton instance is not itself hashable, then (I would argue) it's inconsistent with Python's existing design. If you think about all of Python's built-in immutable data types (tuple, str), or third-party ones like frozendict, they are all immutable and hashable. So I feel a nudge to follow that convention.

  2. If I were to keep DFA/NFA as unhashable types, this would set a precedent that all __eq__ methods should strictly be used for proper equivalence checking rather than value-based equality: DFA.__eq__ is already like that, so we would implement NFA.__eq__ the same way, and any __eq__ we add to other automaton types in the future would follow suit. But that would lock us into a situation where, from an API change standpoint, it would be much more difficult to introduce hashable automaton types if that were ever desired in the future.

  3. I can understand the appeal of using == for equivalence-checking—it's succinct, readable, and consistent with other overridden operators like + for NFA concatenation. But == / __eq__ does carry special meaning in Python, and is used in many places to unify the language.

  4. If I do decide to switch == to be strictly value-based equality, then the DFA equivalence-based algo would still be available, just via a method like is_equivalent_to

Conclusion

Nothing's set in stone yet while we're still having this discussion. I am trying to listen thoughtfully to your points of view, especially since you guys have contributed amazing work to this project. I am just trying to consider these big decisions from every angle, and I believe this discussion will help in making the best-informed decision.

eliotwrobson commented 1 year ago

@caleb531 thank you for considering our input, and for your work on this project as well! I think that I'm inclined to agree with @Tagl : we should really have a hash function that somehow represents the underlying set if we do it at all. Even though leaving hashing out is somewhat inconsistent with the way Python's immutable types work, the alternative is a hash function that relies on (effectively) implementation details instead of the object being represented (that being the accepted language). Although having a hash function would definitely be nice in this circumstance, I think that implementing internal value-based checking at all really puts the focus in the wrong place (away from the language that's being accepted). Unless this causes some issues with the language itself, I'd strongly prefer to leave it out (and anyway, if this functionality is needed, it can be implemented by the client code without too much trouble).

However, I actually think that an interesting hash function respecting this could be developed without too much trouble.

eliotwrobson commented 1 year ago

Unrelated, but @Tagl if you have code that's being used for creating Automaton instances representing a finite language, could you include it in an issue or submit a PR? I think it would be great for the library to include some standard creation functions for people.

caleb531 commented 1 year ago

@eliotwrobson Thank you for your feedback—I think I finally understand what you mean regarding the automaton attrs being implementation details and how having a hash function based on those attrs would not really be representative of the underlying language that the automaton instances actually represent.

With that, you have persuaded me that a value-based equality algorithm would not be very useful for this library. I also suppose if someone ever wanted to hash an automaton based on its attrs, they could compute the hash via something like frozendict(dfa.__dict__).

However, I do have a few requests if we are going to embrace the theoretical/equivalence-based approach:

  1. Implementing an __eq__ equivalence algorithm for NFAs, if one exists; right now, DFAs are the only automaton type with such an equivalence algorithm implemented
  2. Implementing more operations for DFA; right now, NFA has union, concatenation, and reversal operations, and I would love to see DFA have those operations as well
  3. Implementing __le__ and __ge__ for NFA, since DFA already has those (again, assuming this is possible)
  4. If possible, implement the hashing algorithm you describe; a quick search brought me to here and here, but both you guys obviously know what you're doing, so I will leave it up to you to decide on the right algo

What do you think? Are these things doable? Do you have the time for these additions?

eliotwrobson commented 1 year ago

@caleb531 With regard to your items:

  1. There is an __eq__ algorithm for NFAs; as I mentioned in another comment, it's in the same paper as the one linked in the algorithm for DFAs. The algorithm is almost identical to the DFA algorithm, besides a change to the way transitions are accessed. I can try to take a look later, but it would be great if you could take a stab at it.
  2. Some of these might be tricky, but there are definitely more algorithms out there. For some, they effectively boil down to translating the DFAs to NFAs and then converting back to a DFA.
  3. See above; a native version might be possible, but these are certainly doable after conversion.
  4. Will take a look at some of the links you posted when I have time, but adding hashing might be doable.

In the same vein, I think we should try and incorporate some constructions from other theory-oriented packages dealing with *FAs, including this Mathematica package and this Java library. I can definitely help out with implementing some of this (though I have yet to download Mathematica to see what's going on in that package).

Tagl commented 1 year ago

Unrelated, but @Tagl if you have code that's being used for creating Automaton instances representing a finite language, could you include it in an issue or submit a PR? I think it would be great for the library to include some standard creation functions for people.

I don't have anything that's general written down in code but I could add a method that constructs a DFA from a finite set of strings. The DFA is constructed as a prefix tree, then it can be minimized if desired. Aho-Corasick is also relevant here.

However, I do have a few requests if we are going to embrace the theoretical/equivalence-based approach:

I think for items 1-3 we can simply convert between DFA and NFA as needed to start with, and research improvements later.

caleb531 commented 1 year ago

@eliotwrobson

Sure, I can take a stab at the NFA equivalence implementation. I am attaching the paper I am following for reference (I found it in your docstring for DFA.__eq__): Regular Language Equivalence.pdf

As easy as it would be to convert the NFA to a DFA and then perform the already-implemented operation, I generally want to see if native implementations can be written for each automaton type independently. If that ends up not being possible, so be it. But I expect minimizing conversions to DFA will result in better performance, in most cases.

eliotwrobson commented 1 year ago

@caleb531 Thank you! Also, I think you're right, minimizing the conversions can really improve things. For example, there's an algorithm to compute the intersection of the language of two NFAs that might be worth including: https://www3.nd.edu/~dchiang/teaching/theory/2020/hw/hw2.pdf

EDIT: Also @Tagl if you have that code sitting around that would be awesome, especially if it implements the Aho-Corasick algorithm directly. Could you post here or in a PR?

Tagl commented 1 year ago

@eliotwrobson I can make a PR but I'll have to add some tests then. Until then here's a prefix tree version. I only have Aho-Corasick written in C++.

    @classmethod
    def from_word_set(cls, word_set, input_symbols):
        """
        Initialize this DFA with the finite language defined by word_set and input_symbols
        """
        dfa_initial_state = 0
        dfa_error_state = 1
        initial_transitions = {symbol: dfa_error_state for symbol in input_symbols}
        dfa_states = {0, 1}
        dfa_symbols = input_symbols
        dfa_transitions = {state: initial_transitions.copy() for state in dfa_states}
        dfa_final_states = set()
        for word in word_set:
            state = dfa_initial_state
            for symbol in word:
                next_state = dfa_transitions[state][symbol]
                if next_state == dfa_error_state:
                    next_state = len(dfa_states)
                    dfa_transitions[state][symbol] = next_state
                    dfa_states.add(next_state)
                    dfa_transitions[next_state] = initial_transitions.copy()
                state = next_state
            dfa_final_states.add(state)
        return cls(states=dfa_states, input_symbols=dfa_symbols,
                   transitions=dfa_transitions, initial_state=dfa_initial_state,
                   final_states=dfa_final_states)

eliotwrobson commented 1 year ago

I think this is enough; I can take it from here. Thank you!

caleb531 commented 1 year ago

Just an update: I am making good progress on the NFA equality algorithm. I still have to compute the transitions properly, but I think I know what I can do for that, which I will try out tomorrow morning.

caleb531 commented 1 year ago

@eliotwrobson Good news! I think I have the NFA equivalence algorithm working! Although I'd like you to review it to make sure I have the algorithm right (but only when it's next convenient for you—please don't let me disturb your weekend).

If I wrote this correctly, the algorithm is indeed very similar to the DFA equivalence. The key was to use frozensets as the first value in each pair tuple, and also to use the existing NFA._get_next_current_states to evaluate the transitions (which conveniently takes care of the lambdas for us).

Because we're dealing with sets, I also had to make a change to the is_final_state helper function to take the difference between the set of final states and the set of current states (assuming that at least one current state must be final in order for that check to pass).

Please see the latest commits on the immutability branch for the latest version of the algorithm.

P.S. I also cleaned up the DFA equivalence algorithm, replacing the origin_enum conditionals with a dictionary-based lookup. It also now uses the existing DFA._get_next_current_state method to compute the transitions.

eliotwrobson commented 1 year ago

@caleb531 no worries! I'm not busy this weekend. I'm happy to take a look. Per the discussion on that commit, if the algorithm isn't working for testing regex equivalence, then there's a subtle issue in the implementation somewhere. Like a few of the other algorithms, this one is a bit tricky to implement correctly (and why large test cases generated by things like regexes are helpful for debugging).

EDIT: From looking at the original paper, it seems like the NFAs they define don't use epsilon transitions? There could be some subtlety to how that works. For example, in your algorithm you don't take the epsilon closure of the initial states (which might be necessary to make things work right). This might have come up during the regex equivalence because those NFAs tend to have lots of epsilon transitions. You could also try running the epsilon transition removal before equivalence, just to confirm it's an issue related to that.

caleb531 commented 1 year ago

@eliotwrobson Hm, trying a few of those things now, without success. Here's my diff:

diff --git a/automata/fa/nfa.py b/automata/fa/nfa.py
index 9c7fbd7..bc41f4a 100644
--- a/automata/fa/nfa.py
+++ b/automata/fa/nfa.py
@@ -526,24 +526,30 @@ def __eq__(self, other):
         https://arxiv.org/abs/0907.5058
         """

+        first = self.eliminate_lambda()
+
         origin_automata = {
-            OriginEnum.SELF: self,
+            OriginEnum.SELF: first,
             OriginEnum.OTHER: other
         }

         # Must be another NFA and have equal alphabets
-        if not isinstance(other, NFA) or self.input_symbols != other.input_symbols:
+        if not isinstance(other, NFA) or first.input_symbols != other.input_symbols:
             return NotImplemented

         # Get new initial states
-        initial_state_a = (frozenset({self.initial_state}), OriginEnum.SELF)
-        initial_state_b = (frozenset({other.initial_state}), OriginEnum.OTHER)
+        initial_state_a = (first.lambda_closures[first.initial_state], OriginEnum.SELF)
+        initial_state_b = (other.lambda_closures[other.initial_state], OriginEnum.OTHER)

         def is_final_state(states_pair):
             states, origin_enum = states_pair
             # If at least one of the current states is a final state, the
             # condition should satisfy
-            return bool(origin_automata[origin_enum].final_states - states)
+            for state in states:
+                automaton = origin_automata[origin_enum]
+                if automaton.lambda_closures.get(state) in automaton.final_states:
+                    return True
+            return False

         def transition(states_pair, symbol):
             states, origin_enum = states_pair
@@ -567,7 +573,7 @@ def transition(states_pair, symbol):
             if is_final_state(q_a) ^ is_final_state(q_b):
                 return False

-            for symbol in self.input_symbols:
+            for symbol in first.input_symbols:
                 r_1 = state_sets[transition(q_a, symbol)]
                 r_2 = state_sets[transition(q_b, symbol)]

The same regex test still fails, but nothing else fails. Regardless, my existing example NFAs for the NFA equality test are probably not great examples—I just made them the same NFA with different state names. Do you have any better examples you could provide?

eliotwrobson commented 1 year ago

@caleb531 in the diff you posted, you need to call eliminate_lambda on both self and other. I should also have some small interesting test cases lying around; I'll take a look.

caleb531 commented 1 year ago

@eliotwrobson Correction: it looks like the original test that was failing was actually trying to check equality.

All those assertions are passing, but now the one failing assertion is the check for non-equality:

FAIL: test_isequal (tests.test_regex.TestRegex)
Should correctly check equivalence of two regular expressions
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/caleb/Repositories/personal/automata/tests/test_regex.py", line 45, in test_isequal
    self.assertFalse(re.isequal('baaa*b(b|a)|(bab(b|a)|(bb|ba))',
AssertionError: True is not false

caleb531 commented 1 year ago

@eliotwrobson Good call about calling eliminate_lambda() on other. Unfortunately, that didn't change anything either. The assertFalse case is still failing.

first = self.eliminate_lambda()
other = other.eliminate_lambda()

origin_automata = {
    OriginEnum.SELF: first,
    OriginEnum.OTHER: other
}
....
FAIL: test_isequal (tests.test_regex.TestRegex)
Should correctly check equivalence of two regular expressions
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/caleb/Repositories/personal/automata/tests/test_regex.py", line 45, in test_isequal
    self.assertFalse(re.isequal('baaa*b(b|a)|(bab(b|a)|(bb|ba))',
AssertionError: True is not false

----------------------------------------------------------------------
Ran 296 tests in 0.741s

FAILED (failures=1)

eliotwrobson commented 1 year ago

@caleb531 Maybe try asserting NFA equality before and after running eliminate_lambda?

eliotwrobson commented 1 year ago

With respect to our test cases, we have a few small cases checking that certain regexes are equivalent to some small NFAs. They're formatted for pytest, but you should be able to call the construction code directly:

import pytest

from automata.fa import dfa, nfa

@pytest.fixture
def regex_1_nfa() -> nfa.NFA:
    """
    Returns an NFA matching the regex ((01) | 1)*((0*1) | (1*0))((10) | 0)*.
    """
    return nfa.NFA(
        states={'s', 'a', 'b', 'c', 'd', 'e'},
        input_symbols={'0', '1'},
        transitions={
            's': {'0': {'a'}, '1': {'s'}, '': {'b', 'd'}},
            'a': {'1': {'s'}},
            'b': {'0': {'b'}, '1': {'c'}},
            'c': {'0': {'c'}, '1': {'e'}},
            'd': {'0': {'c'}, '1': {'d'}},
            'e': {'0': {'c'}},
        },
        initial_state='s',
        final_states={'c'}
    )

@pytest.fixture
def regex_2_nfa() -> nfa.NFA:
    """
    Returns an NFA matching the regex ((1(010)*(11)*) | ((010)*).
    """
    return nfa.NFA(
        states={'s', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'},
        input_symbols={'0', '1'},
        transitions={
            's': {'0': {'g'}, '1': {'a'}},
            'a': {'0': {'b'}, '': {'d'}},
            'b': {'1': {'c'}},
            'c': {'0': {'a'}},
            'd': {'1': {'e'}, '': {'f'}},
            'e': {'1': {'d'}},
            'f': {'':  {'s'}},
            'g': {'1': {'h'}},
            'h': {'0': {'f'}},
        },
        initial_state='s',
        final_states={'s'}
    )

@pytest.fixture
def regex_3_dfa() -> dfa.DFA:
    """
    Returns a DFA matching the regex (0(0 | 1)*0) | (1(0 | 1)*1).
    """
    return dfa.DFA(
        states={'s', '0', '1', '00', '01', '10', '11'},
        input_symbols={'0', '1'},
        transitions={
            's':  {'0': '0',  '1': '1'},
            '0':  {'0': '00', '1': '01'},
            '1':  {'0': '10', '1': '11'},
            '00': {'0': '00', '1': '01'},
            '01': {'0': '00', '1': '01'},
            '10': {'0': '10', '1': '11'},
            '11': {'0': '10', '1': '11'},
        },
        initial_state='s',
        final_states={'00', '11'}
    )

@pytest.fixture
def regex_4_dfa() -> dfa.DFA:
    """
    Returns a DFA matching the regex ((0 | 1)*00) | ((0 | 1)*11).
    """
    return dfa.DFA(
        states={'s', '0', '1', '00', '01', '10', '11'},
        input_symbols={'0', '1'},
        transitions={
            's':  {'0': '0',  '1': '1'},
            '0':  {'0': '00', '1': '01'},
            '1':  {'0': '10', '1': '11'},
            '00': {'0': '00', '1': '01'},
            '01': {'0': '10', '1': '11'},
            '10': {'0': '00', '1': '01'},
            '11': {'0': '10', '1': '11'},
        },
        initial_state='s',
        final_states={'00', '11'}
    )

@pytest.fixture
def regex_5_nfa() -> nfa.NFA:
    """
    Returns a NFA matching the regex ((((01)*0) | 2)(100)*1)*((1*) | (0*2*)).
    """
    return nfa.NFA(
        states={'s', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'},
        input_symbols={'0', '1', '2'},
        transitions={
            's': {'': {'a', 'f', 'g'}, '2': {'c'}},
            'a': {'0': {'b', 'c'}},
            'b': {'1': {'a'}},
            'c': {'1': {'s', 'd'}},
            'd': {'0': {'e'}},
            'e': {'0': {'c'}},
            'f': {'1': {'f'}},
            'g': {'0': {'g'}, '': {'h'}},
            'h': {'2': {'h'}}
        },
        initial_state='s',
        final_states={'f', 'h'}
    )

These don't use a ton of epsilon transitions but might help debug. Also I translated the regular expression syntax (since our regexes use slightly different syntax), so if you convert to DFA and they're non-equivalent it was my mistake in translation. Hope this helps and lmk if you have any more issues!

EDIT: Now that I think about it, if you're checking for correct behavior on non-equality, it's probably easier to just play around with two small regexes (since they're unequal most of the time).

caleb531 commented 1 year ago

@eliotwrobson Eureka! I fixed it!

The problem was that I did not correctly refactor is_final_state(). Because the values in lambda_closures are always sets, the logic was effectively checking if some_set in automaton.final_states, which will always be False.

Here's the final version, just committed:

diff --git a/automata/fa/nfa.py b/automata/fa/nfa.py
index 4c06284..c3065de 100644
--- a/automata/fa/nfa.py
+++ b/automata/fa/nfa.py
@@ -536,14 +536,18 @@ def __eq__(self, other):
             return NotImplemented

         # Get new initial states
-        initial_state_a = (frozenset({self.initial_state}), OriginEnum.SELF)
-        initial_state_b = (frozenset({other.initial_state}), OriginEnum.OTHER)
+        initial_state_a = (self.lambda_closures[self.initial_state], OriginEnum.SELF)
+        initial_state_b = (other.lambda_closures[other.initial_state], OriginEnum.OTHER)

         def is_final_state(states_pair):
             states, origin_enum = states_pair
             # If at least one of the current states is a final state, the
             # condition should satisfy
-            return bool(origin_automata[origin_enum].final_states - states)
+            for state in states:
+                automaton = origin_automata[origin_enum]
+                if len(automaton.final_states - automaton.lambda_closures[state]) > 0:
+                    return True
+            return False

         def transition(states_pair, symbol):
             states, origin_enum = states_pair

eliotwrobson commented 1 year ago

@caleb531 Fantastic! To your questions from before: the NFA-native subset function can be implemented in the same way as the DFA subset function, since the logic in that DFA function relies on the product construction, and the PDF I linked outlines how to compute the product construction for NFAs (which requires special handling of epsilon transitions). The state-checking logic should be the same; the only real new piece needed is implementing the BFS of the product construction.
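
As a sketch of that missing piece (epsilon-free NFAs only; the PDF covers the epsilon handling):

from collections import deque

def product_reachable_pairs(nfa_a, nfa_b):
    # BFS over the product construction of two epsilon-free NFAs,
    # yielding every reachable pair of states
    start = (nfa_a.initial_state, nfa_b.initial_state)
    visited = {start}
    queue = deque([start])
    while queue:
        qa, qb = queue.popleft()
        yield qa, qb
        for symbol in nfa_a.input_symbols & nfa_b.input_symbols:
            for ra in nfa_a.transitions.get(qa, {}).get(symbol, set()):
                for rb in nfa_b.transitions.get(qb, {}).get(symbol, set()):
                    if (ra, rb) not in visited:
                        visited.add((ra, rb))
                        queue.append((ra, rb))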

Right now, I'm trying to implement the algorithm to construct a minimal DFA directly from a finite language (found a description in a book). I can take a look at that subset function later, but you're welcome to take that on if you're feeling up to it right now.

caleb531 commented 1 year ago

@eliotwrobson Got it, that makes sense. I'll look at expanding the DFA/NFA operations a bit later. Next up, I'm going to switch my attention to the PDA and TM classes, whose implementations and tests are still not refactored for immutability.

eliotwrobson commented 1 year ago

@caleb531 Just to follow up (since I couldn't get a fast native subset algorithm for NFAs working), there appears to be one described here: http://www.lsv.fr/~doyen/papers/antichains-universality-cav06.pdf

It seems fairly complicated, but there is another implementation with source code here: https://github.com/thorstent/Limi. Might be worth a shot. I don't think I'll have time to take a crack at it for a little while (school is starting to heat up again), but thought it was worth mentioning.

Tagl commented 1 year ago

I think it's perfectly reasonable to start out with worse-than-optimal algorithms in the library and slowly improve in cases like these. Working slowly is better than not working.

Based on the benchmarks in the GitHub repo, this algorithm may not necessarily perform better than the others.

eliotwrobson commented 1 year ago

@Tagl in this case, the subset check can be done by converting to DFAs and using the algorithm there. My point is just that if we have a native NFA method for this, it should do something better than just calling other library functions.

Tagl commented 1 year ago

I think this is enough; I can take it from here. Thank you!

Helper functions for construction of possibly infinite languages such as DFA.that_contains_substring and DFA.that_contains_subsequence would also be useful I think.

eliotwrobson commented 1 year ago

Yes, I think adding a few standard constructions to the library would be good. In addition to a contains_substring and contains_subsequence, something like a count_symbol and nth_from_end would be useful. There are a few others in that Mathematica package I linked; it's probably a good idea to copy those (since the constructions themselves are straightforward).
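
As a rough sketch of the contains-substring construction (a standalone helper using the standard failure-function idea, not library code):

def contains_substring_transitions(target, input_symbols):
    # State k means "the last k characters read match target[:k]";
    # state len(target) is accepting and absorbing.
    accept = len(target)
    transitions = {}
    for state in range(accept + 1):
        transitions[state] = {}
        for symbol in input_symbols:
            if state == accept:
                transitions[state][symbol] = accept
                continue
            # Longest k with target[:k] a suffix of the text just read
            candidate = target[:state] + symbol
            k = min(len(candidate), accept)
            while k > 0 and target[:k] != candidate[len(candidate) - k:]:
                k -= 1
            transitions[state][symbol] = k
    return transitions  # initial state 0, final states {accept}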

caleb531 commented 1 year ago

@eliotwrobson @Tagl Sounds fine to me. And as far as the names go, DFA.contains_substring and DFA.contains_subsequence would be my choices.

eliotwrobson commented 1 year ago

@Tagl Do you happen to have the code for any of these (or any other interesting) constructions laying around? I can make a PR for them later but if you have them on hand that would be great.

Tagl commented 1 year ago

@Tagl Do you happen to have the code for any of these (or any other interesting) constructions laying around? I can make a PR for them later but if you have them on hand that would be great.

I made a draft pull request with the ones I have (see #83)

eliotwrobson commented 1 year ago

After some more googling, it looks like there's a more recent paper on the language inclusion problem for NFAs that has more detailed pseudocode: https://user.it.uu.se/~parosh/publications/papers/tacas10.pdf

I'll see when I can take a crack at implementing this. Some tricky definitions but seems to be faster and more refined than the other one I linked.

EDIT: So looking at this algorithm, I'm realizing that it's too complicated to implement efficiently, but it just occurred to me that we can take the union of the two input NFAs and check it for equality against one of the components (A is a subset of B exactly when the union of A and B equals B) 🤦🏽 🤦🏽 @caleb531 we should be able to add in this algorithm once the bug in the NFA equality is fixed.

caleb531 commented 1 year ago

@eliotwrobson @Tagl Since the Immutability work itself is already complete, I decided to merge the immutability branch into develop. In accordance with this, I've switched the target branches of both your open PRs from immutability to develop.

I've also deleted the immutability branch, so develop is back to being the correct target branch for PRs going forward.

Tagl commented 1 year ago

However, I actually think that an interesting hash function respecting this could be developed without too much trouble.

@eliotwrobson Did you have an idea for this?

eliotwrobson commented 1 year ago

@Tagl since we just want to approximate the set contained, we can just hash a frozendict of words accepted up to a certain fixed length.
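
Something like this sketch (a frozenset of accepted strings, exponential in the length bound, so it must stay small):

from itertools import product

def language_hash(fa, max_length=6):
    # Equivalent automata over the same alphabet accept the same
    # strings, so they produce equal fingerprints (and equal hashes)
    accepted = frozenset(
        ''.join(word)
        for n in range(max_length + 1)
        for word in product(sorted(fa.input_symbols), repeat=n)
        if fa.accepts_input(''.join(word))
    )
    return hash(accepted)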

Tagl commented 1 year ago

@Tagl since we just want to approximate the set contained, we can just hash a frozendict of words accepted up to a certain fixed length.

That was my naive idea as well. However this will result in a lot of collisions for languages that don't have any short words. Perhaps the successors generator (#92) would work. I'm not sure which is better. It's a good start though that we can improve on.

caleb531 commented 1 year ago

Closing this issue since immutability has been finalized and is part of the recently-released v7 of the library. Any further discussion of a potential __hash__ function for DFAs/NFAs should probably happen in a new (separate) issue that links back to this one.