caleb531 / automata

A Python library for simulating finite automata, pushdown automata, and Turing machines
https://caleb531.github.io/automata/
MIT License

Implement Automaton __iter__ for enumerating accepted language? #79

Closed · caleb531 closed this issue 1 year ago

caleb531 commented 1 year ago

@eliotwrobson So with all the recent discussion around automata ultimately being representations of finite languages, it got me thinking:

What if we implemented an __iter__ method that produced each string in that automaton's accepted language? My thought is we could generate one string at a time via yield, and leave it up to the user if they want to convert the automaton to a list or set. Ideally, we could start with DFAs/NFAs, and then explore if a similar algorithm is possible with PDAs and TMs.

How difficult do you think this would be to implement from an algorithmic perspective? Initially, my thought is to generate all possible strings over the machine's input symbols, then filter that set down to only those accepted via accepts_input.

The number of iterations could be enormous, though, so maybe we could filter that set further by analyzing the transition symbols out of the initial state (possible first characters) and into the final states (possible last characters). But I'm curious if there is an approach that isn't so brute-force.
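For concreteness, a minimal sketch of that brute-force filter, bounded by a caller-supplied maximum length since the accepted language may be infinite (accepts_input and input_symbols are the library's existing names):

    from itertools import product

    def brute_force_language(automaton, max_length):
        """Try every string over the input alphabet up to max_length and
        keep only those the automaton accepts. Exponential in max_length."""
        symbols = sorted(automaton.input_symbols)
        for length in range(max_length + 1):
            for chars in product(symbols, repeat=length):
                word = ''.join(chars)
                if automaton.accepts_input(word):
                    yield word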

eliotwrobson commented 1 year ago

@caleb531 for DFA/NFAs, I think there are algorithms for this based on treating the transitions as a graph and doing graph search (or similar algorithms). I'll have to look through some books to see if they explicitly describe something like this, but here's a stackoverflow post describing something similar: https://stackoverflow.com/questions/33058142/optimal-algorithm-to-count-the-number-of-strings-a-dfa-accepts

As for the other automata, I'm not sure about PDAs, but for TMs even checking whether a TM accepts any string (let alone enumerating them all) is undecidable (equivalent to the halting problem).

EDIT: Now that I think about it, this problem boils down to traversing edges of a graph in lexicographic order and outputting the string corresponding to the current path. Then on the next yield statement, compute the next such string (like BFS but no need for a visited set).
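Something like this minimal sketch of that traversal (using the DFA attribute names that appear in the snippets later in this thread):

    from collections import deque

    def enumerate_language(dfa):
        """Yield accepted words shortest-first, and lexicographically within
        each length, via BFS over (state, prefix) pairs. No visited set is
        needed: distinct queue entries spell distinct prefixes. Runs forever
        on an infinite language."""
        queue = deque([(dfa.initial_state, '')])
        while queue:
            state, prefix = queue.popleft()
            if state in dfa.final_states:
                yield prefix
            for symbol in sorted(dfa.input_symbols):
                queue.append((dfa.transitions[state][symbol], prefix + symbol))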

Tagl commented 1 year ago

I think I have some code for this already written. I'll take a look tomorrow.

Tagl commented 1 year ago

Decided to look at it before I go to bed. So this is essentially just a dynamic programming problem. We can define recursive functions and then store the results in memory to cut the time complexity down to $\mathcal{O}(nk)$, at the cost of raising the space complexity to $\mathcal{O}(nk)$, where $n$ is the number of states and $k$ is the word length. Of course, to generate the words instead of just counting them, we have to account for constructing each string as well; this adds at least a factor of $k$ to both the time and space complexity, but the state space remains $nk$. I'll finish the code and optimize after work tomorrow, but here's essentially the logic:

    def count_words_of_length(self, k):
        """
        Counts the number of words of length k in the language represented by the DFA
        """
        return self._count_words_of_length(self.initial_state, k)

    def _count_words_of_length(self, state, k):
        """
        Counts words of length k assuming state is the initial state
        """
        if k == 0:
            return 1 if state in self.final_states else 0
        return sum(self._count_words_of_length(self.transitions[state][symbol], k-1) for symbol in self.input_symbols)

    def words_of_length(self, k):
        """
        Generates all words of length k in the language represented by the DFA
        """
        yield from self._words_of_length(self.initial_state, k)

    def _words_of_length(self, state, k):
        """
        Generator for accepted words of length k assuming state is the initial state
        """
        if k == 0:
            if state in self.final_states:
                yield ''
            return
        # Sort for deterministic (lexicographic) output order, since
        # input_symbols is a set.
        for symbol in sorted(self.input_symbols):
            for word in self._words_of_length(self.transitions[state][symbol], k - 1):
                yield symbol + word

For __iter__ we will need to calculate the maximum word length and iterate until we reach it (for an infinite language we just keep going). A minimum word length to match would be nice to have too, but it's not necessary.
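For example (a sketch; maximum_word_length is a hypothetical helper here, assumed to return math.inf for an infinite language):

    def __iter__(self):
        # Yield accepted words in order of increasing length; for an
        # infinite language this generator never terminates.
        length = 0
        while length <= self.maximum_word_length():
            yield from self.words_of_length(length)
            length += 1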

Should we implement __len__ for the cardinality of the language and raise a ValueError for infinite languages? __len__ is only allowed to return non-negative integers.

eliotwrobson commented 1 year ago

@Tagl Great that you already have a starting implementation for this! The only issue we might run into is efficiency, since recursion in Python is very slow. Since this algorithm needs to store the results anyway, do you think that this could be done iteratively with a dictionary of results?

I think doing the __len__ as the cardinality of the language is a cool idea. I think it may be better to raise a specialized exception in that case. What do you think, @caleb531?

caleb531 commented 1 year ago

I also dig the idea of using __len__ for the cardinality. And I agree with @eliotwrobson's point that a specialized exception should be raised instead of a ValueError.
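Sketching what that could look like (the exception name and the maximum_word_length helper are hypothetical here; count_words_of_length is from the snippet above):

    import math

    class InfiniteLanguageException(Exception):
        """Hypothetical exception raised when the cardinality of an
        infinite language is requested."""

    def __len__(self):
        # __len__ may only return a non-negative int, so an infinite
        # language has to raise instead of returning.
        if self.maximum_word_length() == math.inf:
            raise InfiniteLanguageException('language is infinite')
        return sum(self.count_words_of_length(k)
                   for k in range(self.maximum_word_length() + 1))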

Tagl commented 1 year ago

Yes, converting it to an iterative version is no problem at all, since the transitions of the DP are simple. It was one of the optimizations I had in mind.
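For instance, a sketch of the counting DP done bottom-up with a dictionary per level (assuming the DFA also exposes a states collection):

    def count_words_of_length_iterative(dfa, k):
        """Iterative version of the counting DP: level[state] is the number
        of accepted words of length i starting from state."""
        # Base case (i = 0): only the empty word, accepted from final states.
        level = {state: int(state in dfa.final_states) for state in dfa.states}
        for _ in range(k):
            # One DP step: a word of length i+1 from `state` is one symbol
            # followed by a word of length i from the successor state.
            level = {
                state: sum(level[dfa.transitions[state][symbol]]
                           for symbol in dfa.input_symbols)
                for state in dfa.states
            }
        return level[dfa.initial_state]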

Tagl commented 1 year ago

Thoughts on adding __contains__ to support usage like 'text' in dfa?

caleb531 commented 1 year ago

@Tagl I think it would make more sense for __contains__ to look for the same kind of value that's being yielded by __iter__. But I'm not sure how useful that would be.

Tagl commented 1 year ago

I personally prefer doing word in language rather than language.accepts_input(word)

eliotwrobson commented 1 year ago

@Tagl I agree, I think that word in dfa is a nice shorthand for accepts input (even if a little bit awkward semantically).

caleb531 commented 1 year ago

@Tagl Oh, I see now. Because this whole conversation is about __iter__ yielding/producing those values anyway. So yeah, that would make a lot of sense, actually.

caleb531 commented 1 year ago

@eliotwrobson @Tagl But it will be important that __contains__ makes a direct call to accepts_input rather than enumerating the iterator of words in the accepted language (because the latter could be substantially slower).
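In other words, something as direct as:

    def __contains__(self, word):
        # Direct membership check; never fall back to enumerating the
        # (possibly infinite) language via __iter__.
        return self.accepts_input(word)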

Tagl commented 1 year ago

Please take a look at the implementation. I changed it to be iterative; it calculates the entire level for length $k$ before yielding the first word of that length. There is some repeated code between the max word length computation and the is-finite check. I think I'll keep the logic in the max word length function, and then isfinite will just call it and check whether the return value is inf.
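A minimal sketch of that refactor, using the names from the description above:

    import math

    def isfinite(self):
        # All of the logic lives in maximum_word_length; an infinite
        # language is signalled by a return value of inf.
        return self.maximum_word_length() != math.inf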

Tagl commented 1 year ago

> Ideally, we could start with DFAs/NFAs, and then explore if a similar algorithm is possible with PDAs and TMs.

DPDA we should be able to do. Non-determinism makes things a bit trickier.

I can implement the same kind of logic for NFAs, but the generation will essentially involve partial or full tracking of powersets, like DFA.from_nfa does.
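A rough sketch of that idea (ignoring epsilon transitions for brevity, and assuming each nfa.transitions[state][symbol] is a set of successor states):

    def nfa_words_of_length(nfa, k):
        """Enumerate accepted words of length k by tracking the set of
        currently reachable states on the fly, i.e. a lazy version of the
        powerset construction used by DFA.from_nfa."""
        def helper(states, remaining):
            if remaining == 0:
                if states & nfa.final_states:
                    yield ''
                return
            for symbol in sorted(nfa.input_symbols):
                next_states = set()
                for state in states:
                    next_states |= nfa.transitions[state].get(symbol, set())
                if next_states:
                    for word in helper(next_states, remaining - 1):
                        yield symbol + word
        yield from helper({nfa.initial_state}, k)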

eliotwrobson commented 1 year ago

@Tagl If the NFA string generation involves tracking powersets, I'm not really sure that it's worth the effort to do for NFAs directly, if users can just determinize and use the existing DFA functions. It seems overall pretty impractical.
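That is, users could write something like the following (assuming the __iter__ from this thread has landed):

    from automata.fa.dfa import DFA

    # Determinize once, then reuse the DFA enumeration machinery.
    # Note: this loops forever if the language is infinite.
    dfa = DFA.from_nfa(nfa)
    for word in dfa:
        print(word)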

eliotwrobson commented 1 year ago

@Tagl since this issue has been resolved for FAs but we still want to do this with PDAs, would you be ok with closing this in favor of https://github.com/caleb531/automata/issues/102? I don't think this will get dropped and it makes sense to minimize the number of open issues because of this overlap.

Tagl commented 1 year ago

> @Tagl since this issue has been resolved for FAs but we still want to do this with PDAs, would you be ok with closing this in favor of #102? I don't think this will get dropped and it makes sense to minimize the number of open issues because of this overlap.

Sounds good to me.