Multiline pattern generation

curious-odd-man commented 3 years ago

Here is a sample of what could be generated for the pattern ^a with MULTILINE=True/False.

When MULTILINE is FALSE

Generated values = [a]

When MULTILINE is TRUE

Generated values = [a, \na, \n\na, ...]

Note, however:

For pattern x$^y the value x\ny does not match, whereas for pattern x$\n^y the same value does match.
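The anchor behaviour described above can be checked with a short sketch. It uses Python's `re` module purely as an illustration (an assumption on my side; RgxGen is a Java library and its generation API is not shown here), and the helper name `found` is hypothetical.

```python
import re

def found(pattern: str, text: str, multiline: bool) -> bool:
    """Return True if `pattern` matches somewhere in `text`."""
    flags = re.MULTILINE if multiline else 0
    return re.search(pattern, text, flags) is not None

samples = ["a", "\na", "\n\na"]

# MULTILINE off: ^ anchors only at the very start of the input.
print([found(r"^a", s, multiline=False) for s in samples])  # -> [True, False, False]

# MULTILINE on: ^ also anchors right after each newline.
print([found(r"^a", s, multiline=True) for s in samples])   # -> [True, True, True]

# x$^y needs the end of a line and the start of a line at the same
# position, which "x\ny" does not offer; x$\n^y consumes the newline
# that sits between the two anchors.
print(found(r"x$^y", "x\ny", multiline=True))    # -> False
print(found(r"x$\n^y", "x\ny", multiline=True))  # -> True
```

The sketch only verifies which texts a regex engine accepts; how a generator should choose the number of leading newlines is left open.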

TODO:

[ ] Implement unit tests
[ ] Create configuration option to turn on multiline generation - off by default
[ ] Create configuration option for the newline separator - system default, by default.
[ ] Implement text generation when multiline flag is ON.
[ ] [OPTIONAL] if #56 is already resolved - switch multiline based on the m flag.

Feature initially requested by @spacether in #57

curious-odd-man commented 3 years ago

Hello @spacether!

I've been thinking about the feature you requested and I would like to get more of your input on it.

Do you have some specific use case where you require this feature or is it just an idea?

I am not sure that I should add this feature.

When I started this library, my idea was to generate matching/non-matching texts using only characters that are present in the pattern. E.g. for pattern ^abc, generate only the text abc (in contrast to [abc, abcd, abcde, ...]). By the logic that you've described, I would need to prepend a newline character that is not in the pattern initially (similar to the ^abc example above), which is not consistent with the initial idea.

Besides, without the multiline flag the pattern \n^a cannot match anything.
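A quick way to see this, again sketched with Python's `re` module as a stand-in regex engine (an assumption; it is not RgxGen code, and the candidate texts are made up for illustration):

```python
import re

candidates = ["a", "\na", "\n\na", "a\na"]

# Without MULTILINE, ^ can only match at position 0, but \n has to
# consume a character first, so the pattern can never succeed.
no_flag = [c for c in candidates if re.search(r"\n^a", c)]

# With MULTILINE, ^ also matches right after the consumed newline.
with_flag = [c for c in candidates if re.search(r"\n^a", c, re.MULTILINE)]

print(no_flag)    # -> []
print(with_flag)  # -> ['\na', '\n\na', 'a\na']
```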

Currently it is always allowed to put newlines in RgxGen patterns - this means that all generated patterns are always multiline. If I implement a separate flag for multiline, I will have to add unnecessary complications for the case when multiline is OFF and there are newlines in the pattern.

So in general I don't feel like I should add a flag for multiline and generate characters that are not in the pattern, both to stay consistent with the initial idea and to keep things simple overall.

Please let me know what you think, and whether you have a specific use case where you need to explicitly prohibit multiline generation and/or allow multiline without mentioning the line separator character.

spacether commented 3 years ago

Hey there. Our users have not asked for this feature yet. With time I expect the request to come up. Doesn't the same logic apply to regex parenthesized group matching? Not all of the regex will be in the group match, so for a(bcd), a is necessary and the group match is bcd?

Given enough time, I expect some users will definitely want to generate multiline = False regexes. What do you think? What if we kept the ticket open or closed and only implemented it if there were a certain number of plus 1 emojis on it?

One use case could be string validation of single line input data like first name, last name, address line 1 etc where newline characters should not be allowed. All use cases that I can imagine involve the presence or absence of the newline character because this flag is about multilines.

curious-odd-man commented 3 years ago

For the use case that you described, my understanding is that users will need 2 regexes.

For positive case: ^\w+$
For negative case, either:
1. ^\W*$ with multiline = true, or
2. \n+^\W*$\n+ - at the current state (no multiline flag support)

I believe the first negative case is not suitable, because the generated text might not contain newlines and thus probably does not cover all possible cases. The second negative case is better, as it explicitly requires leading/trailing newline characters.

Besides that, to test this case properly I would use several negative patterns:

1. No newlines, but incorrect characters
2. Correct characters with leading newlines
3. Correct characters with trailing newlines

This composition of negative patterns gives better coverage and makes errors easier to trace. In any case, I don't see how a multiline flag fits here or how it can help. Please correct me if I'm wrong :)
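To make the three negative categories concrete, here is a minimal sketch of a single-line validator checked against one sample per category. It is written with Python's `re` module as an assumption (not RgxGen code); the name `is_valid` and the sample values are hypothetical and hand-written rather than generated.

```python
import re

# Positive pattern from the discussion above.
SINGLE_LINE = re.compile(r"^\w+$")

def is_valid(value: str) -> bool:
    # fullmatch requires the whole input to match, so a trailing
    # newline is rejected even though $ can match just before it.
    return SINGLE_LINE.fullmatch(value) is not None

negative_cases = {
    "no newlines, incorrect characters": "a-b",
    "correct characters, leading newline": "\nabc",
    "correct characters, trailing newline": "abc\n",
}

print(is_valid("abc"))  # -> True
for label, value in negative_cases.items():
    print(label, is_valid(value))  # each prints False
```

Keeping one sample per category means a failing check points directly at the kind of input the validator mishandles.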

As for your question about the a(bcd) pattern: this pattern, the same as a(b)cd, abcd and any other variation, will produce the same result - only the text abcd. So the group in this pattern does not have any effect at all.
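The point about groups can be illustrated from the matching side with a small sketch, assuming Python's `re` module as the regex engine (not RgxGen itself): all three variants accept exactly the text abcd, and only the captured groups differ.

```python
import re

patterns = [r"a(bcd)", r"a(b)cd", r"abcd"]

# For each pattern, record the full matched text and the captured groups.
results = [
    (m.group(0), m.groups())
    for m in (re.fullmatch(p, "abcd") for p in patterns)
]

for pattern, (text, groups) in zip(patterns, results):
    print(pattern, text, groups)
# -> a(bcd) abcd ('bcd',)
# -> a(b)cd abcd ('b',)
# -> abcd abcd ()
```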

I will keep this ticket open and will probably implement it some time later. I will treat it as lowest priority for now.

Let me know if there turns out to be demand for this feature.

curious-odd-man / RgxGen

Multiline pattern generation #58