Kotlin / KEEP

Kotlin Evolution and Enhancement Process
Apache License 2.0
3.39k stars 360 forks source link

Extensions to find all occurrences of a pattern in a CharSequence #20

Closed ilya-g closed 3 years ago

ilya-g commented 8 years ago

Discussions about the Extensions to find all occurrences of a pattern in a CharSequence will be held here.

ilya-g commented 8 years ago

We haven't discussed the proposal in details yet, but here's my feedback.

I'm concerned a bit about use cases. While they all look valid, I cannot estimate how much these cases are in demand.

I've tried to search GitHub for the specific patterns listed in alternatives (Regex.findAll and indexOf-while loop), but haven't succeeded to find significant number of cases where the function would fit.

I've also consulted with Sergei Lebedev (@superbobry) who works as a Research Engineer on high-performance biological data analysis projects in JetBrains Biolabs, about the use case with finding DNA subsequences. It occurred that while it is a valid use case, the linear performance achieved by KMP algorithm isn't feasible to perform searches in extremely long sequences. Some indexing approaches like suffix trees are applied to a string being searched in, and not to a pattern string in this case.

Regarding to the alternatives, it's interesting to note that Regex also has a similar optimization for the cases where KMP algorithm shines — it uses Boyer-Moore search algorithm for long string literals. So, I believe it isn't accurate to state "too slow" as a consideration against the alternatives without performing some benchmarks first.

cbruegg commented 8 years ago

I'm concerned a bit about use cases.

Point taken on the bioinformatics use case, that might not have been the best example. I think there is demand though:

Non-Overlapping

Overlapping

Regex also has a similar optimization

I've run some benchmarks and indeed the performance of the Regex is identical to my proposed method using findAll for non-overlapping matching. Unfortunately, there's no way to make a Regex match overlapping occurrences natively, which makes this benchmark print occ: 18 ms, regex: 24695 ms, indexOf: 9661 ms. The longer the pattern becomes, the better performs occurrencesOf in comparison with indexOf and the Regex.

ilya-g commented 8 years ago

Note that the regex use cases you have given most of the time involve searching non-literal patterns, so they cannot be readily replaced with occurrencesOf

cbruegg commented 8 years ago

Right, then there are these for example:

ilya-g commented 8 years ago

I've looked through these SO questions, and still they look rather artificial to me, I can't imagine how it can be useful in real-life cases.

We had an initial discussion about this proposal today and found that while it can be potentially useful, the current use-cases aren't convincing enough to include it into the standard library. Instead we'd like to establish some library with not-so-often-required but still useful text-related functions kotlinx.text. This proposal is a good candidate for inclusion in that library.

cbruegg commented 8 years ago

Sounds like a good compromise to me, thanks for the update. Does the KEEP need an update to discuss its inclusion in kotlinx?

ilya-g notifications@github.com schrieb am Mo., 4. Juli 2016, 17:47:

I've looked through these SO questions, and still they look rather artificial to me, I can't imagine how it can be useful in real-life cases.

We had an initial discussion about this proposal today and found that while it can be potentially useful, the current use-cases aren't convincing enough to include it into the standard library. Instead we'd like to establish some library with not-so-often-required but still useful text-related functions kotlinx.text. This proposal is a good candidate for inclusion in that library.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Kotlin/KEEP/issues/20#issuecomment-230317194, or mute the thread https://github.com/notifications/unsubscribe/AAKtPRfICcZcmGH3Fr2VacfTN6IjBepUks5qSSrAgaJpZM4IryzX .

voddan commented 8 years ago

I would think this what find or findAll function does. Unfortunate find already exists and means first on iterables.

ilya-g commented 3 years ago

Instead we'd like to establish some library with not-so-often-required but still useful text-related functions kotlinx.text

I've opened the issue KT-45445 to collect ideas for such a library, and linked this proposal there. I'm going to close this proposal here because it's no longer about the standard library.