asg017 / sqlite-regex

A fast regular expression SQLite extension, written in Rust
Apache License 2.0
166 stars 7 forks source link

Is there support for capturing groups? #8

Open greatvovan opened 1 year ago

greatvovan commented 1 year ago

How can I extract the second group from aa-bb-c-dddd-eeeee defined as (\w+)-(\w+)-(\w+)-(\w+)-(\w+) (nothing, if no match)?

greatvovan commented 1 year ago

Tried regex_find_all() but got:

sqlite> select * from regex_find_all('\w+', 'aaa-bb-c-dddd-eeeee') where rowid = 2;
thread '<unnamed>' panicked at 'not yet implemented', src/find_all.rs:79:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
Abort trap: 6
asg017 commented 1 year ago

My apologies for the panic you got with WHERE rowid = 2, it has now been fixed in v0.2.2.

sqlite> select * from regex_find_all('\w+', 'aaa-bb-c-dddd-eeeee') where rowid = 2;
┌───────┬─────┬───────┐
│ start │ end │ match │
├───────┼─────┼───────┤
│ 7     │ 8   │ c     │
└───────┴─────┴───────┘

But for capturing groups specifically, there isn't a great way. I want to add a new select * from regex_captures table function where you'll get all the capture groups and their names/positions, which is being tracked in #1

greatvovan commented 1 year ago

Great. Any ETA on this by chance? Also curious why isn't it a great way – is it due to poor performance, poor readability or other?

asg017 commented 1 year ago

I'll give it a shot in the next week or so!

I say "it's not a great way" mostly because it's awkward and not really readable. For single values like 'aaa-bb-c-dddd-eeeee' it's fine, but it gets weirder on multiple values, like so:

with strings as (
  select value
  from json_each('[
    "aaa-bb-c-dddd-eeeee",
    "jjj-k-lll-mmmm-nnnn",
    "only-two"
  ]')
)
select * 
from strings
join regex_find_all(regex('\w+'), value) as parts 
where parts.rowid = 2;
/*
┌─────────────────────┬───────┬─────┬───────┐
│        value        │ start │ end │ match │
├─────────────────────┼───────┼─────┼───────┤
│ aaa-bb-c-dddd-eeeee │ 7     │ 8   │ c     │
│ jjj-k-lll-mmmm-nnnn │ 6     │ 9   │ lll   │
└─────────────────────┴───────┴─────┴───────┘
*/

First issue: There's a rowid = 2 in the WHERE clause, so you would think that this query only returns one row. But it doesn't, since the regex_find_all table function is called multiple times and returns multiple rows with the same rowid. Again, it works, but just awkward.

The second issue: For the "only-two" value, no row is returned. This may be what you want, but I'd prefer to see a NULL rather than a missing row for these types of cases.

All this to say, when capture group support is added, you'd be able to do something like:

with strings as (
  select value
  from json_each('[
    "aaa-bb-c-dddd-eeeee",
    "jjj-k-lll-mmmm-nnnn",
    "only-two"
  ]')
)
select 
  value,
  regex_capture(
    '\w+-\w+-(?P<third_word>[^-]+).*', 
    value, 
    'third_word'
  ) as third_word
from strings;

Which, in my opinion, is much cleaner and easier to reason about.

Re performance: They should be both equal, but one important note when using regex table function like regex_find_all is the wrap the regex() function around patterns. This enables caching, which is tricky for table functions, but automatically done for scalar functions like regexp() or regex_find().

asg017 commented 1 year ago

Hey @greatvovan , I just pushed capture group support with the new regex_capture() and regex_captures() functions. It's available as of v0.2.3-alpha.3.

Using your ID example, here's how it would work:

create table items as 
  select value as code
  from json_each('["ID123Y2023ABC", "ID456Y2022ABC", "ID789Y1984ABC"]')
;

select 
  items.code,
  regex_capture(
    regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
    items.code,
    'id'
  ) as id,
  regex_capture(
    regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
    items.code,
    'year'
  ) as year
from items;

There's also the regex_captures() table function, which is only really good if you want to extract multiple matches from the same thing, but might be useful:

select
  rowid,
  regex_capture(captures, 'id') as id,
  regex_capture(captures, 'year') as year
from regex_captures(
  regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
  "ID123Y2023ABC, ID456Y2022ABC, and ID789Y1984ABC"
);
/*
┌───────┬─────┬──────┐
│ rowid │ id  │ year │
├───────┼─────┼──────┤
│ 0     │ 123 │ 2023 │
│ 1     │ 456 │ 2022 │
│ 2     │ 789 │ 1984 │
└───────┴─────┴──────┘
*/

Due to limitations in the SQLite virtual table interface, we can't return custom column names with a table function, so we can't do something like select id, year from regex_captures("...", "..."). But I'm tracking a possible solution to that in #9