Open greatvovan opened 1 year ago
Tried regex_find_all()
but got:
sqlite> select * from regex_find_all('\w+', 'aaa-bb-c-dddd-eeeee') where rowid = 2;
thread '<unnamed>' panicked at 'not yet implemented', src/find_all.rs:79:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
Abort trap: 6
My apologies for the panic you got with WHERE rowid = 2
, it has now been fixed in v0.2.2
.
sqlite> select * from regex_find_all('\w+', 'aaa-bb-c-dddd-eeeee') where rowid = 2;
┌───────┬─────┬───────┐
│ start │ end │ match │
├───────┼─────┼───────┤
│ 7 │ 8 │ c │
└───────┴─────┴───────┘
But for capturing groups specifically, there isn't a great way. I want to add a new select * from regex_captures
table function where you'll get all the capture groups and their names/positions, which is being tracked in #1
Great. Any ETA on this by chance? Also curious why isn't it a great way – is it due to poor performance, poor readability or other?
I'll give it a shot in the next week or so!
I say "it's not a great way" mostly because it's awkward and not really readable. For single values like 'aaa-bb-c-dddd-eeeee'
it's fine, but it gets weirder on multiple values, like so:
with strings as (
select value
from json_each('[
"aaa-bb-c-dddd-eeeee",
"jjj-k-lll-mmmm-nnnn",
"only-two"
]')
)
select *
from strings
join regex_find_all(regex('\w+'), value) as parts
where parts.rowid = 2;
/*
┌─────────────────────┬───────┬─────┬───────┐
│ value │ start │ end │ match │
├─────────────────────┼───────┼─────┼───────┤
│ aaa-bb-c-dddd-eeeee │ 7 │ 8 │ c │
│ jjj-k-lll-mmmm-nnnn │ 6 │ 9 │ lll │
└─────────────────────┴───────┴─────┴───────┘
*/
First issue: There's a rowid = 2
in the WHERE
clause, so you would think that this query only returns one row. But it doesn't, since the regex_find_all
table function is called multiple times and returns multiple rows with the same rowid. Again, it works, but just awkward.
The second issue: For the "only-two"
value, no row is returned. This may be what you want, but I'd prefer to see a NULL
rather than a missing row for these types of cases.
All this to say, when capture group support is added, you'd be able to do something like:
with strings as (
select value
from json_each('[
"aaa-bb-c-dddd-eeeee",
"jjj-k-lll-mmmm-nnnn",
"only-two"
]')
)
select
value,
regex_capture(
'\w+-\w+-(?P<third_word>[^-]+).*',
value,
'third_word'
) as third_word
from strings;
Which, in my opinion, is much cleaner and easier to reason about.
Re performance: They should be both equal, but one important note when using regex table function like regex_find_all
is the wrap the regex()
function around patterns. This enables caching, which is tricky for table functions, but automatically done for scalar functions like regexp()
or regex_find()
.
Hey @greatvovan , I just pushed capture group support with the new regex_capture()
and regex_captures()
functions. It's available as of v0.2.3-alpha.3
.
Using your ID example, here's how it would work:
create table items as
select value as code
from json_each('["ID123Y2023ABC", "ID456Y2022ABC", "ID789Y1984ABC"]')
;
select
items.code,
regex_capture(
regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
items.code,
'id'
) as id,
regex_capture(
regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
items.code,
'year'
) as year
from items;
There's also the regex_captures()
table function, which is only really good if you want to extract multiple matches from the same thing, but might be useful:
select
rowid,
regex_capture(captures, 'id') as id,
regex_capture(captures, 'year') as year
from regex_captures(
regex('ID(?P<id>\d+)Y(?P<year>\d+)ABC'),
"ID123Y2023ABC, ID456Y2022ABC, and ID789Y1984ABC"
);
/*
┌───────┬─────┬──────┐
│ rowid │ id │ year │
├───────┼─────┼──────┤
│ 0 │ 123 │ 2023 │
│ 1 │ 456 │ 2022 │
│ 2 │ 789 │ 1984 │
└───────┴─────┴──────┘
*/
Due to limitations in the SQLite virtual table interface, we can't return custom column names with a table function, so we can't do something like select id, year from regex_captures("...", "...")
. But I'm tracking a possible solution to that in #9
How can I extract the second group from
aa-bb-c-dddd-eeeee
defined as(\w+)-(\w+)-(\w+)-(\w+)-(\w+)
(nothing, if no match)?