ga4gh-discovery / data-connect

Standard for describing and searching biomedical data developed by the Global Alliance for Genomics & Health.
Apache License 2.0
24 stars 14 forks source link

Revisit regular expression functions #134

Open kemp-verily opened 3 years ago

kemp-verily commented 3 years ago

As noted by @jfuerth , these functions are very useful for "just in time" harmonization.

We may want to look at:

jfuerth commented 3 years ago

Here is an example of a UDF in MySQL that polyfills json_extract_scalar (not relevant to regex, but illustrates the concept):

create function json_extract_scalar(json_doc JSON, json_path TEXT)
returns VARCHAR(512)
deterministic
return json_unquote(json_extract(json_doc, json_path));

Then:

select json_extract_scalar(CAST('{"hello":"yes"}' AS JSON), '$');       --returns '{"hello": "yes"}'
select json_extract_scalar(CAST('{"hello":"yes"}' AS JSON), '$.hello'); --returns 'yes'
NehaAr commented 1 year ago

If it is related to the optimisation of SQL queriesfor Regualr expression. We can try and improve the indexing scheme of the tables like using multigram indexes which optimise the search operation

jfuerth commented 1 year ago

Thanks Neha. We have also had success using n-gram indexes to speed up pattern matching in a few of our implementations.

I think the original reason for eliminating regular expression functions from the SQL grammar in the spec was to reduce the barrier to new implementations on various database platforms (MySQL, MS SQL Server, Oracle, various cloud databases, and so on).

I'm curious: would a standardized requirement for JSON functions in Data Connect help with your use case?