apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Regex parser should have the option to 'skip' some number of header rows like the CSV parser can #8583

Open vogievetsky opened 5 years ago

vogievetsky commented 5 years ago

This would be very useful for ingesting data that starts with some form of header, such as the data seen in https://github.com/apache/incubator-druid/issues/8555.
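To make the request concrete, here is a minimal sketch of the behavior being asked for. It is not Druid code: the class `SkipHeaderRegex`, the helper `parse`, and the `skipHeaderRows` parameter are hypothetical names used only for illustration. The idea is to drop a fixed number of leading lines before applying the regex to the remaining ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipHeaderRegex {
    // Hypothetical helper (not Druid's API): ignores the first
    // skipHeaderRows lines, then extracts the capture groups of each
    // remaining line with the given pattern.
    static List<List<String>> parse(List<String> lines, Pattern pattern, int skipHeaderRows) {
        if (skipHeaderRows < 0) {
            throw new IllegalArgumentException("skipHeaderRows must be >= 0");
        }
        List<List<String>> rows = new ArrayList<>();
        for (int i = skipHeaderRows; i < lines.size(); i++) {
            Matcher m = pattern.matcher(lines.get(i));
            if (!m.matches()) {
                throw new IllegalArgumentException("Line " + i + " does not match: " + lines.get(i));
            }
            List<String> row = new ArrayList<>();
            for (int g = 1; g <= m.groupCount(); g++) {
                row.add(m.group(g));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two header lines that the data pattern cannot match.
        List<String> lines = Arrays.asList(
            "# exported report",
            "time|value",
            "2019-01-01|10",
            "2019-01-02|20");
        Pattern pattern = Pattern.compile("(\\d{4}-\\d{2}-\\d{2})\\|(\\d+)");
        System.out.println(parse(lines, pattern, 2));
        // prints [[2019-01-01, 10], [2019-01-02, 20]]
    }
}
```

Without the skip, the first header line would fail to match and the whole input would be rejected, which is the situation this issue describes.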

ntantri commented 5 years ago

Hi,

I would like to work on this issue.

vogievetsky commented 5 years ago

Awesome!

vogievetsky commented 5 years ago

@tan31989 are you still interested in working on this?

ntantri commented 5 years ago

@vogievetsky yes, I'm trying to figure out the required changes. I'm a bit stuck working out how the linked issue relates to this one.

ntantri commented 4 years ago

@vogievetsky I have tried a number of approaches, modelled on the CSVParser implementation. Pardon me if this is vague, but I see that the current code uses `if (!matcher.matches()) {}`, which succeeds only when the pattern matches the entire text.

I feel that defeats the purpose of the Regex parser: the pattern is rejected unless it matches the text as a whole. In my opinion, `while (matcher.find()) {}` would fit the use cases better, because it lets us write more flexible regexes.

With `matcher.find()` it is easier to locate a pattern and extract its groups. Writing a regex that must match the entire string always ends up needing a catch-all such as `(.*)`. Many regex variants would fail to match because of this.
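The difference between the two calls can be shown with plain `java.util.regex`. The following is a small illustrative sketch, not the Druid parser code; the class name `MatchesVsFind`, the sample pattern, and the sample input are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVsFind {
    // Hypothetical sample pattern: a key=number pair.
    static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // matches() succeeds only if the pattern covers the whole input.
    static boolean matchesWhole(String line) {
        return PAIR.matcher(line).matches();
    }

    // find() scans the input and reports every occurrence of the pattern.
    static List<String> findAll(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.add(m.group(1) + ":" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "ts=100 extra text count=7";
        System.out.println(matchesWhole(line));     // prints false
        System.out.println(matchesWhole("ts=100")); // prints true
        System.out.println(findAll(line));          // prints [ts:100, count:7]
    }
}
```

To make `matches()` accept the first input, the pattern would have to be padded with catch-alls such as `.*` around each group, which is the workaround described above.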

yashwanth-Thota commented 2 years ago

I would like to work on this