ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0
37.33k stars 6.88k forks source link

Any plans for User-Agent header parsing? #157

Closed Abzac closed 1 year ago

Abzac commented 7 years ago

Hello!

I want to parse vast amount of access logs loaded ClickHouse and extract operating system, browser and device type (mobile/desktop/search crawler bot) on-the-fly.

Do you have any plans to make functions for parsing User-Agent (like functions for url parsing)?

alexey-milovidov commented 7 years ago

Hello.

This is possible: we have a library for user-agent parsing. But I think, we are not going to disclose data for that library: parsing rules, that must be constantly updated to support all new user agents (tons of mobile phones for example).

Abzac commented 7 years ago

OK, that's just fine. It would be great if you'll make some functions for this. Any expectations for the time it will be implemented?

Then I can try to extract parsing rules database from some opensource ua parsing library if it would be possible to use this data in ClickHouse (like internal dictionaries filled from CSV for geobase).

alexey-milovidov commented 7 years ago

There is outdated uatraits library for user agent parsing: https://github.com/highpower/uatraits (It was not updated on GitHub for many years and also performance is suboptimal.)

We have another, completely different library, that share data format with uatraits. I am not sure we are going to open-source it, but it is possible.

alexey-milovidov commented 7 years ago

Probably we should add integration with https://github.com/01org/hyperscan instead.

filimonov commented 6 years ago

You can also take a look on that: https://github.com/ua-parser

According to description it's maintained DB of browser detection regular expressions + libs to work with that DB for many programming languages, including cpp ( https://github.com/ua-parser/uap-cpp ).

I just saw that and have plans to test it one day in our project, so for now i have no experience with that and can't provide any testimonials. Looking on the DB structure - detection can be quite good, but relatively slow. May be someone, who need to solve that problem right now - can check it, may be speed will be satisfactory.

alexey-milovidov commented 6 years ago

You can also take a look on that: https://github.com/ua-parser

Application of regexps one by one; using boost::regex: that's extremely slow.

alexey-milovidov commented 5 years ago

Our management has approved usage of Yandex internal fast library for user-agent parsing in ClickHouse. But the task is still out of scope of our plans.

Abzac commented 5 years ago

Sounds amazing!

alexey-milovidov commented 5 years ago

Hyperscan has been added: https://github.com/yandex/ClickHouse/pull/4780

Abzac commented 5 years ago

@alexey-milovidov Wow! Great! But how can it be used to get user agent info from User-Agent string?

Because search function and match filter functions are not very useful in this case. The main problem is you need to have some mapping in memory with UA data to effectively use it. So unfortunately you can not just write down some query like this: select multiSearchAny(URL, ['yandex', 'chrome', 'firefox', 'opera'] from clicks

Number of regexes would be enormously large to pass it every time in query, and also regex may even not contain browser name and vice versa, plus you can not actually get browser version from here.

alexey-milovidov commented 5 years ago

There is multiMatchAnyIndex function that will return an index of matching regexp in array of regexps (starting from 1; or zero if no regexp matched).

Yes, I understand that amount of regexps will be huge for User-Agents. They also are very ambiguous (MSIE, Mozilla, like Gecko...) Maybe this function is not applicable and I have not tried.

So, I keep this task open.

alexey-milovidov commented 4 years ago

This task is assigned to external contributor Mikhail Filitov (@lll-phill-lll).

alexey-milovidov commented 4 years ago

The library for User-Agent parsing was published in #10822
But the task is in progress (now you can see just the exported sources of the library and it cannot be used).

taythebot commented 4 years ago

The library for User-Agent parsing was published in #10822 But the task is in progress (now you can see just the exported sources of the library and it cannot be used).

Will this be available in the next release?

alexey-milovidov commented 4 years ago

Most likely not in the next release but in one of the releases this summer.

The code is available, it is working fine, but the pull request is not ready to be merged - we have to perform multiple cleanups to make it ready.

filimonov commented 4 years ago

11160

rikonor commented 4 years ago

Would also like to express our interest in this capability. Any idea if/when this will get more attention? Thanks!

alexey-milovidov commented 4 years ago

We just open-sourced our internal library for User-Agent parsing and checked that it can work with ClickHouse in a prototype. There is still many things to do to make it production ready. It is not in our nearest plans, but we consider it to be done in 2020.

rikonor commented 4 years ago

Awesome. Thanks @alexey-milovidov, we'll be eagerly waiting!

alexey-milovidov commented 1 year ago

The feature has been added in version 23.1 and finally, here are the tests proving the possibility to use it for User-Agent parsing: https://github.com/ClickHouse/ClickHouse/pull/45631

alexey-milovidov commented 1 year ago

Done.