Closed Abzac closed 1 year ago
Hello.
This is possible: we have a library for user-agent parsing. But I think, we are not going to disclose data for that library: parsing rules, that must be constantly updated to support all new user agents (tons of mobile phones for example).
OK, that's just fine. It would be great if you'll make some functions for this. Any expectations for the time it will be implemented?
Then I can try to extract parsing rules database from some opensource ua parsing library if it would be possible to use this data in ClickHouse (like internal dictionaries filled from CSV for geobase).
There is outdated uatraits
library for user agent parsing: https://github.com/highpower/uatraits
(It was not updated on GitHub for many years and also performance is suboptimal.)
We have another, completely different library, that share data format with uatraits
.
I am not sure we are going to open-source it, but it is possible.
Probably we should add integration with https://github.com/01org/hyperscan instead.
You can also take a look on that: https://github.com/ua-parser
According to description it's maintained DB of browser detection regular expressions + libs to work with that DB for many programming languages, including cpp ( https://github.com/ua-parser/uap-cpp ).
I just saw that and have plans to test it one day in our project, so for now i have no experience with that and can't provide any testimonials. Looking on the DB structure - detection can be quite good, but relatively slow. May be someone, who need to solve that problem right now - can check it, may be speed will be satisfactory.
You can also take a look on that: https://github.com/ua-parser
Application of regexps one by one; using boost::regex: that's extremely slow.
Our management has approved usage of Yandex internal fast library for user-agent parsing in ClickHouse. But the task is still out of scope of our plans.
Sounds amazing!
Hyperscan has been added: https://github.com/yandex/ClickHouse/pull/4780
@alexey-milovidov Wow! Great! But how can it be used to get user agent info from User-Agent string?
Because search function and match filter functions are not very useful in this case. The main problem is you need to have some mapping in memory with UA data to effectively use it. So unfortunately you can not just write down some query like this:
select multiSearchAny(URL, ['yandex', 'chrome', 'firefox', 'opera'] from clicks
Number of regexes would be enormously large to pass it every time in query, and also regex may even not contain browser name and vice versa, plus you can not actually get browser version from here.
There is multiMatchAnyIndex
function that will return an index of matching regexp in array of regexps (starting from 1; or zero if no regexp matched).
Yes, I understand that amount of regexps will be huge for User-Agents. They also are very ambiguous (MSIE, Mozilla, like Gecko...) Maybe this function is not applicable and I have not tried.
So, I keep this task open.
This task is assigned to external contributor Mikhail Filitov (@lll-phill-lll).
The library for User-Agent parsing was published in #10822
But the task is in progress (now you can see just the exported sources of the library and it cannot be used).
The library for User-Agent parsing was published in #10822 But the task is in progress (now you can see just the exported sources of the library and it cannot be used).
Will this be available in the next release?
Most likely not in the next release but in one of the releases this summer.
The code is available, it is working fine, but the pull request is not ready to be merged - we have to perform multiple cleanups to make it ready.
Would also like to express our interest in this capability. Any idea if/when this will get more attention? Thanks!
We just open-sourced our internal library for User-Agent parsing and checked that it can work with ClickHouse in a prototype. There is still many things to do to make it production ready. It is not in our nearest plans, but we consider it to be done in 2020.
Awesome. Thanks @alexey-milovidov, we'll be eagerly waiting!
The feature has been added in version 23.1 and finally, here are the tests proving the possibility to use it for User-Agent parsing: https://github.com/ClickHouse/ClickHouse/pull/45631
Done.
Hello!
I want to parse vast amount of access logs loaded ClickHouse and extract operating system, browser and device type (mobile/desktop/search crawler bot) on-the-fly.
Do you have any plans to make functions for parsing User-Agent (like functions for url parsing)?