Open · vatjujar opened this issue 7 years ago
Here is one of mine. For log lines like this:

```
some-log-type: source-host-name 2017-07-01 00:00:01 - {"foo":1,"bar":2}
```

I set up a Glue custom classifier with:

Grok pattern:

```
%{OURLOGWITHJSON}
```

Custom patterns:

```
OURTIMESTAMP (%{TIMESTAMP_ISO8601}|%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME})
OURWORDWITHDASHES \b[\w-]+\b
OURLOGSTART %{OURWORDWITHDASHES:ourevent}:? %{SYSLOGHOST:logsource}( %{POSINT:pid})? %{OURTIMESTAMP:ourtimestamp}
GREEDYJSON (\{.*\})
OURLOGWITHJSON ^%{OURLOGSTART}( - )?[^{]+%{GREEDYJSON:json}$
```
(Note that Logstash works with `GREEDYJSON ({.*})`, but Glue's Grok parser rejects that.)
and I get rows with four fields:

```
ourevent:     some-log-type
logsource:    source-host-name
ourtimestamp: 2017-07-01 00:00:01
json:         {"foo":1,"bar":2}
```
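In case it helps anyone reproduce this: as far as I know, the same classifier can also be created programmatically with boto3. A minimal sketch, where the classifier name and classification label are placeholders and the patterns are the ones above:

```python
import boto3

glue = boto3.client("glue")

# All names here are placeholders; the patterns are the ones shown above.
custom_patterns = "\n".join([
    r"OURTIMESTAMP (%{TIMESTAMP_ISO8601}|%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME})",
    r"OURWORDWITHDASHES \b[\w-]+\b",
    r"OURLOGSTART %{OURWORDWITHDASHES:ourevent}:? %{SYSLOGHOST:logsource}( %{POSINT:pid})? %{OURTIMESTAMP:ourtimestamp}",
    r"GREEDYJSON (\{.*\})",
    r"OURLOGWITHJSON ^%{OURLOGSTART}( - )?[^{]+%{GREEDYJSON:json}$",
])

glue.create_classifier(
    GrokClassifier={
        "Name": "our-log-with-json",   # placeholder classifier name
        "Classification": "ourlog",    # placeholder classification label
        "GrokPattern": "%{OURLOGWITHJSON}",
        "CustomPatterns": custom_patterns,
    }
)
```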
The Grok patterns are a bit more complicated than the minimum needed to match that example: in particular, the colon after "some-log-type" is optional, the ' - ' may or may not be present, and the timestamp may instead be in ISO8601 format.
I updated the text above, so the backslashes are now correctly shown in the GREEDYJSON pattern. (The text previously elided the backslashes in front of the braces in the GREEDYJSON pattern; I needed to add those for Glue's Grok parser to accept the pattern.)
I got the same issue as @vatjujar.
We are also experiencing the same issue while trying to parse Apache-style log lines: everything works perfectly in online Grok debuggers, but manually running a crawler shows nothing. A more detailed example would be greatly appreciated!
I have given it many tries, but it's not working: all my Grok patterns work well in the Grok debugger, but not in AWS Glue.
I tried writing a pattern for a single-quoted, semi-JSON data file, and it works in the debugger, but not in Glue. Any help is much appreciated!
As shown above, I had to include backslashes before the brace characters (see GREEDYJSON) to get it to match the JSON part of my log lines into a string field named json, which I later unbox in a Glue script like this:

```
unbox5 = Unbox.apply(frame = priorframe4, path = "json", format = "json", transformation_ctx = "unbox5")
```
The backslashes weren't necessary in the online Grok debugger or in Logstash, but were necessary in Glue's Grok patterns. Dunno if that's your issue or not, but you might try throwing around some backslashes to see if it helps!
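For context, here is a minimal sketch of how that Unbox step fits into a full Glue job; the database and table names are hypothetical placeholders for whatever your crawler created:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Unbox

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog database/table written by the crawler.
priorframe4 = glueContext.create_dynamic_frame.from_catalog(
    database="our_logs_db", table_name="our_log_table")

# Parse the grok-captured "json" string field into a nested structure.
unbox5 = Unbox.apply(frame=priorframe4, path="json", format="json",
                     transformation_ctx="unbox5")
unbox5.printSchema()
```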
I just ran into this same issue. The problem was that in order to test an updated classifier, you need to create a whole new crawler. Simply updating the classifier and rerunning the crawler will NOT result in the updated classifier being used. This is not intuitive at all and lacks documentation in relevant places.
The only place this is explicitly mentioned (that I found) is in https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html - "To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier"
This nugget of information needs to be added to every other place custom classifiers are documented in bold capital letters.
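In case it saves someone a trip to the console: recreating the crawler can also be scripted. A rough sketch with boto3, where every name, the IAM role, and the S3 path are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Placeholders throughout: crawler names, role ARN, database, S3 path.
glue.delete_crawler(Name="old-log-crawler")
glue.create_crawler(
    Name="log-crawler-v2",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="our_logs_db",
    Targets={"S3Targets": [{"Path": "s3://our-bucket/logs/"}]},
    Classifiers=["our-log-with-json"],  # the updated custom classifier
)
glue.start_crawler(Name="log-crawler-v2")
```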
I have a lot of text files in our S3 under different folders, with data in columnar sections. The automatic crawler does not recognize the schema in those files. How do we set up a custom crawler for text files with columnar data?
How do we set up a crawler on S3 buckets with "ini" file formats?
Thank you @vkubushyn, you saved me some time. I faced the same issue here.

In addition:
I'd like to see an example of a custom classifier that is proven to work with custom data. The reason for the request is the headache I've had trying to write my own; my efforts simply do not work. My code (and patterns) work perfectly in online Grok debuggers, but they do not work in AWS. I do not get any errors in the logs either; my data simply does not get classified, and table schemas are not created.
So, the classifier example should include a custom file to classify, maybe a log file of some sort. The file itself should include various types of information so that the example demonstrates various pattern matches. The example should then present the classifier rule, and maybe even include a custom keyword to demonstrate that as well. A deliberate mistake should also be demoed (in both the input data and the patterns), along with how to debug this situation in AWS (one way to start on that is sketched below).
Thanks in advance!
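For the debugging part: one low-tech check is to ask the Data Catalog what, if anything, the crawler produced; when a classifier never matches, in my experience the table either isn't created or ends up with classification UNKNOWN. A small boto3 sketch (the database name is a placeholder):

```python
import boto3

glue = boto3.client("glue")

# Placeholder database name; use the one your crawler writes to.
for table in glue.get_tables(DatabaseName="our_logs_db")["TableList"]:
    params = table.get("Parameters", {})
    print(table["Name"], params.get("classification", "<none>"))
```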