gwalashish closed this issue 7 years ago
I believe that what has been extracted does not match the grok pattern you defined.
Can you just try to set in your pipeline a field instead?
{
"set": {
"field": "foo",
"value": "bar"
}
}
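If it helps, you can check that a pipeline actually runs before wiring it into FSCrawler by using the simulate API with the pipeline defined inline (a minimal sketch of the same set processor):

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "foo",
          "value": "bar"
        }
      }
    ]
  },
  "docs": [
    { "_source": { "content": "anything" } }
  ]
}
```

The simulated document in the response should come back with a `foo` field set to `bar`.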
My document contains "10.23.22.22 is the IP address of the server". The grok pattern matches properly when I run the simulation, but with fscrawler it is not working:
PUT _ingest/pipeline/fscrawler
{
"description" : "Testing Grok on PDF upload",
"processors" : [
{
"grok": {
"field": "content",
"patterns": ["%{IP:ip} %{GREEDYDATA}"]
}
}
]
}
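As a sanity check, the same pattern can be run through the simulate API against the exact sentence (a sketch, using the pipeline name defined above):

```json
POST _ingest/pipeline/fscrawler/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "10.23.22.22 is the IP address of the server"
      }
    }
  ]
}
```

If the pattern matches, the response should contain an `ip` field with the value `10.23.22.22`.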
I have made the changes to the pipeline as you mentioned above:
PUT _ingest/pipeline/fscrawler
{
"description" : "Testing Grok on PDF upload",
"processors" : [
{
"set": {
"field": "content",
"value": "Test 123"
}
}
]
}
After doing this, I get an error like "field is a required parameter" when viewing the index in Discover.
Please tell me if I am doing anything wrong.
Can you share what the JSON _source is for your PDF document once it has been parsed by FSCrawler?
GET index/doc/id
It says "found" : false.
Of course. You need to replace index, doc, and id with your index name, the right type, and the right id of the document.
You can find them by running a search.
I tried the same thing,
GET pdf_upload/_search
but it does not give the information we are looking for. If I do the same thing on other indexes, it gives the _id and _type information.
> but it does not give any information which we are looking for

Maybe. I can't tell as I can't see it.
Anyway, can you remove the "pipeline" : "fscrawler"
from your fscrawler settings, try again, and give back the result of the search?
Please share also your full fscrawler config file. And please format it in github as it's more readable.
Thanks.
Even after removing "pipeline" : "fscrawler"
from the settings file it was not working, so I deleted the whole job directory and recreated it. After that, the index got created.
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "pdf_upload",
        "_type": "folder",
        "_id": "824b64ab42d4b63cda6e747e2b80e5",
        "_score": 1,
        "_source": {
          "encoded": "824b64ab42d4b63cda6e747e2b80e5",
          "root": "824b64ab42d4b63cda6e747e2b80e5",
          "real": "/tmp/es"
        }
      },
      {
        "_index": "pdf_upload",
        "_type": "doc",
        "_id": "8c3f1f54665e48419b1a2313dd21624",
        "_score": 1,
        "_source": {
          "content": """
10.21.23.123 is the IP address of the PXE server.
""",
          "meta": {
            "raw": {
              "pdf:PDFVersion": "1.4",
              "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
              "xmp:CreatorTool": "Writer",
              "access_permission:modify_annotations": "true",
              "access_permission:can_print_degraded": "true",
              "meta:creation-date": "2017-06-29T07:16:03Z",
              "created": "Thu Jun 29 12:46:03 IST 2017",
              "access_permission:extract_for_accessibility": "true",
              "access_permission:assemble_document": "true",
              "xmpTPg:NPages": "1",
              "Creation-Date": "2017-06-29T07:16:03Z",
              "dcterms:created": "2017-06-29T07:16:03Z",
              "dc:format": "application/pdf; version=1.4",
              "access_permission:extract_content": "true",
              "access_permission:can_print": "true",
              "pdf:docinfo:creator_tool": "Writer",
              "access_permission:fill_in_form": "true",
              "pdf:encrypted": "false",
              "producer": "LibreOffice 5.1",
              "access_permission:can_modify": "true",
              "pdf:docinfo:producer": "LibreOffice 5.1",
              "pdf:docinfo:created": "2017-06-29T07:16:03Z",
              "Content-Type": "application/pdf"
            }
          },
          "file": {
            "extension": "pdf",
            "content_type": "application/pdf",
            "last_modified": "2017-06-29T12:47:54",
            "indexing_date": "2017-06-29T17:12:40.712",
            "filesize": 6899,
            "filename": "test_pdf.pdf",
            "url": "file:///tmp/es/test_pdf.pdf"
          },
          "path": {
            "encoded": "824b64ab42d4b63cda6e747e2b80e5",
            "root": "824b64ab42d4b63cda6e747e2b80e5",
            "virtual": "/",
            "real": "/tmp/es/test_pdf.pdf"
          }
        }
      }
    ]
  }
}
Please see fscrawler's settings file below:
{
  "name" : "pdf_upload",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "XX.XX.XX.XX",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "type" : "doc",
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}
Please let me know what I am missing now.
Please format the code. Don't quote.
If you did not touch the JSON content I can see that your document is generated as:
10.21.23.123 is the IP address of the PXE server.
You can see that there is a \n at the beginning, and I think that this is not going to match %{IP:ip} %{GREEDYDATA}.
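One way to verify this is to simulate against the exact stored value, including the leading newline; prefixing the pattern with %{DATA} (which is non-greedy) should let it skip past the newline. This is a sketch with the pipeline defined inline, not the only way to do it:

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "content",
          "patterns": ["%{DATA}%{IP:ip} %{GREEDYDATA}"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "content": "\n10.21.23.123 is the IP address of the PXE server.\n\n\n"
      }
    }
  ]
}
```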
I tried to handle the "\n", but I was not able to match it with the grok below even after substituting the whole content field. Please let me know what the problem could be, because in the simulator the "\n" is replaced by "-" and the pipeline works fine. Please see the pipeline configuration below:
PUT _ingest/pipeline/pdfgrep
{
"description" : "Testing Grok on PDF upload",
"processors" : [
{
"gsub": {
"field": "content",
"pattern": "\n",
"replacement": "-"
},
"grok": {
"field": "content",
"patterns": ["%{DATA}%{IP:ip_addr} %{GREEDYDATA}"]
}
}
]
}
Without using the pipeline in fscrawler, this was the output of
GET pdf_upload/doc/8c3f1f54665e48419b1a2313dd21624
{
"_index": "pdf_upload",
"_type": "doc",
"_id": "8c3f1f54665e48419b1a2313dd21624",
"_version": 1,
"_score": null,
"_source": {
"content": "\n10.21.23.123 is the IP address of the PXE server.\n\n\n",
"meta": {
"raw": {
"pdf:PDFVersion": "1.4",
"X-Parsed-By": "org.apache.tika.parser.DefaultParser",
"xmp:CreatorTool": "Writer",
"access_permission:modify_annotations": "true",
"access_permission:can_print_degraded": "true",
"meta:creation-date": "2017-06-29T07:16:03Z",
"created": "Thu Jun 29 12:46:03 IST 2017",
"access_permission:extract_for_accessibility": "true",
"access_permission:assemble_document": "true",
"xmpTPg:NPages": "1",
"Creation-Date": "2017-06-29T07:16:03Z",
"dcterms:created": "2017-06-29T07:16:03Z",
"dc:format": "application/pdf; version=1.4",
"access_permission:extract_content": "true",
"access_permission:can_print": "true",
"pdf:docinfo:creator_tool": "Writer",
"access_permission:fill_in_form": "true",
"pdf:encrypted": "false",
"producer": "LibreOffice 5.1",
"access_permission:can_modify": "true",
"pdf:docinfo:producer": "LibreOffice 5.1",
"pdf:docinfo:created": "2017-06-29T07:16:03Z",
"Content-Type": "application/pdf"
}
},
"file": {
"extension": "pdf",
"content_type": "application/pdf",
"last_modified": "2017-06-29T12:47:54",
"indexing_date": "2017-06-30T10:44:54.864",
"filesize": 6899,
"filename": "test_pdf.pdf",
"url": "file:///tmp/es/test_pdf.pdf"
},
"path": {
"encoded": "824b64ab42d4b63cda6e747e2b80e5",
"root": "824b64ab42d4b63cda6e747e2b80e5",
"virtual": "/",
"real": "/tmp/es/test_pdf.pdf"
}
},
"fields": {
"file.last_modified": [
1498740474000
],
"meta.raw.Creation-Date": [
1498720563000
],
"meta.raw.meta:creation-date": [
1498720563000
],
"meta.raw.pdf:docinfo:created": [
1498720563000
],
"file.indexing_date": [
1498819494864
],
"meta.raw.dcterms:created": [
1498720563000
]
},
"sort": [
1498740474000
]
}
Seeing that the above content field contains "\n", I simulated the same thing. Please see below:
POST _ingest/pipeline/pdfgrep/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "\n10.21.23.123 is the IP address of the PXE server.\n\n\n"
      }
    }
  ]
}
The output of the simulator is given below:
{
"docs": [
{
"doc": {
"_index": "_index",
"_type": "_type",
"_id": "_id",
"_source": {
"ip_addr": "10.21.23.123",
"content": "-10.21.23.123 is the IP address of the PXE server.---"
},
"_ingest": {
"timestamp": "2017-06-30T05:27:13.563Z"
}
}
}
]
}
As you can see above, everything should work, but when I use fscrawler with "pipeline" : "pdfgrep" it does not work, and I get an error like "field" parameter is now invalid. Please select a new field in Discover.
{
"name" : "pdf_upload",
"fs" : {
"url" : "/tmp/es",
"update_rate" : "15m",
"excludes" : [ "~*" ],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false
},
"elasticsearch" : {
"nodes" : [ {
"pipeline" : "pdfgrep",
"host" : "XX.XX.XX.XX",
"port" : 9200,
"scheme" : "HTTP"
} ],
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
Please let me know if you can find anything wrong in the above process.
Can you share your PDF document?
Please see the PDF document which has been attached. testing.pdf
This is weird. I can see this kind of error in the elasticsearch logs:
[2017-06-30T09:29:50,649][DEBUG][o.e.a.b.TransportBulkAction] [wsSoTCn] failed to execute pipeline [fscrawler_test_ingest_pipeline_392] for document [fscrawler_test_ingest_pipeline_392/folder/f614ecb527212f3abd1a7befab87132]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [content]
at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:166) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:88) [elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.2.jar:5.4.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [content]
... 11 more
Caused by: java.lang.IllegalArgumentException: field [content] not present as part of path [content]
at org.elasticsearch.ingest.IngestDocument.resolve(IngestDocument.java:340) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:108) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.common.GsubProcessor.execute(GsubProcessor.java:67) ~[?:?]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.4.2.jar:5.4.2]
... 9 more
field [content] not present as part of path [content]
... I don't get it. I need to dive into this. Stay tuned.
Forget it. That was a stupid thing in my test suite. It's working fine now. I have been able to parse your document, but note that I changed the pipeline job, as I believe there is an error in it, even though it seems to be accepted by ingest and works well when using simulation.
So at the end, here is what I have defined:
PUT _ingest/pipeline/fscrawler_test_ingest_pipeline_392
{
"description": "Testing Grok on PDF upload",
"processors": [
{
"gsub": {
"field": "content",
"pattern": "\n",
"replacement": "-"
}
},
{
"grok": {
"field": "content",
"patterns": [
"%{DATA}%{IP:ip_addr} %{GREEDYDATA}"
]
}
}
]
}
Which gives after running FSCrawler on your PDF doc:
GET fscrawler_test_ingest_pipeline_392/_search
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "fscrawler_test_ingest_pipeline_392",
"_type": "doc",
"_id": "e16c878ca9b865328b9d2b28daca7183",
"_score": 1,
"_source": {
"path": {
"virtual": "/issue-392.pdf",
"root": "f641fc38a5712ab557af652cbbbdc",
"real": "/var/folders/r_/r14sy86n2zb91jyz1ptb5b4w0000gn/T/junit8491860562203292911/resources/test_ingest_pipeline_392/issue-392.pdf"
},
"file": {
"extension": "pdf",
"filename": "issue-392.pdf",
"content_type": "application/pdf",
"indexing_date": "2017-06-30T07:34:38.059+0000",
"filesize": 6811,
"last_modified": "2017-06-30T07:34:27.000+0000",
"url": "file:///var/folders/r_/r14sy86n2zb91jyz1ptb5b4w0000gn/T/junit8491860562203292911/resources/test_ingest_pipeline_392/issue-392.pdf"
},
"meta": {
"raw": {
"pdf:PDFVersion": "1.4",
"X-Parsed-By": "org.apache.tika.parser.pdf.PDFParser",
"xmp:CreatorTool": "Writer",
"access_permission:modify_annotations": "true",
"access_permission:can_print_degraded": "true",
"meta:creation-date": "2017-06-30T05:11:03Z",
"created": "Thu Jun 29 22:11:03 MST 2017",
"access_permission:extract_for_accessibility": "true",
"access_permission:assemble_document": "true",
"xmpTPg:NPages": "1",
"Creation-Date": "2017-06-30T05:11:03Z",
"dcterms:created": "2017-06-30T05:11:03Z",
"dc:format": "application/pdf; version=1.4",
"access_permission:extract_content": "true",
"access_permission:can_print": "true",
"pdf:docinfo:creator_tool": "Writer",
"access_permission:fill_in_form": "true",
"pdf:encrypted": "false",
"producer": "LibreOffice 5.1",
"access_permission:can_modify": "true",
"pdf:docinfo:producer": "LibreOffice 5.1",
"pdf:docinfo:created": "2017-06-30T05:11:03Z",
"Content-Type": "application/pdf"
}
},
"ip_addr": "10.21.23.123",
"content": "-10.21.23.123 is the IP address of the PXE.--10.21.23.123 is the IP address of the PXE.----"
}
}
]
}
}
So ip_addr is here.
I tested this with elasticsearch 5.4.2 and a build made from the master branch of FSCrawler. There may be an issue with the latest published SNAPSHOT though, if you downloaded it from the Sonatype repo, as it might have been built against a working branch.
Which version of FSCrawler are you using?
I have been using FSCrawler version 2.2. No, I have not downloaded it from the Sonatype repo; I got it from GitHub only. Will you please provide the link so that I can download the latest working version of FSCrawler?
There is a link in the README: https://github.com/dadoonet/fscrawler#download-fscrawler
I just pushed a new test based on your issue. It will trigger a new build any time soon, which will then publish the latest version of the SNAPSHOT. Wait a bit for the Sonatype repo to get a build from today. It will be the "right" version to use.
LMK.
I downloaded it from the Maven repository; I think it may have some problem. I will try to download FSCrawler from Sonatype once you publish it today.
Thank you so much in advance because I think it will solve my problem as you have already tested it at your end.
Fingers crossed!!!
So this is the latest version at the time I'm posting this answer: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler/2.3-SNAPSHOT/fscrawler-2.3-20170630.081459-56.zip
LMK
I am still not getting the required field, i.e.
ip_addr
, via grok. My elasticsearch version is 5.4.3 and I downloaded the snapshot you mentioned above. Please see my _settings.json below:
{
"name" : "fscrawler_test_ingest_pipeline_392",
"fs" : {
"url" : "/tmp/es",
"update_rate" : "15m",
"excludes" : [ "~*" ],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : false,
"pdf_ocr" : true
},
"elasticsearch" : {
"nodes" : [ {
"pipeline" : "fscrawler_test_ingest_pipeline_392",
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
PUT _ingest/pipeline/fscrawler_test_ingest_pipeline_392
{
"description": "Testing Grok on PDF upload",
"processors": [
{
"gsub": {
"field": "content",
"pattern": "\n",
"replacement": "-"
}
},
{
"grok": {
"field": "content",
"patterns": [
"%{DATA}%{IP:ip_addr} %{GREEDYDATA}"
]
}
}
]
}
GET fscrawler_test_ingest_pipeline_392/_search
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "fscrawler_test_ingest_pipeline_392",
"_type": "doc",
"_id": "56375575e669cfc88cdc433e87c219b",
"_score": 1,
"_source": {
"content": """
10.21.23.123 is the IP address of the PXE.
""",
"meta": {
"raw": {
"pdf:PDFVersion": "1.4",
"X-Parsed-By": "org.apache.tika.parser.pdf.PDFParser",
"xmp:CreatorTool": "Writer",
"access_permission:modify_annotations": "true",
"access_permission:can_print_degraded": "true",
"meta:creation-date": "2017-06-30T05:11:03Z",
"created": "Fri Jun 30 10:41:03 IST 2017",
"access_permission:extract_for_accessibility": "true",
"access_permission:assemble_document": "true",
"xmpTPg:NPages": "1",
"Creation-Date": "2017-06-30T05:11:03Z",
"dcterms:created": "2017-06-30T05:11:03Z",
"dc:format": "application/pdf; version=1.4",
"access_permission:extract_content": "true",
"access_permission:can_print": "true",
"pdf:docinfo:creator_tool": "Writer",
"access_permission:fill_in_form": "true",
"pdf:encrypted": "false",
"producer": "LibreOffice 5.1",
"access_permission:can_modify": "true",
"pdf:docinfo:producer": "LibreOffice 5.1",
"pdf:docinfo:created": "2017-06-30T05:11:03Z",
"Content-Type": "application/pdf"
}
},
"file": {
"extension": "pdf",
"content_type": "application/pdf",
"last_modified": "2017-07-01T03:52:13.000+0000",
"indexing_date": "2017-07-01T04:21:34.006+0000",
"filesize": 6811,
"filename": "testing.pdf",
"url": "file:///tmp/es/testing.pdf"
},
"path": {
"root": "824b64ab42d4b63cda6e747e2b80e5",
"virtual": "/testing.pdf",
"real": "/tmp/es/testing.pdf"
}
}
},
{
"_index": "fscrawler_test_ingest_pipeline_392",
"_type": "folder",
"_id": "824b64ab42d4b63cda6e747e2b80e5",
"_score": 1,
"_source": {
"root": "d42b9c57d24cf5db3bd8d332dc35437f",
"virtual": "/",
"real": "/tmp/es"
}
}
]
}
}
Please share your _settings.json file as well, so that I can compare it with mine for more details.
I need to try outside the context of an integration test may be. Here is the test I wrote. https://github.com/dadoonet/fscrawler/commit/83fa1ae620e6a8ea5b4e1a8b0491f6ea9b63169e
Not sure when I'll be able to run it though.
I can reproduce what you are seeing when running from the command line. Investigating...
I think I know what is wrong in your case:
{
"name" : "fscrawler_test_ingest_pipeline_392",
"fs" : {
"url" : "/tmp/es"
},
"elasticsearch" : {
"nodes" : [ {
"pipeline" : "fscrawler_test_ingest_pipeline_392",
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
}
}
should be instead:
{
"name" : "fscrawler_test_ingest_pipeline_392",
"fs" : {
"url" : "/tmp/es"
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"pipeline" : "fscrawler_test_ingest_pipeline_392",
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
}
}
A pipeline is not for a specific node but a global elasticsearch setting.
But this issue actually reveals something I did not think about. The pipeline is actually executed on both doc and folder documents. Which is totally wrong.
It causes errors while injecting folders:
11:13:57,119 DEBUG [f.p.e.c.f.c.BulkProcessor] Error for job/folder/2388303e676399fcc46c92243c3b125d for null operation: {type=exception, reason=java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [content], caused_by={type=illegal_argument_exception, reason=java.lang.IllegalArgumentException: field [content] not present as part of path [content], caused_by={type=illegal_argument_exception, reason=field [content] not present as part of path [content]}}, header={processor_type=gsub}}
I'm going to open a new issue about it.
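As a possible workaround until that is fixed, and assuming your Elasticsearch version supports the ignore_missing option on these processors (an assumption worth checking for 5.4.x), the pipeline could be made to skip documents, such as folders, that have no content field:

```json
PUT _ingest/pipeline/fscrawler_test_ingest_pipeline_392
{
  "description": "Testing Grok on PDF upload",
  "processors": [
    {
      "gsub": {
        "field": "content",
        "pattern": "\n",
        "replacement": "-",
        "ignore_missing": true
      }
    },
    {
      "grok": {
        "field": "content",
        "patterns": ["%{DATA}%{IP:ip_addr} %{GREEDYDATA}"],
        "ignore_missing": true
      }
    }
  ]
}
```

With this, folder documents without a content field would pass through the pipeline untouched instead of failing the bulk request.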
I have added the "pipeline" setting into
_settings.json
for the ingest pipeline which I created in Elasticsearch, and that pipeline is working properly. The ingest pipeline I created is given below,
The above configuration is not filtering the "content" field as per the grok filters applied after the document filtering; it displays the whole content as-is, without parsing it through the grok filters.
Please let me know if I am missing anything.