gwalashish closed this issue 7 years ago
I believe that what has been extracted does not match the grok pattern you defined.
Can you just try to set in your pipeline a field instead?
{
"set": {
"field": "foo",
"value": "bar"
}
}
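If it helps, you can check that a pipeline actually runs before wiring it into FSCrawler by using the simulate API with the pipeline defined inline (a minimal sketch of the same set processor):

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "foo",
          "value": "bar"
        }
      }
    ]
  },
  "docs": [
    { "_source": { "content": "anything" } }
  ]
}
```

The simulated document in the response should come back with a `foo` field set to `bar`.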
My document contains "10.23.22.22 is the IP address of the server". The grok pattern matches properly when I run the simulation, but with fscrawler it is not working:
PUT _ingest/pipeline/fscrawler
{
"description" : "Testing Grok on PDF upload",
"processors" : [
{
"grok": {
"field": "content",
"patterns": ["%{IP:ip} %{GREEDYDATA}"]
}
}
]
}
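As a sanity check, the same pattern can be run through the simulate API against the exact sentence (a sketch, using the pipeline name defined above):

```json
POST _ingest/pipeline/fscrawler/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "10.23.22.22 is the IP address of the server"
      }
    }
  ]
}
```

If the pattern matches, the response should contain an `ip` field with the value `10.23.22.22`.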
I have made the changes to the pipeline as you mentioned above:
PUT _ingest/pipeline/fscrawler
{
"description" : "Testing Grok on PDF upload",
"processors" : [
{
"set": {
"field": "content",
"value": "Test 123"
}
}
]
}
After doing this, I get an error like "field is a required parameter" when viewing the index in Discover.
Please tell me if I am doing anything wrong.
Can you share what the JSON _source is for your PDF document once it has been parsed by FSCrawler?
GET index/doc/id
It says "found" : false.
Of course. You need to replace index, doc, and id with your index name, the right type, and the right id of the document.
You can find them by running a search.
I tried the same thing,
GET pdf_upload/_search
but it does not give the information we are looking for. If I do the same thing on other indexes, it gives the _id and _type information.
> but it does not give any information which we are looking for

Maybe. I can't tell as I can't see it.
Anyway, can you remove the "pipeline" : "fscrawler"
from your fscrawler settings, try again, and give back the result of the search?
Please share also your full fscrawler config file. And please format it in github as it's more readable.
Thanks.
Even after removing "pipeline" : "fscrawler"
from the settings file it was not working, so I deleted the whole job directory and recreated it. After that, the index got created.
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "pdf_upload",
        "_type": "folder",
        "_id": "824b64ab42d4b63cda6e747e2b80e5",
        "_score": 1,
        "_source": {
          "encoded": "824b64ab42d4b63cda6e747e2b80e5",
          "root": "824b64ab42d4b63cda6e747e2b80e5",
          "real": "/tmp/es"
        }
      },
      {
        "_index": "pdf_upload",
        "_type": "doc",
        "_id": "8c3f1f54665e48419b1a2313dd21624",
        "_score": 1,
        "_source": {
          "content": """
10.21.23.123 is the IP address of the PXE server.
""",
          "meta": {
            "raw": {
              "pdf:PDFVersion": "1.4",
              "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
              "xmp:CreatorTool": "Writer",
              "access_permission:modify_annotations": "true",
              "access_permission:can_print_degraded": "true",
              "meta:creation-date": "2017-06-29T07:16:03Z",
              "created": "Thu Jun 29 12:46:03 IST 2017",
              "access_permission:extract_for_accessibility": "true",
              "access_permission:assemble_document": "true",
              "xmpTPg:NPages": "1",
              "Creation-Date": "2017-06-29T07:16:03Z",
              "dcterms:created": "2017-06-29T07:16:03Z",
              "dc:format": "application/pdf; version=1.4",
              "access_permission:extract_content": "true",
              "access_permission:can_print": "true",
              "pdf:docinfo:creator_tool": "Writer",
              "access_permission:fill_in_form": "true",
              "pdf:encrypted": "false",
              "producer": "LibreOffice 5.1",
              "access_permission:can_modify": "true",
              "pdf:docinfo:producer": "LibreOffice 5.1",
              "pdf:docinfo:created": "2017-06-29T07:16:03Z",
              "Content-Type": "application/pdf"
            }
          },
          "file": {
            "extension": "pdf",
            "content_type": "application/pdf",
            "last_modified": "2017-06-29T12:47:54",
            "indexing_date": "2017-06-29T17:12:40.712",
            "filesize": 6899,
            "filename": "test_pdf.pdf",
            "url": "file:///tmp/es/test_pdf.pdf"
          },
          "path": {
            "encoded": "824b64ab42d4b63cda6e747e2b80e5",
            "root": "824b64ab42d4b63cda6e747e2b80e5",
            "virtual": "/",
            "real": "/tmp/es/test_pdf.pdf"
          }
        }
      }
    ]
  }
}
Please see fscrawler's settings file below:
{
  "name" : "pdf_upload",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "XX.XX.XX.XX",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "type" : "doc",
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}
Please let me know what I am missing now.
Please format the code. Don't quote.
If you did not touch the JSON content I can see that your document is generated as:
10.21.23.123 is the IP address of the PXE server.
You can see that there is a \n at the beginning, and I think that this is not going to match %{IP:ip} %{GREEDYDATA}.
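One way to verify this is to simulate against the exact stored value, including the leading newline; prefixing the pattern with %{DATA} (which is non-greedy) should let it skip past the newline. This is a sketch with the pipeline defined inline, not the only way to do it:

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "content",
          "patterns": ["%{DATA}%{IP:ip} %{GREEDYDATA}"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "content": "\n10.21.23.123 is the IP address of the PXE server.\n\n\n"
      }
    }
  ]
}
```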
I tried to handle the "\n", but I was not able to match it with the grok below even after substituting the whole content field. Please let me know what the problem could be, because in the simulator the "\n" is replaced by "-" and the pipeline works fine. Please see the pipeline configuration below:
PUT _ingest/pipeline/pdfgrep
{
"description" : "Testing Grok on PDF upload",
"processors" : [
{
"gsub": {
"field": "content",
"pattern": "\n",
"replacement": "-"
},
"grok": {
"field": "content",
"patterns": ["%{DATA}%{IP:ip_addr} %{GREEDYDATA}"]
}
}
]
}
Without using the pipeline in fscrawler, this was the output of
GET pdf_upload/doc/8c3f1f54665e48419b1a2313dd21624
{
"_index": "pdf_upload",
"_type": "doc",
"_id": "8c3f1f54665e48419b1a2313dd21624",
"_version": 1,
"_score": null,
"_source": {
"content": "\n10.21.23.123 is the IP address of the PXE server.\n\n\n",
"meta": {
"raw": {
"pdf:PDFVersion": "1.4",
"X-Parsed-By": "org.apache.tika.parser.DefaultParser",
"xmp:CreatorTool": "Writer",
"access_permission:modify_annotations": "true",
"access_permission:can_print_degraded": "true",
"meta:creation-date": "2017-06-29T07:16:03Z",
"created": "Thu Jun 29 12:46:03 IST 2017",
"access_permission:extract_for_accessibility": "true",
"access_permission:assemble_document": "true",
"xmpTPg:NPages": "1",
"Creation-Date": "2017-06-29T07:16:03Z",
"dcterms:created": "2017-06-29T07:16:03Z",
"dc:format": "application/pdf; version=1.4",
"access_permission:extract_content": "true",
"access_permission:can_print": "true",
"pdf:docinfo:creator_tool": "Writer",
"access_permission:fill_in_form": "true",
"pdf:encrypted": "false",
"producer": "LibreOffice 5.1",
"access_permission:can_modify": "true",
"pdf:docinfo:producer": "LibreOffice 5.1",
"pdf:docinfo:created": "2017-06-29T07:16:03Z",
"Content-Type": "application/pdf"
}
},
"file": {
"extension": "pdf",
"content_type": "application/pdf",
"last_modified": "2017-06-29T12:47:54",
"indexing_date": "2017-06-30T10:44:54.864",
"filesize": 6899,
"filename": "test_pdf.pdf",
"url": "file:///tmp/es/test_pdf.pdf"
},
"path": {
"encoded": "824b64ab42d4b63cda6e747e2b80e5",
"root": "824b64ab42d4b63cda6e747e2b80e5",
"virtual": "/",
"real": "/tmp/es/test_pdf.pdf"
}
},
"fields": {
"file.last_modified": [
1498740474000
],
"meta.raw.Creation-Date": [
1498720563000
],
"meta.raw.meta:creation-date": [
1498720563000
],
"meta.raw.pdf:docinfo:created": [
1498720563000
],
"file.indexing_date": [
1498819494864
],
"meta.raw.dcterms:created": [
1498720563000
]
},
"sort": [
1498740474000
]
}
Seeing that the above content field contains "\n", I simulated the same thing. Please see below:
POST _ingest/pipeline/pdfgrep/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "\n10.21.23.123 is the IP address of the PXE server.\n\n\n"
      }
    }
  ]
}
The output of the simulator is given below:
{
"docs": [
{
"doc": {
"_index": "_index",
"_type": "_type",
"_id": "_id",
"_source": {
"ip_addr": "10.21.23.123",
"content": "-10.21.23.123 is the IP address of the PXE server.---"
},
"_ingest": {
"timestamp": "2017-06-30T05:27:13.563Z"
}
}
}
]
}
As you can see above, everything should work, but when I use fscrawler with "pipeline" : "pdfgrep" it does not work, and I get an error like "field" parameter is now invalid. Please select a new field in Discover.
{
"name" : "pdf_upload",
"fs" : {
"url" : "/tmp/es",
"update_rate" : "15m",
"excludes" : [ "~*" ],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false
},
"elasticsearch" : {
"nodes" : [ {
"pipeline" : "pdfgrep",
"host" : "XX.XX.XX.XX",
"port" : 9200,
"scheme" : "HTTP"
} ],
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
Please let me know if you can find anything wrong in the above process.
Can you share your PDF document?
Please see the PDF document which has been attached. testing.pdf
This is weird. I can see this kind of error in the elasticsearch logs:
[2017-06-30T09:29:50,649][DEBUG][o.e.a.b.TransportBulkAction] [wsSoTCn] failed to execute pipeline [fscrawler_test_ingest_pipeline_392] for document [fscrawler_test_ingest_pipeline_392/folder/f614ecb527212f3abd1a7befab87132]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [content]
at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:166) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:88) [elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.2.jar:5.4.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [content]
... 11 more
Caused by: java.lang.IllegalArgumentException: field [content] not present as part of path [content]
at org.elasticsearch.ingest.IngestDocument.resolve(IngestDocument.java:340) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.IngestDocument.getFieldValue(IngestDocument.java:108) ~[elasticsearch-5.4.2.jar:5.4.2]
at org.elasticsearch.ingest.common.GsubProcessor.execute(GsubProcessor.java:67) ~[?:?]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.4.2.jar:5.4.2]
... 9 more
field [content] not present as part of path [content]
... I don't get it. I need to dive into this. Stay tuned.
Forget it. That was a stupid thing in my test suite. It's working fine now. I have been able to parse your document, but note that I changed the pipeline job, as I believe there is an error in it, even though it seems to be accepted by ingest and works well when using simulation.
So at the end, here is what I have defined:
PUT _ingest/pipeline/fscrawler_test_ingest_pipeline_392
{
"description": "Testing Grok on PDF upload",
"processors": [
{
"gsub": {
"field": "content",
"pattern": "\n",
"replacement": "-"
}
},
{
"grok": {
"field": "content",
"patterns": [
"%{DATA}%{IP:ip_addr} %{GREEDYDATA}"
]
}
}
]
}
Which gives after running FSCrawler on your PDF doc:
GET fscrawler_test_ingest_pipeline_392/_search
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "fscrawler_test_ingest_pipeline_392",
"_type": "doc",
"_id": "e16c878ca9b865328b9d2b28daca7183",
"_score": 1,
"_source": {
"path": {
"virtual": "/issue-392.pdf",
"root": "f641fc38a5712ab557af652cbbbdc",
"real": "/var/folders/r_/r14sy86n2zb91jyz1ptb5b4w0000gn/T/junit8491860562203292911/resources/test_ingest_pipeline_392/issue-392.pdf"
},
"file": {
"extension": "pdf",
"filename": "issue-392.pdf",
"content_type": "application/pdf",
"indexing_date": "2017-06-30T07:34:38.059+0000",
"filesize": 6811,
"last_modified": "2017-06-30T07:34:27.000+0000",
"url": "file:///var/folders/r_/r14sy86n2zb91jyz1ptb5b4w0000gn/T/junit8491860562203292911/resources/test_ingest_pipeline_392/issue-392.pdf"
},
"meta": {
"raw": {
"pdf:PDFVersion": "1.4",
"X-Parsed-By": "org.apache.tika.parser.pdf.PDFParser",
"xmp:CreatorTool": "Writer",
"access_permission:modify_annotations": "true",
"access_permission:can_print_degraded": "true",
"meta:creation-date": "2017-06-30T05:11:03Z",
"created": "Thu Jun 29 22:11:03 MST 2017",
"access_permission:extract_for_accessibility": "true",
"access_permission:assemble_document": "true",
"xmpTPg:NPages": "1",
"Creation-Date": "2017-06-30T05:11:03Z",
"dcterms:created": "2017-06-30T05:11:03Z",
"dc:format": "application/pdf; version=1.4",
"access_permission:extract_content": "true",
"access_permission:can_print": "true",
"pdf:docinfo:creator_tool": "Writer",
"access_permission:fill_in_form": "true",
"pdf:encrypted": "false",
"producer": "LibreOffice 5.1",
"access_permission:can_modify": "true",
"pdf:docinfo:producer": "LibreOffice 5.1",
"pdf:docinfo:created": "2017-06-30T05:11:03Z",
"Content-Type": "application/pdf"
}
},
"ip_addr": "10.21.23.123",
"content": "-10.21.23.123 is the IP address of the PXE.--10.21.23.123 is the IP address of the PXE.----"
}
}
]
}
}
So ip_addr is here.
I tested this with elasticsearch 5.4.2 and a build made from the master branch of FSCrawler. There may be an issue with the latest published SNAPSHOT though, if you downloaded it from the Sonatype repo, as it might have been built against a working branch.
Which version of FSCrawler are you using?
I have been using FSCrawler version 2.2. No, I have not downloaded it from the Sonatype repo; I got it from GitHub only. Will you please provide the link so that I can download the latest working version of FSCrawler?
There is a link in the README: https://github.com/dadoonet/fscrawler#download-fscrawler
I just pushed a new test based on your issue. It will trigger a new build any time soon, which will then publish the latest version of the SNAPSHOT. Wait a bit for the Sonatype repo to get a build from today. It will be the "right" version to use.
LMK.
I downloaded it from the Maven repository; I think it may have some problem. I will try to download FSCrawler from Sonatype once you publish it today.
Thank you so much in advance because I think it will solve my problem as you have already tested it at your end.
Fingers crossed!!!
So this is the latest version at the time I'm posting this answer: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler/2.3-SNAPSHOT/fscrawler-2.3-20170630.081459-56.zip
LMK
I am still not getting the required field, i.e.
ip_addr
, via grok. My elasticsearch version is 5.4.3 and I downloaded the snapshot you mentioned above. Please see my _settings.json below:
{
"name" : "fscrawler_test_ingest_pipeline_392",
"fs" : {
"url" : "/tmp/es",
"update_rate" : "15m",
"excludes" : [ "~*" ],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false,
"continue_on_error" : false,
"pdf_ocr" : true
},
"elasticsearch" : {
"nodes" : [ {
"pipeline" : "fscrawler_test_ingest_pipeline_392",
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}
PUT _ingest/pipeline/fscrawler_test_ingest_pipeline_392
{
"description": "Testing Grok on PDF upload",
"processors": [
{
"gsub": {
"field": "content",
"pattern": "\n",
"replacement": "-"
}
},
{
"grok": {
"field": "content",
"patterns": [
"%{DATA}%{IP:ip_addr} %{GREEDYDATA}"
]
}
}
]
}
GET fscrawler_test_ingest_pipeline_392/_search
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "fscrawler_test_ingest_pipeline_392",
"_type": "doc",
"_id": "56375575e669cfc88cdc433e87c219b",
"_score": 1,
"_source": {
"content": """
10.21.23.123 is the IP address of the PXE.
""",
"meta": {
"raw": {
"pdf:PDFVersion": "1.4",
"X-Parsed-By": "org.apache.tika.parser.pdf.PDFParser",
"xmp:CreatorTool": "Writer",
"access_permission:modify_annotations": "true",
"access_permission:can_print_degraded": "true",
"meta:creation-date": "2017-06-30T05:11:03Z",
"created": "Fri Jun 30 10:41:03 IST 2017",
"access_permission:extract_for_accessibility": "true",
"access_permission:assemble_document": "true",
"xmpTPg:NPages": "1",
"Creation-Date": "2017-06-30T05:11:03Z",
"dcterms:created": "2017-06-30T05:11:03Z",
"dc:format": "application/pdf; version=1.4",
"access_permission:extract_content": "true",
"access_permission:can_print": "true",
"pdf:docinfo:creator_tool": "Writer",
"access_permission:fill_in_form": "true",
"pdf:encrypted": "false",
"producer": "LibreOffice 5.1",
"access_permission:can_modify": "true",
"pdf:docinfo:producer": "LibreOffice 5.1",
"pdf:docinfo:created": "2017-06-30T05:11:03Z",
"Content-Type": "application/pdf"
}
},
"file": {
"extension": "pdf",
"content_type": "application/pdf",
"last_modified": "2017-07-01T03:52:13.000+0000",
"indexing_date": "2017-07-01T04:21:34.006+0000",
"filesize": 6811,
"filename": "testing.pdf",
"url": "file:///tmp/es/testing.pdf"
},
"path": {
"root": "824b64ab42d4b63cda6e747e2b80e5",
"virtual": "/testing.pdf",
"real": "/tmp/es/testing.pdf"
}
}
},
{
"_index": "fscrawler_test_ingest_pipeline_392",
"_type": "folder",
"_id": "824b64ab42d4b63cda6e747e2b80e5",
"_score": 1,
"_source": {
"root": "d42b9c57d24cf5db3bd8d332dc35437f",
"virtual": "/",
"real": "/tmp/es"
}
}
]
}
}
Please share your _settings.json file as well, so that I can compare it with mine for more details.
I need to try outside the context of an integration test may be. Here is the test I wrote. https://github.com/dadoonet/fscrawler/commit/83fa1ae620e6a8ea5b4e1a8b0491f6ea9b63169e
Not sure when I'll be able to run it though.
I can reproduce what you are seeing when running from the command line. Investigating...
I think I know what is wrong in your case:
{
"name" : "fscrawler_test_ingest_pipeline_392",
"fs" : {
"url" : "/tmp/es"
},
"elasticsearch" : {
"nodes" : [ {
"pipeline" : "fscrawler_test_ingest_pipeline_392",
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
}
}
should be instead:
{
"name" : "fscrawler_test_ingest_pipeline_392",
"fs" : {
"url" : "/tmp/es"
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"pipeline" : "fscrawler_test_ingest_pipeline_392",
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
}
}
A pipeline is not for a specific node but a global elasticsearch setting.
But this issue actually reveals something I did not think about. The pipeline is actually executed on both doc and folder documents. Which is totally wrong.
It causes errors while injecting folders:
11:13:57,119 DEBUG [f.p.e.c.f.c.BulkProcessor] Error for job/folder/2388303e676399fcc46c92243c3b125d for null operation: {type=exception, reason=java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [content] not present as part of path [content], caused_by={type=illegal_argument_exception, reason=java.lang.IllegalArgumentException: field [content] not present as part of path [content], caused_by={type=illegal_argument_exception, reason=field [content] not present as part of path [content]}}, header={processor_type=gsub}}
I'm going to open a new issue about it.
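As a possible workaround until that is fixed, and assuming your Elasticsearch version supports the ignore_missing option on these processors (an assumption worth checking for 5.4.x), the pipeline could be made to skip documents, such as folders, that have no content field:

```json
PUT _ingest/pipeline/fscrawler_test_ingest_pipeline_392
{
  "description": "Testing Grok on PDF upload",
  "processors": [
    {
      "gsub": {
        "field": "content",
        "pattern": "\n",
        "replacement": "-",
        "ignore_missing": true
      }
    },
    {
      "grok": {
        "field": "content",
        "patterns": ["%{DATA}%{IP:ip_addr} %{GREEDYDATA}"],
        "ignore_missing": true
      }
    }
  ]
}
```

With this, folder documents without a content field would pass through the pipeline untouched instead of failing the bulk request.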
I have added the "pipeline" setting into
_settings.json
for the ingest pipeline which I created in Elasticsearch, and that pipeline is working properly. The ingest pipeline I created is given below,
The above configuration is not filtering the "content" field as per the grok filters applied after the document filtering; it displays the whole content as-is, without parsing it through the grok filters.
Please let me know if I am missing anything.