elastic / elasticsearch-php

Official PHP client for Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/index.html
MIT License

Bulk Indexing #177

Closed Blackhawk2165 closed 7 years ago

Blackhawk2165 commented 9 years ago

Hello,

My name is Austin Harmon and I am new to Elasticsearch. I am looking to index a couple hundred thousand documents, and I would like to use the PHP client to do it. I have my index set up with one shard and one replica since I have a smaller number of documents. I have looked over the syntax on the Elasticsearch site and on GitHub. This is what my index.php file looks like:

<?php

require 'vendor/autoload.php';

$client = new Elasticsearch\Client();
$indexParams['index'] = 'rvuehistoricaldocuments2009-2013'; //mapping code

for ($i = 0; $i <= 50; $i++) {
    $params['body'][] = array(
        'index' => array(
            'index' => 'rvuehistoricaldocuments2009-2013',
            'type' => 'documents',
            '_id' => $i,
            'body' => array(
                '_source' => array( //everything in body is mapping code
                    'enabled' => true
                ),
                'properties' => array(
                    'doc_name' => array(
                        'type' => 'string',
                        'analyzer' => 'standard'
                    ),
                    'description' => array(
                        'type' => 'string'
                    )
                )
            )
        )
    );

    $params['body'][] = array(
        'rVuedoc' . $i => 'ID' . $i
    );

    $indexParams['body']['mappings']['documents'] = $myTypeMapping; //mapping code
}

json_encode($params);

$client->indices()->create($indexParams); //mapping code
$responses = $client->bulk($params);

?>

I'm not sure if I have everything I need or if I'm doing this right, so if you could let me know whether this looks correct, that would be very helpful.

Thank you, Austin Harmon

polyfractal commented 9 years ago

You've mixed index/mapping creation with bulk indexing. That syntax won't work. First you need to create the index with mappings, then start bulk indexing into the newly created index.

Non-tested code, but something like this is more what you are looking for:

require 'vendor/autoload.php';

$client = new Elasticsearch\Client();

// Create the index and mappings
$mapping['index'] = 'rvuehistoricaldocuments2009-2013'; //mapping code
$mapping['body'] = array(
    'mappings' => array(
        'documents' => array(
            '_source' => array(
                'enabled' => true
            ),
            'properties' => array(
                'doc_name' => array(
                    'type' => 'string',
                    'analyzer' => 'standard'
                ),
                'description' => array(
                    'type' => 'string'
                )
            )
        )
    )
);

$client->indices()->create($mapping);

// Now index the documents
for ($i = 0; $i <= 10000; $i++) {
    $params['body'][] = array(
        'index' => array(
            'index' => 'rvuehistoricaldocuments2009-2013',
            'type' => 'documents',
            '_id' => $i,
            'body' => array(
                'foo' => 'bar' // Document body goes here
            )
        )
    );

    // Every 1000 documents stop and send the bulk request
    if ($i % 1000) {
        $responses = $client->bulk($params);

        // erase the old bulk request
        $params = array();

        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}

Blackhawk2165 commented 9 years ago

Okay, that makes more sense. I found a tutorial on indexing and then integrated the mapping, so I must have put it together wrong. I have one more question for now: how does Elasticsearch know where the directories are that have the files? I have some data saved on the machine Elasticsearch is running on, so I was going to index those documents first and then index the documents on a large multi-terabyte external drive. Where do I specify where to look for the documents?

Thanks for your help, Austin Harmon

polyfractal commented 9 years ago

No problem, happy to help :)

How does Elasticsearch know where the directories are that have the files?

Elasticsearch doesn't know where the files are at all. You have to write the "plumbing" code that imports data into Elasticsearch. So you'll just need to tell your PHP script where to load the data (e.g. file_get_contents() or similar), then start constructing bulk requests.

ES doesn't have any "import" functionality or anything. It all has to be inserted via your code.

Blackhawk2165 commented 9 years ago

So before I create the index and mapping, do I need to put that function or something similar with the file path in order to index properly? Also, I have a directory that contains directories, which in turn contain more directories and then files. Should I put the path to each directory with files in it individually, or can I just stop at the directory that holds all the other directories?

Blackhawk2165 commented 9 years ago

Also, I wanted to ask you about JSON encoding. I noticed you got rid of the json_encode() function. Why did you delete that?

polyfractal commented 9 years ago

You'll have to load all the data yourself, which would include recursively scanning directories to find files if that's how your data is stored. E.g. open a file, then parse it into an array and index:


$documentData = file_get_contents('/path/to/document/data.json');

// If the data is json, you can decode it
$documentData = json_decode($documentData);

// Or if the data was a csv, maybe split by line
//$documentData = explode("\n", $documentData);

// etc etc.  Depends on what format your input data is

// Now index the documents
for ($i = 0; $i < count($documentData); $i++) {
    $params['body'][] = array(
        'index' => array(
            'index' => 'rvuehistoricaldocuments2009-2013',
            'type' => 'documents',
            '_id' => $i,
            'body' => $documentData[$i]
        )
    );

    // Every 1000 documents stop and send the bulk request
    if ($i % 1000) {
        $responses = $client->bulk($params);

        // erase the old bulk request
        $params = array();

        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}

Also, I wanted to ask you about JSON encoding. I noticed you got rid of the json_encode() function. Why did you delete that?

The client will automatically serialize PHP arrays into valid JSON for you. So you just need to provide a PHP array of the data you want to index.

Some API endpoints, like the Bulk API, use a special newline-delimited JSON syntax, which is another reason to let the client handle serialization for you.
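
For illustration, here's a rough (untested) sketch with made-up field names: you build alternating action/document arrays, and the client serializes them into the newline-delimited JSON the Bulk API expects.

$params = array();

// Action metadata for one document...
$params['body'][] = array(
    'index' => array(
        '_index' => 'my_index',
        '_type'  => 'documents',
        '_id'    => 1
    )
);

// ...followed immediately by the document itself
$params['body'][] = array(
    'doc_name'    => 'report.pdf',
    'description' => 'Quarterly report'
);

// On the wire the client produces two NDJSON lines:
//   {"index":{"_index":"my_index","_type":"documents","_id":1}}
//   {"doc_name":"report.pdf","description":"Quarterly report"}
// Calling json_encode() on the body yourself would break that framing.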

Blackhawk2165 commented 9 years ago

So I have a bunch of documents in all different formats like .docx, .csv, .ppt, .pdf, etc. Do I have to get them all into one format, or can I just put them into an array and index?

Blackhawk2165 commented 9 years ago

So can I index an entire directory as long as there are only files in it, or do I have to index each document separately?

Blackhawk2165 commented 9 years ago

Another question for you: can the file names have spaces in them? I know they can't when I cd into a directory on Linux (without quoting), but does that matter when indexing them and putting them in the file path in the PHP script, or do I need to eliminate all spaces?

polyfractal commented 9 years ago

So I have a bunch of documents in all different formats like .docx, .csv, .ppt, .pdf, etc. Do I have to get them all into one format, or can I just put them into an array and index? So can I index an entire directory as long as there are only files in it, or do I have to index each document separately?

Elasticsearch has no concept of files, directories, folders, disk drives, etc.

Elasticsearch only understands JSON formatting. So you will need to load those documents and somehow parse/transform them into JSON documents containing simple field : value pairs. It's just like with a database: you can't insert a .docx into MySQL...you have to insert a row which contains columns and values. With MySQL, you would first transform your data into some kind of row representation before inserting.

So with ES, you have to load/transform data and insert it as JSON.
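
As a rough (untested) sketch, say your input were a CSV with hypothetical doc_name and description columns; the transform step is just building one array per row:

$handle = fopen('/path/to/documents.csv', 'r');

while (($row = fgetcsv($handle)) !== false) {
    // One document per CSV row, as simple field => value pairs
    $doc = array(
        'doc_name'    => $row[0],
        'description' => $row[1]
    );

    // ...append $doc to a bulk request body as shown earlier
}

fclose($handle);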

There is an attachment plugin for Elasticsearch which might help you, although it can be limiting at times.

I'd recommend sitting down with the Elasticsearch: The Definitive Guide book and getting to know ES a little better. It sounds like there are some fundamental concepts you should learn before moving forward...it'll make the whole experience a lot better if you have solid fundamentals about how ES operates.

Blackhawk2165 commented 9 years ago

So, after doing a lot of reading and studying up: if I only want one node with one shard, and only one index (rvuehistoricaldocuments2009-2013) with one type (documents), would it be easier to put all the files I want to index under one directory? Since they will all be the same type, I could just run a loop with auto-generated IDs and index the documents that way.

polyfractal commented 9 years ago

Yeah, I think that would probably be the simplest way to do it. Otherwise you'll have to mess around with recursive directory scanning, which isn't the most pleasant thing to do in PHP :)

Blackhawk2165 commented 9 years ago

how does this look:

<?php

require 'vendor/autoload.php';

$client = new Elasticsearch/Client();

//Create the index and mappings
$mapping['index'] = 'rvuehistoricaldocuments2009-2013'; //mapping code
$mapping['body'] = array(
    'mappings' => array(
        'documents' => array(
            '_source' => array(
                'enabled' => true
            ),
            'properties' => array(
                'doc_name' => array(
                    'type' => 'string',
                    'analyzer' => 'standard'
                ),
                'description' => array(
                    'type' => 'string'
                )
            )
        )
    )
);

$client->indices()->create($mapping);

$documentData = file_get_contents('~/elkdata/for_elk_test_2014_11_24/Documents');

//Now index the documents

for ($i = 0; $i <= count($documentData); $i++) {
    $params['body'][] = array(
        'index' => array(
            'type' => 'documents',
            'body' => array(
                'foo' => 'bar' //Document body goes here
            )
        )
    );

    //Every 1000 documents stop and send the bulk request.
    if ($i % 1000) {
        $responses = $client->bulk($params);

        // erase the old bulk request
        $params = array();

        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}

?>

I didn't do any json_encode because when I go to index the documents I am putting them in an array, which puts them in JSON format for me, correct?

Thanks again for helping me out. I'm new to PHP and Elasticsearch, so you have been a tremendous help!

Blackhawk2165 commented 9 years ago

So it turns out there is way too much data to just shove it all into one directory. I thought it would be a simple solution, but I got denied :( Anyway, I've heard that these tasks are easier to do in Perl. Can I use Perl to index everything and then the PHP client to write the rest, or does mixing clients not work?

Blackhawk2165 commented 9 years ago

Have you seen this function before? http://php.net/manual/en/function.scandir.php

polyfractal commented 9 years ago

Can I use perl to index everything and then the php client to write the rest or does mixing clients not work?

Sure. There is also an Elasticsearch client for Perl. The syntax is a little different (more perl-y) but should be relatively similar. You could load all the docs into Elasticsearch using the Perl client, then use PHP to run queries, etc. There are also clients for Java, .NET, Groovy, Ruby, Python and JavaScript, depending on your preference :)

At the end of the day, all the clients are just building HTTP requests to send to the server, so they are all interchangeable really.
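
For illustration, here's a rough (untested) sketch of a bulk request built "by hand" with PHP's curl functions (endpoint and data are made up); this is essentially what every client is doing for you:

// Two NDJSON lines per document: action metadata, then the document.
// Note the trailing newline; the Bulk API requires it.
$ndjson = '{"index":{"_index":"my_index","_type":"documents","_id":1}}' . "\n"
        . '{"doc_name":"report.pdf"}' . "\n";

$ch = curl_init('http://localhost:9200/_bulk');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $ndjson);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);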

Have you seen this function before? http://php.net/manual/en/function.scandir.php

Yep, scandir is one way to do a directory search. You could also use dir()/opendir() or globs (discussed here). I would probably use a RecursiveIteratorIterator approach, as detailed here:

http://stackoverflow.com/a/14305746 and http://stackoverflow.com/a/2398163
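
Roughly (untested), collecting every file path under a root directory with that approach looks like:

$root = '/path/to/your/data'; // hypothetical root directory

// LEAVES_ONLY (the default) yields files rather than directories
$iter = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, RecursiveDirectoryIterator::SKIP_DOTS)
);

$files = array();
foreach ($iter as $fileInfo) {
    if ($fileInfo->isFile()) {
        $files[] = $fileInfo->getPathname();
    }
}

// $files now holds every file under $root, ready to load and index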

Blackhawk2165 commented 9 years ago

You are awesome, man, thanks for everything. You have been a huge help. I will let you know if I have any more issues.

Blackhawk2165 commented 9 years ago

Hello,

I've added the recursive code into my file. Should I include all of this in one file, or should I split it up?

Here is my code:

<?php

require 'vendor/autoload.php';

$client = new Elasticsearch/Client();

$root = realpath('~/elkdata/for_elk_test_2014_11_24/Agencies');

$iter = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, RecursiveDirectoryIterator::SKIP_DOTS),
    RecursiveIteratorIterator::SELF_FIRST,
    RecursiveIteratorIterator::CATCH_GET_CHILD
);

$paths = array($root);
foreach ($iter as $path => $dir) {
    if ($dir->isDir()) {
        $paths[] = $path;
    }
}

//Create the index and mappings
$mapping['index'] = 'rvuehistoricaldocuments2009-2013'; //mapping code
$mapping['body'] = array(
    'mappings' => array(
        'documents' => array(
            '_source' => array(
                'enabled' => true
            ),
            'properties' => array(
                'doc_name' => array(
                    'type' => 'string',
                    'analyzer' => 'standard'
                ),
                'description' => array(
                    'type' => 'string'
                )
            )
        )
    )
);

$client->indices()->create($mapping);

//Now index the documents

for ($i = 0; $i <= count($paths); $i++) {
    $params['body'][] = array(
        'index' => array(
            'type' => 'documents',
            'body' => array(
                'foo' => 'bar' //Document body goes here
            )
        )
    );

    //Every 1000 documents stop and send the bulk request.
    if ($i % 1000) {
        $responses = $client->bulk($params);

        // erase the old bulk request
        $params = array();

        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}

?>

Sorry I keep just posting my code in the text editor. I don't have an option for code formatting.

Thanks again, Austin Harmon

Blackhawk2165 commented 9 years ago

Hello,

I ran into an issue with the Elasticsearch/Client() class not being found. The error that comes up is:

PHP Fatal error: Class 'Elasticsearch' not found in /home/aharmon/php-files/newindex.php on line 5

my line 5 looks like this: $client = new Elasticsearch/Client();

do I need to have another require or include so that the class is recognized?

polyfractal commented 9 years ago

Yep, you need to include Composer's autoloader (which then loads the various classes on-demand):

require 'vendor/autoload.php';

$client = new Elasticsearch\Client();

Blackhawk2165 commented 9 years ago

That's what I have:

1 <?php
2
3 require '/home/aharmon/vendor/autoload.php';
4
5 $client = new Elasticsearch/Client();

then I go into the recursive code.

I've been looking up possible solutions, and I see that some people have this issue when APC is enabled. Is this relevant?

Blackhawk2165 commented 9 years ago

Okay, so I think my issue is that I get an error when I run composer.phar.

I set up my composer.json file which looks like this:

{ "require": { "elasticsearch/elasticseearch": "1.3.2" } }

I originally had ~1.0, but then it was just giving me errors for every version of Elasticsearch, so I narrowed it down to the version I downloaded.

The problem that occurs is that I am missing the curl extension for PHP.

polyfractal commented 9 years ago

Ah, so it probably didn't install ES-PHP at all then, since it couldn't satisfy the curl extension requirement.

Yeah, the PHP libcurl extension is required for the client to work. Curl is the HTTP transport that PHP uses to send requests to the server. You'll need to make sure your PHP installation has the extension installed (either compiled in or loaded as a dynamic extension).
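
A quick way to check is from PHP itself:

// Prints bool(true) if the curl extension is loaded
var_dump(extension_loaded('curl'));

// Or from the shell: php -m | grep curl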

Blackhawk2165 commented 9 years ago

Do I need to put the Elasticsearch directory name in the autoload_namespaces.php file that comes with Composer, or will that take care of itself once I install php5-curl and run composer.phar?

Blackhawk2165 commented 9 years ago

Okay, so I got php5-curl installed; there was some weird error with the version of php-common that I had downloaded. Anyway, I am still getting the same error.

PHP Fatal error: Class 'Elasticsearch' not found in /home/aharmon/php-files/newindex.php on line 5

line 5 is this:

$client = new Elasticsearch/Client();

When I ran composer.phar, it showed the composer.json file with the Elasticsearch information being processed into the autoloader.

I shouldn't need to re-install Elasticsearch and everything now that I have php5-curl installed, right?

Blackhawk2165 commented 9 years ago

So I've done some digging through the directories now that I got Composer to install. For my composer.json file, do I need the whole path to Client.php?

so instead of:

require: elasticsearch/elasticsearch

should I have: /vendor/elasticsearch/elasticsearch/src/Elasticsearch/Client.php

or am I looking into the wrong things?

Blackhawk2165 commented 9 years ago

Hello,

So I got everything working on a new machine. It turns out the machine I was working on had faulty hardware, haha.

Now I am just going through my PHP script to get it to work.

I got a strange error though. I know I've been posting a lot of code questions, but this error has something to do with the files that Composer installed.

Here is the error:

PHP Fatal error: Uncaught exception 'Guzzle\Http\Exception\ServerErrorResponseException' with message 'Server error response [status code] 500 [reason phrase] Internal Server Error [url] http://localhost:9200/_all/_bulk' in /home/aharmon/vendor/guzzle/http/Guzzle/Http/Exception/BadResponseException.php:43
Stack trace:
#0 /home/aharmon/vendor/guzzle/http/Guzzle/Http/Message/Request.php(145): Guzzle\Http\Exception\BadResponseException::factory(Object(Guzzle\Http\Message\EntityEnclosingRequest), Object(Guzzle\Http\Message\Response))
#1 [internal function]: Guzzle\Http\Message\Request::onRequestError(Object(Guzzle\Common\Event), 'request.error', Object(Symfony\Component\EventDispatcher\EventDispatcher))
#2 /home/aharmon/vendor/symfony/event-dispatcher/Symfony/Component/EventDispatcher/EventDispatcher.php(164): call_user_func(Array, Object(Guzzle\Common\Event), 'request.error', Object(Symfony\Component\EventDispatcher\EventDispatcher))
#3 /home/aharmon/vendor/symfony/event-dispatcher/Symfony/Component/EventDispatcher/EventDispatch in /home/aharmon/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Connections/GuzzleConnection.php on line 238

Let me know if you have seen this before. I don't expect you to go out and find the answer for me; I am just wondering if you have seen this before.

Thank you, Austin

Blackhawk2165 commented 9 years ago

#3: line 238 is a throw for status code 500; the line reads:

throw new \Elasticsearch\Common\Exceptions\ServerErrorResponseException($responseBody, $statusCode, $exception);

#2: line 165 is in a foreach loop and I believe is a function call; the line reads:

call_user_func($listener, $event, $eventName, $this);

the foreach loop reads: foreach ($listeners as $listener) {

Thank you

Blackhawk2165 commented 9 years ago

So I just wanted to post an update so that when you look at this you can see one more thing I've tried. I have downloaded and installed Apache2. Since I was getting a status code 500 error, I thought it might be because I didn't have it set up as a server yet. If that is the answer, then I haven't set the right configuration or something, because I'm still getting that error.

Blackhawk2165 commented 9 years ago

Hello,

I've been trying to figure out why these issues have been happening and haven't gotten anywhere. I tried using some of the syntax from Elasticsearch's site and I received the same errors. Is this an issue that you have seen before?

jrcastillo commented 7 years ago

@polyfractal sorry to comment after this is closed. I'm new to Elasticsearch (ES) and I have trouble understanding bulk indexing. I'm trying to migrate data from a MySQL database to ES using the PHP bulk functions. When I run the PHP code, this error appears in the ES logs:

[2017-03-14T11:07:16,464][DEBUG][o.e.a.b.TransportShardBulkAction] [fEYHv6b] [cooling_loads][0] failed to execute bulk item (index) index {[cooling_loads][cooling_loads][578fac4be138234254d30d25], source[_na_]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse
    at org.elasticsearch.index.mapper.DocumentParser.wrapInMapperParsingException(DocumentParser.java:175) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:69) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:275) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:533) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:510) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:196) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:201) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:348) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.index(TransportShardBulkAction.java:155) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.handleItem(TransportShardBulkAction.java:134) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.onPrimaryShard(TransportShardBulkAction.java:120) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.onPrimaryShard(TransportShardBulkAction.java:73) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.support.replication.TransportWriteAction.shardOperationOnPrimary(TransportWriteAction.java:76) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.support.replication.TransportWriteAction.shardOperationOnPrimary(TransportWriteAction.java:49) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:914) [elasticsearch-5.2.2.jar:5.2.2]

The code being used to bulk in the data is the following:

$client = setESClient();

  $tableRow = getTableData($conn, $tablename);
  $params = [
      'index' => $tablename,
      'type' => $tablename,
      'body' => []
    ];

  $counter = 0;

  foreach($tableRow as $row) {
    $params['body'][] = [
      'index' => [
        '_index' => $tablename,
        '_type' => $tablename,
        '_id' => $row['id']
      ]
    ];

    $params['body'][] = array($row);

    $counter = $counter + 1;

    if($counter == 10){
      $response = $client->bulk($params);
      printf('Processed bulk');
      $counter = 0;
      $params = ['body' => []];
      unset($response);
    }
  }

  $response = $client->bulk($params);
  return($response);

Also, I must say that I have not defined any mapping for the incoming data. Should I define these mappings? As the code above shows, I create the index with the table name while I finish understanding how to use ES.

Should I open a new issue?

polyfractal commented 7 years ago

@jrcastillo If I had to guess, I'd say that one of your $row values is empty or null, which means the body for that document is null and throws off the parsing.

Definitely open a new issue if you need more help; I tend not to see comments on closed issues as easily due to how my notifications are structured. Sorry for the delay in answering!
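
As a rough (untested) sketch, a guard in the loop would look something like this. One other thing worth double-checking: array($row) wraps the row in an extra array, so the document body gets nested under a 0 key instead of your column names; passing $row directly is probably what you want.

foreach ($tableRow as $row) {
    // Skip empty/null rows so no bulk item ends up with an empty body
    if (empty($row)) {
        continue;
    }

    $params['body'][] = [
        'index' => [
            '_index' => $tablename,
            '_type'  => $tablename,
            '_id'    => $row['id']
        ]
    ];

    // The row itself, not array($row)
    $params['body'][] = $row;
}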

mishakansal commented 5 years ago

Hi, is there anyone who can help me? I have a .txt file containing JSON data. I need to loop through it in PHP and bulk index the data into Elasticsearch. I'm new to Elasticsearch.

cyrrill commented 3 years ago

 // Every 1000 documents stop and send the bulk request
 if ($i % 1000) {

This is totally incorrect! That will actually send a bulk request on every single iteration, except when $i % 1000 == 0.

What you really want is:

 if ($i % 1000 === 0) {

mohamedhafezqo commented 3 years ago
$params = ['body' => []];

for ($i = 1; $i <= 1234567; $i++) {
    $params['body'][] = [
        'index' => [
            '_index' => 'my_index',
            '_id'    => $i
        ]
    ];

    $params['body'][] = [
        'my_field'     => 'my_value',
        'second_field' => 'some more values'
    ];

    // Every 1000 documents stop and send the bulk request
    if ($i % 1000 == 0) {
        $responses = $client->bulk($params);

        // erase the old bulk request
        $params = ['body' => []];

        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}

// Send the last batch if it exists
if (!empty($params['body'])) {
    $responses = $client->bulk($params);
}

Check the docs.