Closed: Blackhawk2165 closed this issue 7 years ago.
You've mixed index/mapping creation with bulk indexing. That syntax won't work: first you need to create the index with its mappings, then start bulk indexing into the newly created index.
Untested code, but something like this is closer to what you're looking for:
require 'vendor/autoload.php';
$client = new Elasticsearch\Client();
// Create the index and mappings
$mapping['index'] = 'rvuehistoricaldocuments2009-2013'; //mapping code
$mapping['body'] = array(
    'mappings' => array(
        'documents' => array(
            '_source' => array(
                'enabled' => true
            ),
            'properties' => array(
                'doc_name' => array(
                    'type' => 'string',
                    'analyzer' => 'standard'
                ),
                'description' => array(
                    'type' => 'string'
                )
            )
        )
    )
);
$client->indices()->create($mapping);
// Now index the documents
for ($i = 0; $i <= 10000; $i++) {
    // Action metadata line for this document
    $params['body'][] = array(
        'index' => array(
            '_index' => 'rvuehistoricaldocuments2009-2013',
            '_type'  => 'documents',
            '_id'    => $i
        )
    );
    // The document body goes on its own line in the bulk body
    $params['body'][] = array(
        'foo' => 'bar' // Document body goes here
    );

    // Every 1000 documents stop and send the bulk request
    if ($i % 1000) {
        $responses = $client->bulk($params);
        // erase the old bulk request
        $params = array();
        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}
Okay, that makes more sense. I found a tutorial on indexing and then integrated the mapping, so I must have put it together wrong. I have one more question for now: how does Elasticsearch know where the directories are that have the files? I have some data saved on the machine Elasticsearch is running on, so I was going to index those documents first and then index documents on a large multi-terabyte external drive. Where do I specify where to look for the documents?
Thanks for your help, Austin Harmon
No problem, happy to help :)
How does Elasticsearch know where the directories are that have the files?
Elasticsearch doesn't know where the files are at all. You have to write the "plumbing" code that imports data into Elasticsearch. So you'll just need to tell your PHP script where to load the data from (e.g. file_get_contents() or similar), then start constructing bulk requests.
ES doesn't have any "import" functionality or anything. It all has to be inserted via your code.
So before I create the index and mapping, I need to put that function (or something similar) with the file path in order to index properly? Also, I have a directory that contains a directory that contains a bunch of directories and then files. Should I put the path to each directory with files in it individually, or can I just stop at the directory that holds all the other directories?
Also, I wanted to ask you about JSON encoding: I noticed you got rid of the json_encode() function. Why did you delete that?
You'll have to load all the data yourself, which would include recursively walking directories to find files if that's how your data is stored. E.g. open a file, then parse it into an array and index it:
$documentData = file_get_contents('/path/to/document/data.json');
// If the data is json, you can decode it
$documentData = json_decode($documentData);
// Or if the data was a csv, maybe split by line
//$documentData = explode("\n", $documentData);
// etc etc. Depends on what format your input data is
// Now index the documents
for ($i = 0; $i < count($documentData); $i++) {
    // Action metadata line for this document
    $params['body'][] = array(
        'index' => array(
            '_index' => 'rvuehistoricaldocuments2009-2013',
            '_type'  => 'documents',
            '_id'    => $i
        )
    );
    // The document body goes on its own line in the bulk body
    $params['body'][] = $documentData[$i];

    // Every 1000 documents stop and send the bulk request
    if ($i % 1000) {
        $responses = $client->bulk($params);
        // erase the old bulk request
        $params = array();
        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}
Also I wanted to ask you about JSON encoding I noticed you got rid of the json_encode() function. Why did you delete that?
The client will automatically serialize PHP arrays into valid JSON for you. So you just need to provide a PHP array of the data you want to index.
Some API endpoints, like the Bulk API, use a special JSON syntax, which the client also handles for you.
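To make that concrete, here is a sketch (with made-up field values, no client required) of what the client does under the hood for the Bulk API: each PHP array in the body becomes one JSON object, and the objects are joined by newlines (NDJSON):

```php
<?php
// Sketch of the Bulk API's newline-delimited JSON (NDJSON) format:
// each entry in the PHP body array is serialized to one JSON line.
$body = array(
    array('index' => array(
        '_index' => 'rvuehistoricaldocuments2009-2013',
        '_type'  => 'documents',
        '_id'    => 1
    )),
    array('doc_name' => 'report.pdf', 'description' => 'a sample document'),
);

$ndjson = '';
foreach ($body as $line) {
    $ndjson .= json_encode($line) . "\n"; // one JSON object per line
}

echo $ndjson;
```

That is the wire format the client builds for you; for plain index/get requests it simply json_encode()s the whole body.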
So I have a bunch of documents in all different formats (.docx, .csv, .ppt, .pdf, etc.). Do I have to get them all into one format, or can I just put them into an array and index?
So can I index an entire directory as long as there are only files in it, or do I have to index each document separately?
Another question for you: can the file names have spaces in them? I know that if I want to cd into a directory on Linux they can't, but when indexing them and putting them in the file path in the PHP script, does that matter, or do I need to eliminate all spaces?
So I have a bunch of documents in all different formats like .docx, .csv, .ppt, .pdf, etc. Do I have to get them all in one format or can I just put them into an array and index? So can I index an entire directory as long as there are only files in it? or do I have to index each document separately?
Elasticsearch has no concept of files, directories, folders, disk drives, etc.
Elasticsearch only understands JSON. So you will need to load those documents and somehow parse/transform them into JSON documents containing simple field: value pairs. It's just like with a database: you can't insert a .docx into MySQL; you have to insert a row which contains columns and values. With MySQL, you would first transform your data into some kind of row representation before inserting.
So with ES, you have to load/transform your data and insert it as JSON.
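For instance, a single CSV line could be turned into a field => value array like this (the column names and values here are hypothetical, and a real CSV with quoted fields would need a proper parser):

```php
<?php
// Sketch: transform one hypothetical CSV line into a field => value
// array that the client can later serialize to JSON for indexing.
$csvLine = '42,Annual Report,finance';
list($id, $title, $category) = explode(',', $csvLine);

$doc = array(
    'doc_name'    => $title,
    'description' => 'category: ' . $category,
);
```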
There is an attachment plugin for Elasticsearch which might help you, although it can be limiting at times.
I'd recommend sitting down with the Elasticsearch: The Definitive Guide book and getting to know ES a little better. It sounds like there are some fundamental concepts you should learn before moving forward; the whole experience will be a lot better if you have solid fundamentals about how ES operates.
So after doing a lot of reading and studying up: if I only want one node with one shard, and only one index (rvuehistoricaldocuments2009-2013) with one type (documents), then would it be easier to put all the files I want to index under one directory? Since they will all be the same type, I can just run a loop with auto-generated IDs and index the documents that way.
Yeah, I think that would probably be the simplest way to do it. Otherwise you'll have to mess around with recursive directory scanning, which isn't the most pleasant thing to do in PHP :)
How does this look:
<?php
require 'vendor/autoload.php';
$client = new Elasticsearch/Client();
//Create the index and mappings
$mapping['index'] = 'rvuehistoricaldocuments2009-2013'; //mapping code
$mapping['body'] = array(
    'mappings' => array(
        'documents' => array(
            '_source' => array(
                'enabled' => true
            ),
            'properties' => array(
                'doc_name' => array(
                    'type' => 'string',
                    'analyzer' => 'standard'
                ),
                'description' => array(
                    'type' => 'string'
                )
            )
        )
    )
);
$client->indices()->create($mapping);
$documentData = file_get_contents('~/elkdata/for_elk_test_2014_11_24/Documents');
//Now index the documents
for ($i = 0; $i <= count($documentData); $i++) {
    $params['body'][] = array(
        'index' => array(
            'type' => 'documents',
            'body' => array(
                'foo' => 'bar' //Document body goes here
            )
        )
    );
//Every 1000 documents stop and send the bulk request.
    if ($i % 1000) {
$responses = $client->bulk($params);
// erase the old bulk request
$params = array();
// unset the bulk response when you are done to save memory
unset($responses);
}
} ?>
I didn't do any json_encode() because when I go to index the documents I am putting them in an array, which puts them in JSON format for me, correct?
Thanks again for helping me out. I'm new to PHP and Elasticsearch, so you have been a tremendous help!
So it turns out that there is way too much data to just shove it all into one directory. I thought it would be a simple solution, but I got denied :( Anyway, I've heard that these tasks are easier to do in Perl. Can I use Perl to index everything and then use the PHP client to write the rest, or does mixing clients not work?
Have you seen this function before? http://php.net/manual/en/function.scandir.php
Can I use perl to index everything and then the php client to write the rest or does mixing clients not work?
Sure. There is also an Elasticsearch client for Perl. The syntax is a little different (more perl-y) but should be relatively similar. You could load up all the docs into Elasticsearch using the Perl client, then use PHP to run queries, etc. There are also clients for Java, .NET, Groovy, Ruby, Python and JavaScript, depending on your preference :)
At the end of the day, all the clients are just building HTTP requests to send to the server, so they are all interchangeable really.
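To illustrate that point, here is a sketch of a bulk call as a raw HTTP POST using PHP streams, with no client library at all. It assumes a node on localhost:9200, so the actual request line is left commented out:

```php
<?php
// Sketch: a bulk request is just an HTTP POST of newline-delimited JSON.
// Any client, in any language, ultimately builds something like this.
$ndjson = json_encode(array('index' => array('_index' => 'my_index', '_id' => 1))) . "\n"
        . json_encode(array('field' => 'value')) . "\n";

$context = stream_context_create(array('http' => array(
    'method'  => 'POST',
    'header'  => "Content-Type: application/x-ndjson\r\n",
    'content' => $ndjson,
)));

// Uncomment when a node is actually running on localhost:9200:
// $response = file_get_contents('http://localhost:9200/_bulk', false, $context);
```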
Have you seen this function before? http://php.net/manual/en/function.scandir.php
Yep, scandir is one way to do a directory search. You could also use dir/openDir or globs (discussed here). I would probably use a RecursiveIteratorIterator
approach, as detailed here:
http://stackoverflow.com/a/14305746 and http://stackoverflow.com/a/2398163
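A minimal sketch of that approach, wrapped in a helper that collects regular files (not directories) under a root; the function name and usage are just examples:

```php
<?php
// Sketch: recursively collect the paths of all regular files under $root.
function collectFiles($root) {
    $files = array();
    $iter = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, RecursiveDirectoryIterator::SKIP_DOTS),
        RecursiveIteratorIterator::LEAVES_ONLY
    );
    foreach ($iter as $fileInfo) {
        if ($fileInfo->isFile()) {
            $files[] = $fileInfo->getPathname();
        }
    }
    return $files;
}
```

Each returned path can then be passed to file_get_contents() inside the indexing loop.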
You are awesome, man. Thanks for everything; you have been a huge help. I will let you know if I have any more issues.
Hello,
I've added the recursive code into my file. Should I keep all of this in one file, or should I split it up?
Here is my code:
<?php
require 'vendor/autoload.php';
$client = new Elasticsearch/Client();
$root = realpath('~/elkdata/for_elk_test_2014_11_24/Agencies');
$iter = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, RecursiveDirectoryIterator::SKIP_DOTS),
    RecursiveIteratorIterator::SELF_FIRST,
    RecursiveIteratorIterator::CATCH_GET_CHILD
);
$paths = array($root);
foreach ($iter as $path => $dir) {
    if ($dir->isDir()) {
        $paths[] = $path;
    }
}
//Create the index and mappings
$mapping['index'] = 'rvuehistoricaldocuments2009-2013'; //mapping code
$mapping['body'] = array(
    'mappings' => array(
        'documents' => array(
            '_source' => array(
                'enabled' => true
            ),
            'properties' => array(
                'doc_name' => array(
                    'type' => 'string',
                    'analyzer' => 'standard'
                ),
                'description' => array(
                    'type' => 'string'
                )
            )
        )
    )
);
$client->indices()->create($mapping);
//Now index the documents
for ($i = 0; $i <= count($paths); $i++) {
    $params['body'][] = array(
        'index' => array(
            'type' => 'documents',
            'body' => array(
                'foo' => 'bar' //Document body goes here
            )
        )
    );
//Every 1000 documents stop and send the bulk request.
    if ($i % 1000) {
$responses = $client->bulk($params);
// erase the old bulk request
$params = array();
// unset the bulk response when you are done to save memory
unset($responses);
}
} ?>
Sorry I keep just posting my code in the text editor. I don't have an option for code formatting.
Thanks again, Austin Harmon
Hello,
I ran into an issue with the Elasticsearch/Client() class not being found. The error that comes up is:
PHP Fatal error: Class 'Elasticsearch' not found in /home/aharmon/php-files/newindex.php on line 5
My line 5 looks like this: $client = new Elasticsearch/Client();
Do I need another require or include so that the class is recognized?
Yep, you need to include Composer's autoloader (which then loads the various classes on-demand):
require 'vendor/autoload.php';
$client = new Elasticsearch/Client();
That's what I have:
1 <?php
2
3 require '/home/aharmon/vendor/autoload.php';
4
5 $client = new Elasticsearch/Client();
Then I go into the recursive code.
I've been looking up possible solutions and I see that some people are having this issue when APC is enabled. Is this relevant?
Okay, so I think my issue is that when I run composer.phar I get an error.
I set up my composer.json file, which looks like this:
{
    "require": {
        "elasticsearch/elasticseearch": "1.3.2"
    }
}
I originally had ~1.0, but then it was just giving me errors for every version of Elasticsearch, so I narrowed it down to the version I downloaded.
The problem that occurs is:
Is the problem that I am missing the curl extension for PHP?
Ah, so it probably didn't install ES-PHP at all then, since it couldn't satisfy the curl extension requirement.
Yeah, the php libcurl extension is required for the client to work. Curl is the HTTP transport that PHP uses to send requests to the server. You'll need to make sure your php installation has the extension installed (either compiled in, or loaded as a dynamic extension).
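You can check this from PHP itself before debugging further; a quick sketch:

```php
<?php
// Quick check: is the curl extension loaded into this PHP installation?
if (extension_loaded('curl')) {
    echo "curl extension is available\n";
} else {
    echo "curl extension is missing - install or enable php-curl\n";
}
```

From the command line, `php -m | grep curl` answers the same question.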
Do I need to put the Elasticsearch directory name in the autoload_namespaces.php file that comes with Composer, or will that take care of itself once I install php5-curl and run composer.phar?
Okay, so I got php5-curl installed; there was some weird error with the version of php-common I had downloaded. Anyway, I am still getting the same error.
PHP Fatal error: Class 'Elasticsearch' not found in /home/aharmon/php-files/newindex.php on line 5
line 5 is this:
$client = new Elasticsearch/Client();
When I ran composer.phar, it showed the composer.json file with the Elasticsearch information in it being processed into the autoload.
I shouldn't need to re-install Elasticsearch and everything now that I have php5-curl installed, right?
So I've done some digging into the directories now that I got Composer to install. For my composer.json file, do I need the whole path to Client.php?
so instead of:
require: elasticsearch/elasticsearch
should I have: /vendor/elasticsearch/elasticsearch/src/Elasticsearch/Client.php
Or am I looking at the wrong thing?
Hello,
So I got everything working on a new machine. It turns out the machine I was working on had faulty hardware, haha.
Now I am just going through my php script to get it to work.
I got a strange error, though. I know I've been posting a lot of code questions, but this error has something to do with the files that Composer installed.
Here is the error: PHP Fatal error: Uncaught exception 'Guzzle\Http\Exception\ServerErrorResponseException' with message 'Server error response [status code] 500 [reason phrase] Internal Server Error [url] http://localhost:9200/_all/_bulk' in /home/aharmon/vendor/guzzle/http/Guzzle/Http/Exception/BadResponseException.php:43 Stack trace:
Let me know if you have seen this before. I don't expect you to go out and find the answer for me; I'm just wondering whether you've come across it.
Thank you, Austin
throw new \Elasticsearch\Common\Exceptions\ServerErrorResponseException($responseBody, $statusCode, $exception);
call_user_func($listener, $event, $eventName, $this);
the foreach loop reads: foreach ($listener as $listener) {
Thank you.
So I just wanted to post an update so that when you look at this you can see one more thing I've tried. I have downloaded and installed Apache2. I thought that since I was getting a status code 500 error, it might have something to do with not having it set up as a server yet. If that is the answer, then I haven't set the right configuration or something, because I'm still getting that error.
Hello,
I've been trying to figure out why these issues have been happening and haven't gotten anywhere. I tried using some of the syntax you have on Elasticsearch's site and I received the same errors. Is this an issue you have seen before?
@polyfractal Sorry to comment after this is closed. I'm new to Elasticsearch (ES) and I have trouble understanding bulk indexing. I'm trying to migrate data from a MySQL database to ES using the PHP bulk functions. When I run the PHP code, this error appears in the ES logs:
[2017-03-14T11:07:16,464][DEBUG][o.e.a.b.TransportShardBulkAction] [fEYHv6b] [cooling_loads][0] failed to execute bulk item (index) index {[cooling_loads][cooling_loads][578fac4be138234254d30d25], source[_na_]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse
at org.elasticsearch.index.mapper.DocumentParser.wrapInMapperParsingException(DocumentParser.java:175) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:69) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:275) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:533) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:510) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.index.TransportIndexAction.prepareIndexOperationOnPrimary(TransportIndexAction.java:196) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:201) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:348) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.index(TransportShardBulkAction.java:155) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.handleItem(TransportShardBulkAction.java:134) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.onPrimaryShard(TransportShardBulkAction.java:120) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.onPrimaryShard(TransportShardBulkAction.java:73) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.replication.TransportWriteAction.shardOperationOnPrimary(TransportWriteAction.java:76) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.replication.TransportWriteAction.shardOperationOnPrimary(TransportWriteAction.java:49) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:914) [elasticsearch-5.2.2.jar:5.2.2]
The code being used to bulk in the data is the following:
$client = setESClient();
$tableRow = getTableData($conn, $tablename);
$params = [
    'index' => $tablename,
    'type' => $tablename,
    'body' => []
];
$counter = 0;
foreach ($tableRow as $row) {
    $params['body'][] = [
        'index' => [
            '_index' => $tablename,
            '_type' => $tablename,
            '_id' => $row['id']
        ]
    ];
    $params['body'][] = array($row);
    $counter = $counter + 1;
    if ($counter == 10) {
        $response = $client->bulk($params);
        printf('Processed bulk');
        $counter = 0;
        $params = ['body' => []];
        unset($response);
    }
}
$response = $client->bulk($params);
return($response);
Also, I must say that I have not defined any mappings for the incoming data. Should I define these mappings? As the code above shows, I create the index with the table name while I finish learning how to use ES.
Should I open a new issue?
@jrcastillo If I had to guess, I'd say that one of your $row
values is empty or null, which means the body for that document is null and throws off the parsing.
Definitely open a new issue if you need more help, I tend to not see comments on closed issues as easily due to how my notifications are structured. Sorry for the delay in answering!
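Following that guess, one defensive sketch (hypothetical rows and table name, and assuming empty rows are safe to skip) is to filter before building the bulk body, and to append the row itself rather than wrapping it in an extra array:

```php
<?php
// Sketch: skip unusable rows before adding them to the bulk body, and
// add each document as a flat field => value array (no extra wrapper).
$tablename = 'cooling_loads'; // example table name
$tableRow = array(            // hypothetical rows; one is null on purpose
    array('id' => 1, 'load' => 3.5),
    null,
    array('id' => 2, 'load' => 4.1),
);

$params = array('body' => array());
foreach ($tableRow as $row) {
    if (empty($row) || !is_array($row)) {
        continue; // a null/empty body would make the bulk item fail to parse
    }
    $params['body'][] = array('index' => array(
        '_index' => $tablename,
        '_type'  => $tablename,
        '_id'    => $row['id'],
    ));
    $params['body'][] = $row; // the document itself, not array($row)
}
```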
Hi, is there anyone who can help me? I have a .txt file containing JSON data. I need to loop through it in PHP and bulk index the data into my Elasticsearch. I'm new to Elasticsearch.
// Every 1000 documents stop and send the bulk request
if ($i % 1000) {
This is totally incorrect! That will actually launch a bulk command on every single iteration, except when $i is an exact multiple of 1000.
What you really want is:
if ($i % 1000 === 0) {
$params = ['body' => []];

for ($i = 1; $i <= 1234567; $i++) {
    $params['body'][] = [
        'index' => [
            '_index' => 'my_index',
            '_id' => $i
        ]
    ];
    $params['body'][] = [
        'my_field' => 'my_value',
        'second_field' => 'some more values'
    ];

    // Every 1000 documents stop and send the bulk request
    if ($i % 1000 == 0) {
        $responses = $client->bulk($params);
        // erase the old bulk request
        $params = ['body' => []];
        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}

// Send the last batch if it exists
if (!empty($params['body'])) {
    $responses = $client->bulk($params);
}
Hello,
My name is Austin Harmon and I am new to Elasticsearch. I am looking to index a couple hundred thousand documents with Elasticsearch, and I would like to use the PHP client to do it. I have my index set up with one shard and one replica since I have a smaller number of documents. I have looked over all the syntax on the Elasticsearch site and GitHub. This is what my index.php file looks like:
<?php
require 'vendor/autoload.php';

$client = new Elasticsearch\Client();

$indexParams['index'] = 'rvuehistoricaldocuments2009-2013'; //mapping code
?>
I'm not sure if I have everything I need or if I'm doing this right, so if you could let me know whether this looks correct, that would be very helpful.
Thank you, Austin Harmon