aws / aws-sdk-php

Official repository of the AWS SDK for PHP (@awsforphp)
http://aws.amazon.com/sdkforphp
Apache License 2.0

Support for adding documents in CloudSearch #196

Closed greggilbert closed 10 years ago

greggilbert commented 10 years ago

I can't tell if I've completely missed it in the documentation, but it doesn't seem like it's possible to use the SDK to add documents. boto has support for this, but none of the other SDKs do. Can this be added?

skyzyx commented 10 years ago

What's a document?

greggilbert commented 10 years ago

It's the basic data block in CloudSearch. Basically I think the SDK should support being able to send in data.

skyzyx commented 10 years ago

Ah, CloudSearch. You didn't lead with that. :)

skyzyx commented 10 years ago

:+1:

jeremeamia commented 10 years ago

The SDK does not fully support the CloudSearch document or search APIs. There are some technical reasons why we don't at the moment, but I'd like to see us support it more fully as well. I'll mark this as a feature request and bring it up with the team.

However, you should know that the document API is very simple. It requires only a basic HTTP client, like Guzzle, which is included with the SDK. Requests to your document endpoint do not need to be signed, since they run on your own AWS infrastructure provisioned by CloudSearch, and are restricted by IP.

Take a look at the following links to help you get started:

Using this information, I came up with a short code example of how you can upload documents. Please keep in mind that I have not tested this, so it may need a little tweaking. If you cannot get it to work, please let me know and I will look into it further.

// Set up a CloudSearch document client
$endpoint = 'http://doc-{domain_name}-{domain_id}.us-east-1.cloudsearch.amazonaws.com/{api_version}';
$client = new \Guzzle\Http\Client($endpoint, array(
    'domain_name' => 'YOUR_CLOUDSEARCH_DOMAIN_NAME',
    'domain_id'   => 'YOUR_CLOUDSEARCH_DOMAIN_ID',
    'api_version' => '2011-02-01',
));

// Upload documents
$request = $client->post('documents/batch');
$request->setBody('[{"your":"documents"},{"and":"data"}]', 'application/json');
// OR: $request->setBody(fopen('/path/to/your/documents', 'r'), 'application/json');
$result = $request->send()->json();

And, of course, you will want to make sure that your documents are prepared in the correct format for CloudSearch.

skyzyx commented 10 years ago

Yeah, I have some experience here. Making the requests is easy. Putting together an .sdf document for the content you want to index could certainly use some helpers. For example, the public documentation explains that since the service uses XML serialization on the backend, even if you create a JSON-based .sdf file, its contents are still required to follow the (stricter) XML serialization rules. I learned that one the hard way.
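
As a rough illustration of the kind of helper described above, here is a minimal sketch that builds a JSON batch and strips characters that are not legal in XML 1.0 before encoding. The function names (`stripInvalidXmlChars`, `cleanFieldValue`, `sdfAdd`, `sdfDelete`) are hypothetical, and the operation layout (`type`, `id`, `version`, `lang`, `fields`) is an assumption based on the 2011-02-01 document format, so check it against the CloudSearch documentation for your API version.

```php
<?php
// Hypothetical helpers for illustration only; none of these are part of the SDK.

// Remove characters outside the XML 1.0 "Char" production.
// Assumes $value is valid UTF-8 (preg_replace returns null otherwise).
function stripInvalidXmlChars($value)
{
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );
}

// Clean a single field value: strings are stripped, arrays are cleaned
// element by element, everything else (ints, floats) passes through.
function cleanFieldValue($value)
{
    if (is_array($value)) {
        return array_values(array_map('cleanFieldValue', $value));
    }
    return is_string($value) ? stripInvalidXmlChars($value) : $value;
}

// Build an "add" operation (assumed 2011-02-01 layout).
function sdfAdd($id, $version, array $fields, $lang = 'en')
{
    return array(
        'type'    => 'add',
        'id'      => $id,
        'version' => $version,
        'lang'    => $lang,
        'fields'  => array_map('cleanFieldValue', $fields),
    );
}

// Build a "delete" operation (assumed 2011-02-01 layout).
function sdfDelete($id, $version)
{
    return array('type' => 'delete', 'id' => $id, 'version' => $version);
}

// Example batch: the vertical tab (\x0B) in the title is not legal XML and is removed.
$batch = json_encode(array(
    sdfAdd('post_1', 1, array('title' => "Hello\x0B world", 'tags' => array('php', 'aws'))),
    sdfDelete('post_0', 2),
));
```

The resulting `$batch` string could then be passed to `$request->setBody($batch, 'application/json')` as in the earlier example.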

The old-school CloudSearch CLI Tools (as opposed to the new Unified AWS CLI Tools) have some nice convenience actions for generating .sdf indexes from a set of files you want to index. That may be helpful for some people. I ended up building .sdf creation directly into my View layer so that I could automate the content indexing more easily.

There are a couple of other areas where the CloudSearch search API could be smoothed over. For example, getting the total number of search results while focusing on a single facet requires two requests. This can be batched using Guzzle directly, but having a PHP interface to smooth out the rough edges and add some convenience would be a big benefit for end-users.
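
A minimal sketch of the batching idea, assuming Guzzle 3 (the version bundled with SDK 2.x), where passing an array of requests to `$client->send()` sends them in parallel. The helper names `searchPath` and `fetchHitCounts` are hypothetical, and the `hits.found` field in the response is an assumption about the 2011-02-01 search response.

```php
<?php
// Hypothetical helpers for illustration only; not part of the SDK.

// Build a relative search path with an encoded query string.
function searchPath(array $params)
{
    return 'search?' . http_build_query($params);
}

// $client is assumed to be a \Guzzle\Http\Client (Guzzle 3) whose base URL
// points at your search endpoint, including the API version.
// Sends one request per parameter set in parallel and returns the
// "hits.found" count from each response, in the same order.
function fetchHitCounts($client, array $paramSets)
{
    $requests = array();
    foreach ($paramSets as $params) {
        $requests[] = $client->get(searchPath($params));
    }

    $counts = array();
    foreach ($client->send($requests) as $response) {
        $data = $response->json();
        $counts[] = isset($data['hits']['found']) ? $data['hits']['found'] : null;
    }
    return $counts;
}

// Possible usage (parameter names and values are assumptions for the 2011-02-01 API):
//
// list($total, $inFacet) = fetchHitCounts($client, array(
//     array('q' => 'star wars', 'size' => 0),
//     array('q' => 'star wars', 'bq' => "genre:'drama'", 'size' => 0),
// ));
```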

It's not just making the raw request; it's about maintaining the same programmatic interface that we're already used to, and making it easier to work with the service responses.

mtdowling commented 10 years ago

I'm moving this feature request to our team's internal backlog where we can track it and prioritize it more effectively.

gregholland commented 10 years ago

I'd love to see CloudSearch document and search supported by the SDK. Although it's easy enough to roll your own using Guzzle, some helpers for building up queries and a nicer way to handle errors would be great. (Guzzle returns a 400 Bad Request without the actual error from CS, which makes troubleshooting a pain.)
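
One possible workaround for the missing error detail, sketched under assumptions: in Guzzle 3 a 4xx response raises a `BadResponseException` whose response body can still be read. The helper name `cloudSearchErrorMessages` is hypothetical, and the error body shape (`{"status":"error","errors":[{"message":"..."}]}`) is an assumption about what the document endpoint returns.

```php
<?php
// Hypothetical helper for illustration only; not part of the SDK or Guzzle.

// Extract error messages from a CloudSearch error body.
// Falls back to the raw body when it is not in the assumed JSON shape.
function cloudSearchErrorMessages($body)
{
    $data = json_decode($body, true);
    $messages = array();

    if (is_array($data) && isset($data['errors']) && is_array($data['errors'])) {
        foreach ($data['errors'] as $error) {
            if (isset($error['message'])) {
                $messages[] = $error['message'];
            }
        }
    }

    return $messages ? $messages : array(trim($body));
}

// Possible usage with a Guzzle 3 request object:
//
// try {
//     $result = $request->send()->json();
// } catch (\Guzzle\Http\Exception\BadResponseException $e) {
//     $body = $e->getResponse()->getBody(true);
//     error_log(implode('; ', cloudSearchErrorMessages($body)));
// }
```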

I'd also love to use my keys to make search/document requests rather than relying on IP restriction policies, but I'm assuming that is more of an issue for the CloudSearch team and is possibly the main reason why there is no search/document functionality in the SDK.

paulstatezny commented 10 years ago

Completely agree with @gregholland. Both features would be incredibly helpful.

jeremeamia commented 10 years ago

@greggilbert @gregholland @paulstatezny @skyzyx The SDK has support for searching and uploading documents via the new CloudSearchDomainClient as of version 2.6.9.
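
For readers arriving here later, a minimal sketch of how the new client might be used. The wrapper functions `uploadBatch` and `countHits` are hypothetical. The `uploadDocuments` and `search` operations and their parameters (`documents`, `contentType`, `query`), the `base_url` factory option, and the `hits.found` result path are written from memory of the SDK 2.x documentation, so treat them as assumptions and verify them against the API reference for your SDK version.

```php
<?php
// Hypothetical wrappers for illustration only.
// $client is assumed to be an Aws\CloudSearchDomain\CloudSearchDomainClient.

// Upload a batch of operations, e.g.
// array(array('type' => 'add', 'id' => 'post_1', 'fields' => array('title' => 'Hello')))
function uploadBatch($client, array $operations)
{
    return $client->uploadDocuments(array(
        'documents'   => json_encode($operations),
        'contentType' => 'application/json',
    ));
}

// Run a search and return the number of matching documents.
function countHits($client, $query)
{
    $result = $client->search(array('query' => $query));
    return $result['hits']['found'];
}

// Possible usage (requires the SDK's autoloader; the endpoint is a placeholder):
//
// require 'vendor/autoload.php';
// $client = \Aws\CloudSearchDomain\CloudSearchDomainClient::factory(array(
//     'base_url' => 'YOUR_CLOUDSEARCH_DOMAIN_ENDPOINT',
// ));
// uploadBatch($client, array(
//     array('type' => 'add', 'id' => 'post_1', 'fields' => array('title' => 'Hello')),
// ));
// echo countHits($client, 'hello');
```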

gregholland commented 10 years ago

Good stuff, thanks! Do you know if the CS team plan on implementing IAM for the search and document endpoints?

jeremeamia commented 10 years ago

I have no idea. That would be a question for the CloudSearch forum. :smile:

greggilbert commented 10 years ago

Ha, that's awesome. Thanks!