bitemyapp / bloodhound

Haskell Elasticsearch client and query DSL
bitemyapp.com
BSD 3-Clause "New" or "Revised" License
424 stars 118 forks source link

Index new documents using ElasticSearch POST API #107

Open maaronking opened 8 years ago

maaronking commented 8 years ago

indexDocument currently requires that the user provide a valid DocId. However, ElasticSearch provides a facility to auto-generate document ID's using the POST request described on this page.

It would be nice if indexDocument perhaps took a Maybe DocId. However, being pretty new to Haskell, I haven't quite figured out how to get that to work. In the meantime I just added indexDocument' to my code (really hacky, but works for the time being).

indexDocument' :: (ToJSON doc, MonadBH m) => IndexName -> MappingName
                 -> IndexDocumentSettings -> doc -> m Reply
indexDocument' (IndexName indexName)
  (MappingName mappingName) cfg document =
  bindM2 post url (return body)
  where url = addQuery params <$> joinPath [indexName, mappingName]
        versionCtlParams = case idsVersionControl cfg of
          NoVersionControl -> []
          InternalVersion v -> versionParams v "internal"
          ExternalGT (ExternalDocVersion v) -> versionParams v "external_gt"
          ExternalGTE (ExternalDocVersion v) -> versionParams v "external_gte"
          ForceVersion (ExternalDocVersion v) -> versionParams v "force"
        vt = T.pack . show . docVersionNumber
        versionParams v t = [ ("version", Just $ vt v)
                            , ("version_type", Just t)
                            ]
        parentParams = [] -- case idsParent cfg of
          --Nothing -> []
          --Just (DocumentParent (DocId p)) -> [ ("parent", Just p) ]
        params = versionCtlParams ++ parentParams
        body = Just (encode document)

Obviously there are much better ways to do this. I apologize for posting a request without a reasonable possible solution.

Thanks for your work on this library!

bitemyapp commented 8 years ago

First off, I just want to thank you for documenting this and for writing the code. I do appreciate that a lot, you went to more trouble than most people by far and I love that.

With that said, while it probably seems like an accident, this functionality gap is intentional as I don't think it's a good idea to rely on their document id generation.

You cannot trust Elasticsearch with your data. If you do and you don't use a doc id scheme like UUIDv4 with some other natural key to identify unique data, you're not going to be able to recover from a split brain merge. You'll lose data.

I haven't gone out of my way to prevent people using the ES provided doc ids, but I wasn't interested in facilitating it either.

maaronking commented 8 years ago

@bitemyapp As per your suggestion, I used a haskell UUID library in conjunction with the Bloodhound indexDocument function and got the behavior I was looking for (a much better solution than what I was proposing). Thank you very much for your suggestion.

I am using ES more because it has been imposed upon me, and as such, I admit that I am still learning how to use it properly. I had no idea about the flaw in the ES automatic DocID generation. Thanks for pointing it out. If you think it appropriate, it may be a good idea to point this out in the documentation for the indexDocument function as the solution was, at least in my case, non obvious. I am more than happy to submit a pull request for that if it is helpful.

As this issue was really more about a gap in my own knowledge rather than something to fix in the Bloodhound library, it is probably safe to close the issue. However, I am not sure of the etiquette, should I close the issue or leave that to you?

Thanks again!

bitemyapp commented 8 years ago

@maaronking Put up the PR please :)

Mention this issue and I'll close it when the PR is merged.