CDRH / api

Codenamed "Apium": An API to access all public Center for Digital Research in the Humanities resources
https://cdrhdev1.unl.edu/api_frontend
MIT License

look at special characters, spaces, etc making through system #23

Closed jduss4 closed 6 years ago

jduss4 commented 7 years ago

The api_template should be able to send titles and other values containing special characters or spaces through api_bridge to the API and then on to Elasticsearch. So far this hasn't been tested at all. Write some tests for the repos and give it a try via the UI.

jduss4 commented 6 years ago

This misleading section of their docs makes it look like I should be able to escape a special character, but alas, everything is terrible.

Okay, so locally, in an XML file, I added:

&amp; Jessica!: the coolest

During the XML -> JSON step in the data repo, the ampersand will be decoded.
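As a rough illustration of that decoding step, here is a minimal Python sketch. It assumes the XML source stores the ampersand as the entity `&amp;`, and it uses a hypothetical `<title>` element and `title` key. The data repo's actual conversion code is not shown in this thread, so this is only a stand-in for it.

```python
import json
import xml.etree.ElementTree as ET


def xml_title_to_json(xml_snippet: str) -> str:
    """Parse a one-element XML snippet and emit its text as JSON.

    The element name and the "title" key are hypothetical.
    """
    element = ET.fromstring(xml_snippet)
    # The XML parser decodes &amp; to a plain "&" before JSON is produced.
    return json.dumps({"title": element.text})


print(xml_title_to_json("<title>&amp; Jessica!: the coolest</title>"))
# {"title": "& Jessica!: the coolest"}
```

Neither `&`, `!`, nor `:` needs escaping inside a JSON string, so the value should reach the index unchanged.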

Now I'm trying to query it from the command line with:

curl 'http://localhost:9200/test1/_search?pretty' -d '{
      "aggs": {
      },
      "from": 0,
      "highlight": {
        "fields": {
          "text": {
            "fragment_size": 100,
            "number_of_fragments": 3
          }
        }
      },
      "size": 20,
      "query": {
        "bool": {
          "must": {
            "query_string": {
              "default_field": "text",
              "query": "Jessica!:"
            }
          },
          "filter": [
            {
              "term": {
                "collection": "example"
              }
            }
          ]
        }
      },
      "sort": [
        "_score"
      ]
}
'

It is not friggin working. Variations on the theme include:

"\"Jessica\!\""

"Jessica\!"

and several other versions. Two of the most common errors are:

"Unrecognized character escape '!' (code 33)\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@479560cb; line: 19, column: 36]"
"reason" : "Cannot parse 'Jessica:': Encountered \"<EOF>\" at line 1, column 8.\nWas expecting one of:\n    <BAREOPER> ...\n    \"(\" ...\n    \"*\" ...\n    <QUOTED> ...\n    <TERM> ...\n    <PREFIXTERM> ...\n    <WILDTERM> ...\n    <REGEXPTERM> ...\n    \"[\" ...\n    \"{\" ...\n    <NUMBER> ...\n    ",
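These two errors appear to come from different layers. The first looks like a JSON parsing failure that happens before any search runs: `\!` is not a legal JSON escape sequence. The second comes from the query_string parser, which reads an unescaped `:` as a `field:value` separator and then finds no value. The Python sketch below only reproduces the JSON half, using Python's `json` module as a stand-in for the server-side parser.

```python
import json


def is_valid_json(raw: str) -> bool:
    """Return True if raw parses as JSON, False otherwise."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


# A lone backslash before "!" is not one of JSON's legal escapes.
bad_body = r'{"query": "Jessica\!"}'
# Doubling the backslash yields a literal backslash inside the JSON string.
good_body = r'{"query": "Jessica\\!"}'

print(is_valid_json(bad_body))   # False
print(is_valid_json(good_body))  # True

# Letting a serializer build the body avoids hand-escaping altogether.
print(json.dumps({"query": "Jessica\\!"}))  # {"query": "Jessica\\!"}
```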
jduss4 commented 6 years ago

@techgique figured out that double escaping the \ helps move things along, but queries are still not hitting on the special characters themselves:

"query_string": {
  "default_field": "text",
  "query": "\"\\& Jessica!: coolest\""
}

" Roy Ellis, President State Teachers College Springfield, Missouri Dear President Ellis & <em>Jessica</em>",
"!: <em>coolest</em> person My two daughters, Hilda and Enid, are planning to enter State Teachers College this"
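The double escaping can be sketched in two steps: backslash-escape the characters that the query_string syntax reserves, then let a JSON serializer double those backslashes. The helper name `escape_query_string` is hypothetical, and this is not code from the API. It drops `<` and `>` because the query_string syntax offers no way to escape them.

```python
import json
import re

# Characters the query_string syntax treats as operators.
_ESCAPABLE = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')


def escape_query_string(text: str) -> str:
    """Backslash-escape query_string operators (hypothetical helper)."""
    # "<" and ">" cannot be escaped, so this sketch removes them.
    without_angles = text.replace("<", "").replace(">", "")
    return _ESCAPABLE.sub(r"\\\1", without_angles)


escaped = escape_query_string("& Jessica!: coolest")
print(escaped)  # \& Jessica\!\: coolest

# json.dumps doubles each backslash, giving the "double escaped" request body.
body = json.dumps({"query_string": {"default_field": "text", "query": escaped}})
print(body)
```

Escaping only gets the text past the query parser. Whether the punctuation can match anything depends on how the field was analyzed at index time.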
jduss4 commented 6 years ago

Changing the highlight fragmenter to span didn't appear to help, though we may want to follow up on making span the default for our highlighting requests:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#_fragmenter

jduss4 commented 6 years ago

It would appear that elasticsearch does not take characters like : into account when looking for search results, just as the highlights do not.
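This matches how analyzed text fields usually behave. If the field uses the default standard analyzer (an assumption, since the mapping is not shown in this thread), punctuation such as `!` and `:` is discarded at index time, leaving nothing for a query to match. The Python sketch below is only a rough approximation of that tokenization, not Elasticsearch's implementation. The `_analyze` API would show the real tokens for a given index.

```python
import re
from typing import List


def approx_standard_tokens(text: str) -> List[str]:
    """Rough stand-in for a word tokenizer plus a lowercase filter."""
    # Keep runs of word characters; punctuation and whitespace are dropped.
    return re.findall(r"\w+", text.lower())


print(approx_standard_tokens("& Jessica!: the coolest"))  # ['jessica', 'the', 'coolest']
print(approx_standard_tokens("Jessica!:"))                # ['jessica']
```

Under that assumption, a search for `Jessica!:` and a search for `Jessica` end up looking for the same token.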

jduss4 commented 6 years ago

Example!

For a bit, I was separating Cather person.name additional names with a semicolon, but then I realized that while the browse view displays them correctly, the API isn't returning results for documents with that person.name. I thought it might be catching on the orchid / api_bridge side of things, but then when I actually tried manually in the API, it also didn't work:

Miner, Charles Hugh (Hugh; Hughie)

collection/cather/items?f[]=person.name|Miner,%20Charles%20Hugh%20(Hugh;%20Hughie)
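One possible explanation, offered as an assumption rather than a diagnosis, is that the literal `;` is treated as a parameter separator somewhere along the way. Some query-string parsers split on both `&` and `;`. The Python sketch below shows what such a split would keep, and how percent-encoding the whole filter value turns the semicolon into `%3B`.

```python
import re
from urllib.parse import quote

raw_value = "person.name|Miner, Charles Hugh (Hugh; Hughie)"

# What a parser that splits on "&" or ";" would keep as the first parameter.
naive_query = "f[]=" + raw_value
first_param = re.split(r"[&;]", naive_query)[0]
print(first_param)  # f[]=person.name|Miner, Charles Hugh (Hugh

# Percent-encoding every reserved character turns ";" into %3B.
encoded_query = "f[]=" + quote(raw_value, safe="")
print(encoded_query)
```

If the truncated value is what reaches the filter, it would not match any stored person.name. That would explain the empty results.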

jduss4 commented 6 years ago

This took care of quotation marks around text searches: https://github.com/CDRH/api/pull/65

jduss4 commented 6 years ago

I am going to close this issue in favor of two issues: