cfpb / sheer

A tool for loading arbitrary content into Elasticsearch and serving that content on the web.
Creative Commons Zero v1.0 Universal

processors.json now explicitly respects ordering #28

Closed rosskarchner closed 10 years ago

rosskarchner commented 10 years ago

And there's a new IndexHelper which lets you get_document from processing code.

Use it like:

from sheer.processors.helpers import IndexHelper
index_helper = IndexHelper()

... later...

document = index_helper.get_document('some_type','some_document_id')

himedlooff commented 10 years ago

Pretty sweet, thank you! :heart_eyes_cat:

himedlooff commented 10 years ago

Hmmm, I'm getting the following error:

elasticsearch.exceptions.NotFoundError: TransportError(404, u'{"_index":"content","_type":"posts","_id":"spring-2014-rulemaking-agenda","found":false}')

In wordpress_view_processor.py I added the following code:

from sheer.processors.helpers import IndexHelper
index_helper = IndexHelper()

Then inside of the process_view function...

...
    popular_posts = []
    for slug in custom_fields['popular_posts'][:5]:
        popular_posts.append(index_helper.get_document('posts',slug))
    post['popular_posts'] = popular_posts
...

rosskarchner commented 10 years ago

Are you sure you have the most up-to-date processors.json? (i.e., is "posts" listed before "views"?)

himedlooff commented 10 years ago

Yup:

{
  "posts" : {
    "url" :       "$WORDPRESS/?json=1",
    "processor" : "wordpress_post_processor",
    "mappings" :  "_defaults/posts_mappings.json"
  },
  "views" : {
    "url" :       "$WORDPRESS/api/get_recent_posts/?post_type=view",
    "processor" : "wordpress_view_processor"
  }
}

Here's the full error:

$ sheer index

creating mapping for views (wordpress_view_processor)

No handlers could be found for logger "elasticsearch"

Traceback (most recent call last):
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/bin/sheer", line 8, in <module>
    execfile(__file__)
  File "/Users/moricim/Projects/tools/sheer/sheer/scripts/sheer", line 60, in <module>
    args.func(args, config)
  File "/Users/moricim/Projects/tools/sheer/sheer/indexer.py", line 125, in index_location
    for i, document in enumerate(processor.documents()):
  File "/Users/moricim/Projects/himedlooff/cfw/cfgov-refresh/_lib/wordpress_view_processor.py", line 30, in documents
    yield process_view(view)
  File "/Users/moricim/Projects/himedlooff/cfw/cfgov-refresh/_lib/wordpress_view_processor.py", line 39, in process_view
    popular_posts.append(index_helper.get_document('views','blog'))
  File "/Users/moricim/Projects/tools/sheer/sheer/processors/helpers.py", line 21, in get_document
    doc_type=doctype, id=docid)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 70, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 228, in get
    params=params)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/transport.py", line 223, in perform_request
    status, raw_data = connection.perform_request(method, url, params, body, ignore=ignore)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 53, in perform_request
    self._raise_error(response.status, raw_data)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 82, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, u'{"_index":"content","_type":"views","_id":"blog","found":false}')

rosskarchner commented 10 years ago

What are you trying to accomplish with:

popular_posts.append(index_helper.get_document('views','blog'))

That will always fail within the view processor, because the blog view hasn't been saved to elasticsearch yet
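
If a processor needs to tolerate documents that are not indexed yet, one option is to catch the not-found error instead of letting it abort the whole run. The following is only a sketch: `collect_documents`, `fake_index`, and `fake_get_document` are hypothetical names, and the stand-in index raises `KeyError` so the example runs without Elasticsearch.

```python
def collect_documents(get_document, doc_type, doc_ids, not_found_errors=(KeyError,)):
    """Look up each id; return (found documents, ids that were missing)."""
    found = []
    missing = []
    for doc_id in doc_ids:
        try:
            found.append(get_document(doc_type, doc_id))
        except not_found_errors:
            # The document is not in the index (yet); remember it and move on.
            missing.append(doc_id)
    return found, missing


# Hypothetical stand-in for an index: unknown documents raise KeyError.
fake_index = {("posts", "hello-world"): {"title": "Hello world"}}


def fake_get_document(doc_type, doc_id):
    return fake_index[(doc_type, doc_id)]


found, missing = collect_documents(
    fake_get_document, "posts", ["hello-world", "not-indexed-yet"]
)
print(found)
print(missing)
```

In a sheer processor you would presumably pass `index_helper.get_document` as the lookup and `elasticsearch.exceptions.NotFoundError` (the exception shown in the traceback above) as `not_found_errors`. That only hides the symptom, though; the lookup still returns nothing until the other document type has been indexed.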

So now that 'posts' load before 'views', you can't do this from the posts processor:

himedlooff commented 10 years ago

But I'm not doing this:

popular_posts.append(index_helper.get_document('views','blog'))

I'm trying to get posts with get_document from wordpress_view_processor.py like this: popular_posts.append(index_helper.get_document('posts',slug)).

rosskarchner commented 10 years ago

The traceback includes that line

  File "/Users/moricim/Projects/himedlooff/cfw/cfgov-refresh/_lib/wordpress_view_processor.py", line 39, in process_view
    popular_posts.append(index_helper.get_document('views','blog'))

Are you sure it's not there? ;)

rosskarchner commented 10 years ago

OK, I see in the post from a few hours ago that it was failing to look up a post:

https://github.com/cfpb/sheer/pull/28#issuecomment-47722090

It almost seems like, when you run sheer index, it isn't respecting the new order defined in processors.json: it's processing the view first, and thus get_document('posts','whatever') fails because there are no posts in Elasticsearch yet.
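
For reference, a plain dict in Python 2.7 does not guarantee key order, so a JSON loader has to ask for an ordered mapping explicitly. The sketch below shows one way to read a processors.json-style file in file order and check that one processor comes before another. It is not sheer's actual implementation; `load_processors` and `runs_before` are hypothetical names, and the config text is a trimmed-down version of the one posted above.

```python
import json
from collections import OrderedDict


def load_processors(text):
    """Parse processors.json text, keeping the keys in file order."""
    return json.loads(text, object_pairs_hook=OrderedDict)


def runs_before(processors, dependency, dependent):
    """Return True if `dependency` is listed before `dependent`."""
    names = list(processors)
    return names.index(dependency) < names.index(dependent)


config_text = """
{
  "posts": {"processor": "wordpress_post_processor"},
  "views": {"processor": "wordpress_view_processor"}
}
"""

processors = load_processors(config_text)
for name in processors:
    # Iteration follows the order in the file, so "posts" is printed first.
    print(name)
```

A check like `runs_before(processors, "posts", "views")` could be run before indexing to confirm that the file order is what the view processor expects.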

himedlooff commented 10 years ago

Ok, my apologies. I did at one point try popular_posts.append(index_helper.get_document('views','blog')) after popular_posts.append(index_helper.get_document('posts',slug)) didn't work. I figured trying a view in the view processor should work. So that explains the discrepancy, again sorry!

Here's the error when running popular_posts.append(index_helper.get_document('posts',slug)):

$ sheer index

creating mapping for views (wordpress_view_processor)

No handlers could be found for logger "elasticsearch"

Traceback (most recent call last):
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/bin/sheer", line 8, in <module>
    execfile(__file__)
  File "/Users/moricim/Projects/tools/sheer/sheer/scripts/sheer", line 60, in <module>
    args.func(args, config)
  File "/Users/moricim/Projects/tools/sheer/sheer/indexer.py", line 125, in index_location
    for i, document in enumerate(processor.documents()):
  File "/Users/moricim/Projects/himedlooff/cfw/cfgov-refresh/_lib/wordpress_view_processor.py", line 33, in documents
    yield process_view(view)
  File "/Users/moricim/Projects/himedlooff/cfw/cfgov-refresh/_lib/wordpress_view_processor.py", line 42, in process_view
    popular_posts.append(index_helper.get_document('posts',slug))
  File "/Users/moricim/Projects/tools/sheer/sheer/processors/helpers.py", line 21, in get_document
    doc_type=doctype, id=docid)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 70, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 228, in get
    params=params)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/transport.py", line 223, in perform_request
    status, raw_data = connection.perform_request(method, url, params, body, ignore=ignore)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 53, in perform_request
    self._raise_error(response.status, raw_data)
  File "/Users/moricim/Projects/.virtualenvs/cfgov-refresh/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 82, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)

elasticsearch.exceptions.NotFoundError: TransportError(404, u'{"_index":"content","_type":"posts","_id":"spring-2014-rulemaking-agenda","found":false}')

rosskarchner commented 10 years ago

Is the problematic version of cfgov-refresh committed anywhere? I'd like to check out your code and see if I get the same problem.

himedlooff commented 10 years ago

I'll create a fork for you

himedlooff commented 10 years ago

https://github.com/himedlooff/cfgov-refresh/tree/getdoc

Here's the commit with the code I added: https://github.com/himedlooff/cfgov-refresh/commit/d7d18431b9a34a7acc3a4b5bf9c2e0182252f771

rosskarchner commented 10 years ago

Awesome, I'll give it a shot.