denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

ddrindex publish is not publishing segments #56

Closed gjost closed 6 years ago

gjost commented 6 years ago

UPDATE: Note that the fix requires updating ddr-defs.

GeoffFroh commented 6 years ago

Some error output:

$ ddrindex publish --hosts XXXXXX:9200 --recurse --force /var/www/media/ddr/ddr-one-7
...
2018-03-19 14:24:37.826610-07:00 | 2812/4216 POST ddr-one-7-26-21
Traceback (most recent call last):
  File "/opt/ddr-local/venv/ddrlocal/bin/ddrindex", line 14, in <module>
    load_entry_point('ddr-cmdln==0.9.4b0', 'console_scripts', 'ddrindex')()
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/cli/ddrindex.py", line 287, in publish
    status = docstore.Docstore(hosts, index).post_multi(path, recursive=recurse, force=force)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/docstore.py", line 666, in post_multi
    created = self.post(document, parents=parents, force=force)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/docstore.py", line 560, in post
    ES_Class = ELASTICSEARCH_CLASSES_BY_MODEL[document.identifier.model]
KeyError: 'segment'
gjost commented 6 years ago

Added 'segment' to ELASTICSEARCH_CLASSES_BY_MODEL, pointing to Entity. That fixes the particular error above, but leads to another error.

docstore.Docstore.post_multi uses docstore.Docstore.post to post an object to Elasticsearch. docstore.Docstore.post uses repo_models.elastic.Entity.Meta.doc_type as the document type, which is "entity". However, when docstore.Docstore.post_multi does a GET to see if the object was saved successfully, it uses object.identifier.model, which in this case is "segment". This makes it look like the object was not written to Elasticsearch.
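To make the mismatch concrete, here is a minimal sketch. The stand-in `Entity` class, the in-memory `index`, and the helper functions are all hypothetical, not the actual ddr-cmdln or Elasticsearch code; only `ELASTICSEARCH_CLASSES_BY_MODEL` and the "entity"/"segment" names come from this thread.

```python
# Hypothetical stand-in; the real class lives in repo_models.elastic.
class Entity(object):
    doc_type = 'entity'  # stand-in for Entity.Meta.doc_type

# 'segment' points at Entity, as described above.
ELASTICSEARCH_CLASSES_BY_MODEL = {
    'entity': Entity,
    'segment': Entity,
}

def doc_type_for(model):
    """Return the doc type under which a document of this model is stored."""
    return ELASTICSEARCH_CLASSES_BY_MODEL[model].doc_type

# A fake index keyed by (doc_type, id), standing in for Elasticsearch.
index = {}

def post(model, doc_id, body):
    # Writes under the class's doc type, not the identifier's model.
    index[(doc_type_for(model), doc_id)] = body

def exists_by_model(model, doc_id):
    # The problematic check: looks under the identifier's model name.
    return (model, doc_id) in index

def exists_by_doc_type(model, doc_id):
    # Consistent check: resolves the doc type the same way post() does.
    return (doc_type_for(model), doc_id) in index

post('segment', 'ddr-one-7-26-21', {'title': 'example'})
print(exists_by_model('segment', 'ddr-one-7-26-21'))     # False
print(exists_by_doc_type('segment', 'ddr-one-7-26-21'))  # True
```

The point of the sketch is only that the write and the follow-up read should derive the document type from the same place.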

gjost commented 6 years ago

Using ELASTICSEARCH_CLASSES_BY_MODEL[oi.model]._doc_type.name solved that problem. On to the next one!

...
2018-03-21 17:22:56.965632-07:00 | 3099/4216 POST ddr-one-7-30
2018-03-21 17:22:56.986403-07:00 | 3100/4216 POST ddr-one-7-30-8
Traceback (most recent call last):
  File "/opt/ddr-local/venv/ddrlocal/bin/ddrindex", line 11, in <module>
    load_entry_point('ddr-cmdln==0.9.4b0', 'console_scripts', 'ddrindex')()
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py",
...
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/cli/ddrindex.py", line 287, in publish
    status = docstore.Docstore(hosts, index).post_multi(path, recursive=recurse, force=force)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/docstore.py", line 655, in post_multi
    document = oi.object()
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/identifier.py", line 1072, in object
    return self.object_class(mappings).from_identifier(self)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1540, in from_identifier
    return from_json(Entity, identifier.path_abs('json'), identifier)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 382, in from_json
    document.load_json(fileio.read_text(json_path))
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1639, in load_json
    json_data = load_json(self, module, json_text)
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 283, in load_json
    f.values()[0]
  File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/modules.py", line 112, in function
    value = function(value)
  File "/opt/ddr-local/ddr-defs/repo_models/segment.py", line 1080, in jsonload_topics
    converters.text_to_bracketids(text, ['term','id'])
  File "/opt/ddr-local/ddr-defs/repo_models/segment.py", line 1076, in TEMP_scrub_topicdata
    item['term'] = TEMP_this.TOPICS[item['id']]
KeyError: u'205'

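This turned out to be code that was not tolerant enough of bad data: TEMP_scrub_topicdata looks up item['id'] in TOPICS and raises a KeyError when the ID (here u'205') is not there.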
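A minimal sketch of a more tolerant lookup, assuming `TOPICS` maps topic IDs to term labels. The `scrub_topicdata` function and the sample data are hypothetical and are not the actual ddr-defs fix.

```python
# Hypothetical sample data: topic IDs mapped to term labels.
TOPICS = {
    u'120': u'Example topic A',
    u'121': u'Example topic B',
}

def scrub_topicdata(items, topics):
    """Fill in each item's 'term' from topics, tolerating unknown IDs.

    Items whose 'id' is missing from topics keep whatever term they already
    had, and their IDs are collected so the caller can log or report them.
    """
    unknown = []
    for item in items:
        term = topics.get(item['id'])  # .get() returns None instead of raising
        if term is None:
            unknown.append(item['id'])
        else:
            item['term'] = term
    return items, unknown

items = [
    {'id': u'120', 'term': u'stale label'},
    {'id': u'205', 'term': u'label from the JSON file'},
]
scrubbed, unknown = scrub_topicdata(items, TOPICS)
print(unknown)  # IDs that were not found in TOPICS
```

Whether to keep the existing term, drop the item, or abort with a clearer message is a judgment call; the sketch keeps the existing term so a single bad ID does not stop a long publish run.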

gjost commented 6 years ago

Partially fixed in commit #3847388. Also requires fixes from ddr-defs commits #abb6d89 and #e538992.

gjost commented 6 years ago

Gonna keep this open until it's tested and merged into master

pkikawa commented 6 years ago

all good on my end now!