denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

ddrindex not indexing av/visual history correctly #186

Closed sarabeckman closed 3 years ago

sarabeckman commented 4 years ago

Indexed ddr-chi-1 and ddr-densho-400 to ddrstage. Neither is displaying correctly.

For ddr-chi-1, the video download links are not working and the player is the old player.

For ddr-densho-400, the type is an AV object that is audio only. It should appear like this CSUJAD interview on production: http://ddr.densho.org/ddr-csujad-9-1/. I indexed ddr-csujad-9 to ddrstage to compare, and it is not displaying.

For comparison, here is a good doc from the production ES cluster (indexed at 2019-03-11T11:55:53):

GOODddr-csujad-9-1-ESdoc.json.txt

And here is a bad doc from the stage ES cluster that was indexed with ddr-cmdln v5.0.4 on master:

BADddr-csujad-9-1-ESdoc.json.txt

This behavior is also causing ddr-public to use the incorrect version of the av templates (i.e., the old segment template that uses the deprecated embedded IA player). Here is a bad doc from the production ES index for ddr-chi-1-1:

BADddr-chi-1-1-1-ESdoc.json.txt

GeoffFroh commented 4 years ago

This is the same behavior reported in https://github.com/denshoproject/ddr-cmdln/issues/155

Note that we confirmed that the problem entities are present at IA and have the proper metadata (see: https://github.com/denshoproject/ddr-cmdln/issues/123)

GeoffFroh commented 4 years ago

Confirmed that archivedotorg.py is apparently functioning:

>>> from DDR import archivedotorg, models, identifier
>>> e = models.Entity.from_identifier(identifier.Identifier("ddr-chi-1-1-1","/media/qnfs/kinkura/gold/"))
>>> iameta = archivedotorg.get_ia_meta(e)
>>> iameta
{'id': 'ddr-chi-1-1-1',
 'xml_url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1_files.xml',
 'http_status': 200,
 'original': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mpg',
 'mimetype': 'video/mpeg',
 'files': {'mp3': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mp3',
                   'format': 'mp3',
                   'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mp3',
                   'mimetype': 'audio/mpeg',
                   'encoding': None,
                   'sha1': 'd91cb8611509815d52db60381b7375db2260620d',
                   'size': '1744963',
                   'length': '145.15',
                   'height': '0',
                   'width': '0',
                   'title': ''},
           'mp4': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mp4',
                   'format': 'mp4',
                   'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mp4',
                   'mimetype': 'video/mp4',
                   'encoding': None,
                   'sha1': '21a31c0e996d9da39c7c275504262ab433533a9d',
                   'size': '15178544',
                   'length': '145.15',
                   'height': '480',
                   'width': '853',
                   'title': ''},
           'mpg': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mpg',
                   'format': 'mpg',
                   'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mpg',
                   'mimetype': 'video/mpeg',
                   'encoding': None,
                   'sha1': '87bc67df89c2f45a0e1555c52afab8bf1fa433f8',
                   'size': '653512708',
                   'length': '145.13',
                   'height': '1080',
                   'width': '1920',
                   'title': ''},
           'png': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.png',
                   'format': 'png',
                   'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.png',
                   'mimetype': 'image/png',
                   'encoding': None,
                   'sha1': '58b81cc20f4414c2336235cb0e64aed2e0703cf2',
                   'size': '29827',
                   'length': '',
                   'height': '',
                   'width': '',
                   'title': ''}}}

GeoffFroh commented 4 years ago

This issue is the result of changes to Python 3's configparser class:

"Config parsers do not guess datatypes of values in configuration files, always storing them internally as strings." (https://docs.python.org/3/library/configparser.html#supported-datatypes)

The problem code is in the config module:

OFFLINE = CONFIG.get('debug', 'offline')

(https://github.com/denshoproject/ddr-cmdln/blob/master/ddr/DDR/config.py#L44)

It should be:

OFFLINE = CONFIG.getboolean('debug', 'offline')

The template attribute is generated by processing data from IA. This data is retrieved by the archivedotorg.get_ia_meta() function which is invoked by DDRObject.to_esobject() at:

https://github.com/denshoproject/ddr-cmdln/blob/master/ddr/DDR/models/common.py#L414

The logic checks the value of the config var OFFLINE (set in the [debug] section of the app configs). The default value in ddrlocal.cfg is offline=False, and because configparser.get() is used instead of .getboolean(), the resulting value of config.OFFLINE is the string 'False', which is truthy like any non-empty string. Therefore, the expression:

if not config.OFFLINE:

always evaluates to boolean False, archivedotorg.get_ia_meta() is never invoked, the template attribute is not set, and the resulting ES doc is invalid.
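To make the difference concrete, here is a minimal sketch using only the standard library. It assumes an in-memory config string as a stand-in for the real ddrlocal.cfg, and the helper `would_fetch_ia_meta` is a hypothetical name that only mirrors the `if not config.OFFLINE:` guard; it is not code from ddr-cmdln.

```python
import configparser

# Stand-in for the [debug] section of ddrlocal.cfg (assumption: not the real file).
CONFIG = configparser.ConfigParser()
CONFIG.read_string("""
[debug]
offline=False
""")

# .get() returns the raw string; any non-empty string is truthy.
offline_str = CONFIG.get('debug', 'offline')
# .getboolean() parses the text 'False' into the boolean False.
offline_bool = CONFIG.getboolean('debug', 'offline')


def would_fetch_ia_meta(offline):
    """Hypothetical helper mirroring the `if not config.OFFLINE:` guard."""
    return not offline


print(repr(offline_str), would_fetch_ia_meta(offline_str))
print(repr(offline_bool), would_fetch_ia_meta(offline_bool))
```

With the string value the guard never passes, so the IA lookup is skipped; with the parsed boolean it passes as intended.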

Note that this configparser behavior in Python 3 may affect other Django projects in our portfolio.

gjost commented 4 years ago

configparser boolean behavior updated in ddr-local commit f8d9c1c and ddr-cmdln commit 8e0bdeb for package ddrlocal-master_5.0.5~deb10.

GeoffFroh commented 4 years ago

> configparser boolean behavior updated in ddr-local commit f8d9c1c and ddr-cmdln commit 8e0bdeb for package ddrlocal-master_5.0.5~deb10.

This fixes the issue with the archivedotorg.get_ia_meta() code being skipped (see: https://github.com/denshoproject/ddr-cmdln/issues/186#issuecomment-650662860), but the underlying issue still exists with some entities.

Indexing ddr-csujad-9-1 with the patched code did work (see: ddrstage.densho.org/ddr-csujad-9-1)

Indexing ddr-densho-400-20 with the patched code now hits the get_ia_meta() function, but throws this error:

(cmdln) ddr@kinkura:/home/densho$ ddrindex publish --hosts 192.168.0.20:9200 -r /media/qnfs/kinkura/gold/ddr-densho-400/files/ddr-densho-400-1
2020-06-29 12:08:14.549164-07:00 | 1/4 POST ddr-densho-400-1-transcript-bb74aa023d 
2020-06-29 12:08:14.602200-07:00 | 2/4 SKIP ddr-densho-400-1-master-70dda47d00 unpublishable
2020-06-29 12:08:14.611803-07:00 | 3/4 POST ddr-densho-400-1-mezzanine-70dda47d00 
2020-06-29 12:08:14.643226-07:00 | 4/4 POST ddr-densho-400-1 
Traceback (most recent call last):
  File "/opt/ddr-cmdln/venv/cmdln/bin/ddrindex", line 33, in <module>
    sys.exit(load_entry_point('ddr-cmdln==3.0.0.post1', 'console_scripts', 'ddrindex')())
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/cli/ddrindex.py", line 311, in publish
    path, recursive=recurse, force=force
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/docstore.py", line 649, in post_multi
    created = self.post(document, parents=parents, force=True)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/docstore.py", line 565, in post
    d = document.to_esobject(public_fields=public_fields, public=public)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/models/common.py", line 415, in to_esobject
    d.ia_meta = archivedotorg.get_ia_meta(self)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 40, in get_ia_meta
    iaobject = IAObject(o.identifier.id)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 99, in __init__
    self._gather_files_meta()
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 134, in _gather_files_meta
    self.files[format_] = IAFile(self.id, format_, tag)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 177, in __init__
    setattr(self, field, tag.find(field).contents[0])
IndexError: list index out of range

Here's the IA meta for ddr-csujad-9-1 (which works):

https://ia803004.us.archive.org/23/items/ddr-csujad-9-1/ddr-csujad-9-1_files.xml

And for ddr-densho-400-20 (which does not): https://ia802806.us.archive.org/6/items/ddr-densho-400-20/ddr-densho-400-20_files.xml

The only difference between the two sets of files appears to be the presence of an ogg file in the working entity (ddr-csujad-9-1).

Both of the underlying Entity files (i.e., entity.json) have genre set to interview and format set to av as per spec, and both have an mp3 file as the mezzanine and master file in the file_groups attribute.

gjost commented 4 years ago

I dropped and recreated my local Elasticsearch index and now I'm seeing the IndexError.

gjost commented 4 years ago

In this particular case the error is because the title field for the mp3 item in https://ia802804.us.archive.org/9/items/ddr-densho-400-4/ddr-densho-400-4_files.xml is blank. Is this something we care about?

Update: Looks like the original MP3 has empty title, album, and creator tags.

GeoffFroh commented 4 years ago

> In this particular case the error is because the title field for the mp3 item in https://ia802804.us.archive.org/9/items/ddr-densho-400-4/ddr-densho-400-4_files.xml is blank. Is this something we care about?
>
> Update: Looks like the original MP3 has empty title, album, and creator tags.

Looks like those are just the embedded ID3 tags, which we don't use in the interface at all, so they're not important. The function should ignore them if they're empty or not present.
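A minimal sketch of that tolerant behavior, for illustration only. It uses the standard library's xml.etree.ElementTree as a stand-in for the parser used in archivedotorg.py, and the XML excerpt, the `FIELDS` list, and the `gather_fields` helper are all hypothetical; this is not the actual fix.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt shaped like an IA *_files.xml entry (illustrative values).
FILES_XML = """
<files>
  <file name="example.mp3" source="original">
    <format>VBR MP3</format>
    <size>1744963</size>
    <title></title>
  </file>
</files>
"""

# Hypothetical field list; 'length' is deliberately absent from the XML above.
FIELDS = ['format', 'size', 'title', 'length']


def gather_fields(file_tag, fields):
    """Return {field: text}, using '' for empty or missing child tags."""
    data = {}
    for field in fields:
        child = file_tag.find(field)
        # child is None when the tag is absent; child.text is None when it is empty.
        text = child.text if child is not None else None
        data[field] = text.strip() if text else ''
    return data


file_tag = ET.fromstring(FILES_XML.strip()).find('file')
meta = gather_fields(file_tag, FIELDS)
print(meta)
```

Defaulting to an empty string matches the shape of the get_ia_meta() output shown earlier in the thread, where unused fields such as title are already ''.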

gjost commented 4 years ago

Empty tags coming from IA are now ignored.

Fixed in ddr-cmdln commit 711e303 for package ddrcmdln-master_5.0.5~deb10 / ddrlocal-master_5.0.5~deb10.