Closed (sarabeckman closed this issue 3 years ago)
This is the same behavior reported in https://github.com/denshoproject/ddr-cmdln/issues/155
Note that we confirmed that the problem entities are present at IA and have the proper metadata (see: https://github.com/denshoproject/ddr-cmdln/issues/123)
Confirmed that archivedotorg.py is apparently functioning:
>>> from DDR import archivedotorg, models, identifier
>>> e = models.Entity.from_identifier(identifier.Identifier("ddr-chi-1-1-1","/media/qnfs/kinkura/gold/"))
>>> iameta = archivedotorg.get_ia_meta(e)
>>> iameta
{'id': 'ddr-chi-1-1-1',
 'xml_url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1_files.xml',
 'http_status': 200,
 'original': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mpg',
 'mimetype': 'video/mpeg',
 'files': {'mp3': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mp3',
                   'format': 'mp3',
                   'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mp3',
                   'mimetype': 'audio/mpeg',
                   'encoding': None,
                   'sha1': 'd91cb8611509815d52db60381b7375db2260620d',
                   'size': '1744963',
                   'length': '145.15',
                   'height': '0',
                   'width': '0',
                   'title': ''},
           'mp4': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mp4',
                   'format': 'mp4',
                   'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mp4',
                   'mimetype': 'video/mp4',
                   'encoding': None,
                   'sha1': '21a31c0e996d9da39c7c275504262ab433533a9d',
                   'size': '15178544',
                   'length': '145.15',
                   'height': '480',
                   'width': '853',
                   'title': ''},
           'mpg': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mpg',
                   'format': 'mpg',
                   'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mpg',
                   'mimetype': 'video/mpeg',
                   'encoding': None,
                   'sha1': '87bc67df89c2f45a0e1555c52afab8bf1fa433f8',
                   'size': '653512708',
                   'length': '145.13',
                   'height': '1080',
                   'width': '1920',
                   'title': ''},
           'png': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.png',
                   'format': 'png',
                   'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.png',
                   'mimetype': 'image/png',
                   'encoding': None,
                   'sha1': '58b81cc20f4414c2336235cb0e64aed2e0703cf2',
                   'size': '29827',
                   'length': '',
                   'height': '',
                   'width': '',
                   'title': ''}}}
This issue is the result of changes to Python 3's configparser class:
"Config parsers do not guess datatypes of values in configuration files, always storing them internally as strings." (https://docs.python.org/3/library/configparser.html#supported-datatypes)
The problem code is in the config module:
OFFLINE = CONFIG.get('debug', 'offline')
(https://github.com/denshoproject/ddr-cmdln/blob/master/ddr/DDR/config.py#L44)
It should use getboolean() instead:
OFFLINE = CONFIG.getboolean('debug', 'offline')
The template attribute is generated by processing data from IA. This data is retrieved by the archivedotorg.get_ia_meta() function, which is invoked by DDRObject.to_esobject() at:
https://github.com/denshoproject/ddr-cmdln/blob/master/ddr/DDR/models/common.py#L414
The logic checks the value of the config var OFFLINE (set in the [debug] section of the app configs). The default value in ddrlocal.cfg is offline=False, and because configparser.get() is used instead of .getboolean(), the resulting value of config.OFFLINE is the string 'False'. Therefore, the expression:
if not config.OFFLINE:
always evaluates to boolean False (a non-empty string is truthy), so archivedotorg.get_ia_meta() is never invoked, the template attribute is not set, and the resulting ES doc is invalid.
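To make the difference concrete, here is a minimal sketch of get() versus getboolean() on the same option. The inline config text is a hypothetical stand-in for the [debug] section of ddrlocal.cfg, not the real file.

```python
import configparser

# Hypothetical stand-in for the [debug] section of ddrlocal.cfg.
SAMPLE_CFG = """
[debug]
offline=False
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)

as_string = config.get('debug', 'offline')         # the str 'False'
as_bool = config.getboolean('debug', 'offline')    # the bool False

# A non-empty string is truthy, so `not as_string` is False and a branch
# guarded by it would be skipped; `not as_bool` is True and it would run.
print(repr(as_string), not as_string)
print(repr(as_bool), not as_bool)
```

With get(), the guarded branch never runs regardless of what the config file says, which matches the skipped get_ia_meta() call described above.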
Note that this configparser behavior in Python 3 may affect other Django projects in our portfolio.
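One possible way to audit other projects is to list every option whose raw value looks like a boolean, then check by hand whether the code reads it with get() or getboolean(). This is only a sketch: the helper name boolean_like_options and the sample config are hypothetical, and the match is based on configparser's own BOOLEAN_STATES table, so values such as '0' and '1' are flagged even when they are meant as integers.

```python
import configparser

def boolean_like_options(parser):
    """Return (section, option, raw_value) for values getboolean() would accept."""
    states = configparser.ConfigParser.BOOLEAN_STATES
    found = []
    for section in parser.sections():
        # raw=True skips interpolation so we see the value as written.
        for option, value in parser.items(section, raw=True):
            if value is not None and value.strip().lower() in states:
                found.append((section, option, value))
    return found

# Hypothetical config used only to exercise the helper.
SAMPLE_CFG = """
[debug]
offline=False

[public]
title=Example
retries=3
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)
flagged = boolean_like_options(parser)
print(flagged)
```

Each flagged option is only a candidate; whether it is actually misread depends on how the consuming code fetches it.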
configparser boolean behavior updated in ddr-local commit f8d9c1c and ddr-cmdln commit 8e0bdeb for package ddrlocal-master_5.0.5~deb10.
This fixes the issue with the archivedotorg.get_ia_meta() code being skipped (see: https://github.com/denshoproject/ddr-cmdln/issues/186#issuecomment-650662860), but the underlying issue still exists with some entities.
Indexing ddr-csujad-9-1 with the patched code did work (see: ddrstage.densho.org/ddr-csujad-9-1).
Indexing ddr-densho-400-20 with the patched code now hits the get_ia_meta() function, but throws this error:
(cmdln) ddr@kinkura:/home/densho$ ddrindex publish --hosts 192.168.0.20:9200 -r /media/qnfs/kinkura/gold/ddr-densho-400/files/ddr-densho-400-1
2020-06-29 12:08:14.549164-07:00 | 1/4 POST ddr-densho-400-1-transcript-bb74aa023d
2020-06-29 12:08:14.602200-07:00 | 2/4 SKIP ddr-densho-400-1-master-70dda47d00 unpublishable
2020-06-29 12:08:14.611803-07:00 | 3/4 POST ddr-densho-400-1-mezzanine-70dda47d00
2020-06-29 12:08:14.643226-07:00 | 4/4 POST ddr-densho-400-1
Traceback (most recent call last):
  File "/opt/ddr-cmdln/venv/cmdln/bin/ddrindex", line 33, in <module>
    sys.exit(load_entry_point('ddr-cmdln==3.0.0.post1', 'console_scripts', 'ddrindex')())
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/cli/ddrindex.py", line 311, in publish
    path, recursive=recurse, force=force
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/docstore.py", line 649, in post_multi
    created = self.post(document, parents=parents, force=True)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/docstore.py", line 565, in post
    d = document.to_esobject(public_fields=public_fields, public=public)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/models/common.py", line 415, in to_esobject
    d.ia_meta = archivedotorg.get_ia_meta(self)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 40, in get_ia_meta
    iaobject = IAObject(o.identifier.id)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 99, in __init__
    self._gather_files_meta()
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 134, in _gather_files_meta
    self.files[format_] = IAFile(self.id, format_, tag)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 177, in __init__
    setattr(self, field, tag.find(field).contents[0])
IndexError: list index out of range
Here's the IA meta for ddr-csujad-9-1 (which works):
https://ia803004.us.archive.org/23/items/ddr-csujad-9-1/ddr-csujad-9-1_files.xml
And for ddr-densho-400-20 (which does not): https://ia802806.us.archive.org/6/items/ddr-densho-400-20/ddr-densho-400-20_files.xml
The only difference between the two sets of files appears to be the presence of an ogg file in the working entity (ddr-csujad-9-1).
Both of the underlying Entity files (i.e., entity.json) have genre set to interview and format set to av as per spec, and both have an mp3 file as the mezzanine and master file in the file_groups attribute.
I dropped and recreated my local Elasticsearch index and now I'm seeing the IndexError.
In this particular case the error is because the title field for the mp3 item in https://ia802804.us.archive.org/9/items/ddr-densho-400-4/ddr-densho-400-4_files.xml is blank. Is this something we care about?
Update: Looks like the original MP3 has empty title, album, and creator tags.
Looks like those are just the embedded ID3 tags, which we don't use in the interface at all, so they are not important. The function should ignore them if they're empty or not present.
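As a rough illustration of tolerant extraction, the sketch below reads per-file fields from a files.xml-style document and falls back to an empty string when a tag is empty or missing. It is not the change that went into ddr-cmdln: it uses the standard library's xml.etree.ElementTree rather than the parser behind tag.find(field).contents[0], and the sample XML, the FIELDS list, and the file_meta helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for an IA *_files.xml document.
SAMPLE_XML = """
<files>
  <file name="example.mp3" source="original">
    <format>VBR MP3</format>
    <size>1744963</size>
    <title></title>
    <album/>
  </file>
</files>
"""

# Hypothetical field list; 'creator' is deliberately absent from the sample.
FIELDS = ['format', 'size', 'title', 'album', 'creator']

def file_meta(file_elem, fields=FIELDS):
    """Collect field text, using '' for missing or empty child elements."""
    meta = {}
    for field in fields:
        child = file_elem.find(field)
        # ElementTree gives text=None for empty elements; find() gives None
        # for missing ones. Both cases collapse to ''.
        text = child.text if child is not None else None
        meta[field] = (text or '').strip()
    return meta

root = ET.fromstring(SAMPLE_XML)
meta = file_meta(root.find('file'))
print(meta)
```

Defaulting to an empty string keeps the output shape consistent with the get_ia_meta() result shown earlier, where title is already '' for every file.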
Empty tags coming from IA are now ignored.
Fixed in ddr-cmdln commit 711e303 for package ddrcmdln-master_5.0.5~deb10 / ddrlocal-master_5.0.5~deb10.
Indexed ddr-chi-1 and ddr-densho-400 to ddrstage. Neither is displaying correctly.
For ddr-chi-1, the video download links are not working and the player is the old player.
For ddr-densho-400, the type is an AV object that is audio only. It should appear like this CSUJAD interview on production: http://ddr.densho.org/ddr-csujad-9-1/. I indexed ddr-csujad-9 to ddrstage to compare, and it is not displaying.
For comparison, here is a good doc from the production ES cluster (indexed at 2019-03-11T11:55:53):
GOODddr-csujad-9-1-ESdoc.json.txt
And here is a bad doc from the stage ES cluster that was indexed with ddr-cmdln v5.0.4 on master:
BADddr-csujad-9-1-ESdoc.json.txt
This behavior is also causing ddr-public to use the incorrect version of the av templates (i.e., the old segment template that uses the deprecated embedded IA player). Here is a bad doc from the production ES index for ddr-chi-1-1:
BADddr-chi-1-1-1-ESdoc.json.txt