Closed gjost closed 6 years ago
Objects with dates WITHOUT the three-letter timezone name are parsed correctly, while those WITH the tzname are not.
OK ddr-densho-330-33 "record_created": "2017-03-14T14:59:24-0700"
BAD ddr-densho-330-34 "record_created": "2017-03-14T15:03:34PDT-0700"
>>> from DDR import identifier
>>> o = identifier.Identifier('ddr-densho-330-33', base_path='/var/www/media/ddr').object()
>>> o.record_created
datetime.datetime(2017, 3, 14, 14, 59, 24, tzinfo=tzoffset(None, -25200))
>>> o = identifier.Identifier('ddr-densho-330-34', base_path='/var/www/media/ddr').object()
>>> o.record_created
<built-in method now of type object at 0x55f74cf68b80>
The problem with ddr-densho-330-34 is NOT the datetime
. Rather, the file has a JSON syntax error. The file seems to have been edited by hand and a topic removed, leaving a dangling comma:
...
{
"rights_statement": ""
},
{
"topics": [
{
"id": "293",
"term": "Geographic communities: Washington: Seattle"
},
]
},
{
"persons": []
},
...
DDR.models.load_json
tries to be helpful instead of just throwing an exception.
Switched from the stock Python json
lib to simplejson
because it has nicer JSON validation error messages that actually tell you the line number and char on which problems occur.
The DDR.models.load_json
function was also modified so that it no longer fails silently -- command-line tools and the web UI will no fail with an error message if JSON files are not valid.
This are features: instead of having files silently fail and get overwritten, errors will force users to deal with invalid JSON.
Things to test:
ddrindex
, ddr-transform
, ddr-signatures
, etc will fail with the JSONDecodeError message detailing where the problem is. It should be apparent which file is the culprit. If not, running ddr-transform
should surface the problem.pkikawa [09:08] hey geoff, so the new parser works - it gives the line number where it fails to decode, but it doesn't give us the object in which it blows up. how difficult would it be to change the standard out to include the file being processed before being processed?
There's not a simple answer. JSON files are handled differently by various parts of the application. ddr-transform
prints out each filename, then loads the JSON file, so with that tool you have the filename just before the error. ddr-signatures
, OTOH, parses all the files in a loop at the beginning and doesn't print anything out so it currently doesn't display the filenames. In a background task in the web UI (celery) you'll get the stacktrace but no filename at all.
Maybe we need a function that only validates all the JSON files in a collection, and returns a list of the failures. This could be a cmdln tool and also a function you could call from the UI if some other process failed. It would be easier to make something like this than to modify all our existing code.
Background: The DDR.models.load_json
function previously let JSON validation errors pass silently. It was trying to be "helpful", but makes it look like the problem is somewhere else. There are certain (many?) cases where we want validation to fail immediately and loudly.
@GFroh, how would you like to handle this? This particular error surfaced when somebody hand-edited a JSON file. Maybe for this iteration right now we keep the old behavior (JSON validation errors fail silently) but resolve to alter the behavior after this release?
@gjost: I agree that json validation should fail loudly -- definitely an opportunity for difficult-to-diagnose data corruption otherwise. The new parser is nice b/c it does give the line number where the exception was thrown.
Would it be possible to just sort the list of filepaths to be processed after the command walks the os to identify the jsons in a repo? That way, if it blows up, we'd at least know where the batch failed by looking at stdout.
I put together a couple ways to check a collection:
ddrcheck COLLECTION
commandddr-transform
prints out each filename before it tries to load. If there's a JSON validation error you'll know which one failed.
$ ddr-transform -t -R /media/qnfs/kinkura/temp/rainierworkdir/ddr-denho-330
(ddrlocal) ddr@ddreditor:~$ ddrcheck /var/www/media/ddr/ddr-densho-330
Gathering files in /var/www/media/ddr/ddr-densho-330
Checking files...
Checked 955 files
(ddrlocal) ddr@ddreditor:~$ nano /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-34/entity.json
(ddrlocal) ddr@ddreditor:~$ nano /var/www/media/ddr/ddr-densho-330
(ddrlocal) ddr@ddreditor:~$ nano /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-34/entity.json
(ddrlocal) ddr@ddreditor:~$ ddr-transform -t -R /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330
2018-03-27 13:46:33,883 INFO Not committing changes
2018-03-27 13:46:33,883 INFO Loading collection
2018-03-27 13:46:33,903 INFO <DDR.models.Collection collection:ddr-densho-330>
2018-03-27 13:46:33,903 INFO Finding metadata files
2018-03-27 13:46:36,042 INFO 955 paths
2018-03-27 13:46:36,272 INFO Writing
2018-03-27 13:46:36,274 INFO 0/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/collection.json
2018-03-27 13:46:36,752 INFO 1/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-215/entity.json
2018-03-27 13:46:36,756 INFO getting vocab: http://partner.densho.org/vocab/api/0.2//topics.json
2018-03-27 13:46:37,227 INFO 2/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-215/files/ddr-densho-330-215-mezzanine-62e542a261.json
2018-03-27 13:46:37,267 INFO 3/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-215/files/ddr-densho-330-215-master-98e8a2d1ec.json
2018-03-27 13:46:37,309 INFO 4/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-171/entity.json
2018-03-27 13:46:37,447 INFO 5/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-171/files/ddr-densho-330-171-mezzanine-e9feee2665.json
2018-03-27 13:46:37,480 INFO 6/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-171/files/ddr-densho-330-171-master-d466934846.json
2018-03-27 13:46:37,523 INFO 7/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-143/entity.json
2018-03-27 13:46:37,636 INFO 8/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-143/files/ddr-densho-330-143-master-ab0e84ae87.json
yadayadayada...
2018-03-27 13:46:41,758 INFO 71/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-95/files/ddr-densho-330-95-master-6e00d6a4f1.json
2018-03-27 13:46:41,800 INFO 72/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-95/files/ddr-densho-330-95-mezzanine-90c44414ea.json
Traceback (most recent call last):
File "/opt/ddr-local/venv/ddrlocal/bin/ddr-transform", line 7, in <module>
__import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-transform')
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
exec(code, namespace, namespace)
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 146, in <module>
main()
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 141, in main
mail=args.mail,
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 76, in transform
o = identifier.Identifier(path).object()
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/identifier.py", line 1072, in object
return self.object_class(mappings).from_identifier(self)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1544, in from_identifier
return from_json(Entity, identifier.path_abs('json'), identifier)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 376, in from_json
document.load_json(fileio.read_text(json_path))
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1643, in load_json
json_data = load_json(self, module, json_text)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 263, in load_json
json_data = json.loads(json_text)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 87 column 9 (char 2455)
from past experience we know it is blowing up on object ddr-densho-330-34 - the line and char numbers are correct in identifying the apostatic comma
That stacktrace looks to me like it's incomplete. Here is what I see when I run ddr-transform on the same collection. It prints out the path for each file before it tries to transform it. On my machine the error happens on file 337/955.
(ddrlocal) ddr@denshodeb9:/var/www/media/ddr$ ddr-transform -t -R /var/www/media/ddr/ddr-densho-330
2018-03-27 13:28:18,928 INFO Not committing changes
2018-03-27 13:28:18,929 INFO Loading collection
2018-03-27 13:28:18,958 INFO <DDR.models.Collection collection:ddr-densho-330>
2018-03-27 13:28:18,958 INFO Finding metadata files
2018-03-27 13:28:19,029 INFO 955 paths
2018-03-27 13:28:19,252 INFO Writing
2018-03-27 13:28:19,255 INFO 0/955 /var/www/media/ddr/ddr-densho-330/collection.json
2018-03-27 13:28:19,885 INFO 1/955 /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-65/entity.json
2018-03-27 13:28:19,890 INFO getting vocab: http://partner.densho.org/vocab/api/0.2//topics.json
2018-03-27 13:28:20,594 INFO 2/955 /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-65/files/ddr-densho-330-65-master-83f8e96721.json
2018-03-27 13:28:20,651 INFO 3/955 /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-65/files/ddr-densho-330-65-mezzanine-66e4d15b49.json
2018-03-27 13:28:20,713 INFO 4/955 /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-59/entity.json
2018-03-27 13:28:20,904 INFO 5/955 /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-59/files/ddr-densho-330-59-mezzanine-22cf08c734.json
...
Not gonna waste bits copying everything...
...
2018-03-27 13:29:29,814 INFO 335/955 /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-157/entity.json
2018-03-27 13:29:29,954 INFO 336/955 /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-157/files/ddr-densho-330-157-master-63945c0e68.json
2018-03-27 13:29:30,023 INFO 337/955 /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-157/files/ddr-densho-330-157-mezzanine-7c0229031f.json
Traceback (most recent call last):
File "/opt/ddr-local/venv/ddrlocal/bin/ddr-transform", line 7, in <module>
__import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-transform')
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
exec(code, namespace, namespace)
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 146, in <module>
main()
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 141, in main
mail=args.mail,
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 76, in transform
o = identifier.Identifier(path).object()
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/identifier.py", line 1072, in object
return self.object_class(mappings).from_identifier(self)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1544, in from_identifier
return from_json(Entity, identifier.path_abs('json'), identifier)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 376, in from_json
document.load_json(fileio.read_text(json_path))
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1643, in load_json
json_data = load_json(self, module, json_text)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 263, in load_json
json_data = json.loads(json_text)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 87 column 9 (char 2242)
If that doesn't work for you, try running the new ddrcheck command on the same collection:
(ddrlocal) ddr@denshodeb9:/var/www/media/ddr$ ddrcheck /var/www/media/ddr/ddr-densho-330
Gathering files in /var/www/media/ddr/ddr-densho-330
Checking files...
338/955 ERROR /var/www/media/ddr/ddr-densho-330/files/ddr-densho-330-34/entity.json - Expecting value: line 87 column 9 (char 2242)
Checked 955 files
We're done here, are we not?
I think we should move the logging line to the beginning of the for loop. Currently, it does not print the filepath being processed until after the json parser is invoked. This means that you can see the line number that caused the exception, but cannot see which file name created the issue.
The ddrcheck command is helpful, but this simple change will make debugging much less time-consuming.
See:
https://github.com/densho/ddr-cmdln/blob/065-json/ddr/bin/ddr-transform#L81
Sure, I'll do that.
ah yeah, that previous stacktrace didnt include the begining. comment edited to restore start of stacktrace
Yeah the break between the command and the start of the stacktrace tripped me up there.
ddr-transform still not listing the culprit file
2018-03-27 14:52:48,277 INFO 69/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-129/files/ddr-densho-330-129-mezzanine-46dda8f80a.json
2018-03-27 14:52:48,318 INFO 70/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-95/entity.json
2018-03-27 14:52:48,478 INFO 71/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-95/files/ddr-densho-330-95-master-6e00d6a4f1.json
2018-03-27 14:52:48,518 INFO 72/955 /media/qnfs/kinkura/temp/rainerworkdir/ddr-densho-330/files/ddr-densho-330-95/files/ddr-densho-330-95-mezzanine-90c44414ea.json
Traceback (most recent call last):
File "/opt/ddr-local/venv/ddrlocal/bin/ddr-transform", line 7, in <module>
__import__('pkg_resources').run_script('ddr-cmdln==0.9.4b0', 'ddr-transform')
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
exec(code, namespace, namespace)
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 146, in <module>
main()
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 141, in main
mail=args.mail,
File "/opt/ddr-local/venv/ddrlocal/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/EGG-INFO/scripts/ddr-transform", line 76, in transform
o = identifier.Identifier(path).object()
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/identifier.py", line 1072, in object
return self.object_class(mappings).from_identifier(self)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1544, in from_identifier
return from_json(Entity, identifier.path_abs('json'), identifier)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 376, in from_json
document.load_json(fileio.read_text(json_path))
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 1643, in load_json
json_data = load_json(self, module, json_text)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/ddr_cmdln-0.9.4b0-py2.7.egg/DDR/models/__init__.py", line 263, in load_json
json_data = json.loads(json_text)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/opt/ddr-local/venv/ddrlocal/local/lib/python2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 87 column 9 (char 2455)
UPDATE: This was reported as a datetime parsing error but is actually a JSON validation error - see below.
pkikawa