exponential-decay / demystify

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes: to break the export into its components and store them in a SQLite database; to create additional columns that augment the output where useful; and to query the SQLite database, outputting results in a readable form for analysis by researchers and archivists in digital preservation departments at memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.
http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export
zlib License

SF YAML time parsing error #30

Closed: tw4l closed this issue 8 years ago

tw4l commented 8 years ago

Hi Ross,

Loving the new version of the analysis tool, but I seem to have hit a snag in trying it out. I tried to run droid2sqlite.py against a YAML export from Siegfried 1.5, using both PRONOM and tika namespaces. It's unsuccessful, seemingly due to the parsing of years in the SFHandlerClass.

STDERR is copied below:

mfmmessier:droid-sqlite-analysis-0.4.0 twalsh$ python droid2sqlite.py --export kolmactest.yaml
Traceback (most recent call last):
  File "droid2sqlite.py", line 79, in <module>
    main()
  File "droid2sqlite.py", line 72, in main
    identifyinput(args.export)
  File "droid2sqlite.py", line 21, in identifyinput
    return handleSFYAML(export)
  File "droid2sqlite.py", line 40, in handleSFYAML
    loader.sfDBSetup(sfexport, basedb.getcursor())
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFLoaderClass.py", line 108, in sfDBSetup
    sf.addYear(sfdata)
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFHandlerClass.py", line 284, in addYear
    row[self.FIELDYEAR] = self.getYear(year)
  File "/Users/twalsh/droid-sqlite-analysis-0.4.0/libs/SFHandlerClass.py", line 290, in getYear
    dt = datetime.datetime.strptime(datestring.split('+', 1)[0], '%Y-%m-%dT%H:%M:%S')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_strptime.py", line 328, in _strptime
    data_string[found.end():])
ValueError: unconverted data remains: -04:00

And the relevant lines from _strptime.py:

if len(data_string) != found.end():
    raise ValueError("unconverted data remains: %s" %
                     data_string[found.end():])

It appears that the issue may be in the time zone parsing but I wasn't able to figure out exactly what in the limited time I had to tinker with getYear in SFHandlerClass. Any ideas? (I'm running Python 2.7.10, if that makes any difference)

Thanks!

tw4l commented 8 years ago

Ah, of course, the moment I step away I realize what's going on in SFHandlerClass.py:

def getYear(self, datestring):
   #sf example: 2016-04-02T20:45:12+13:00
   datestring = datestring.replace('Z', '') #TODO: Handle 'Z' (Nato: Zulu) time (ZIPs only?)
   dt = datetime.datetime.strptime(datestring.split('+', 1)[0], '%Y-%m-%dT%H:%M:%S')
   return int(dt.year)

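getYear strips the time zone offset by splitting the date string on '+', but in this case the offset starts with a '-', not a '+'. The '-04:00' is therefore still attached when the string reaches strptime, which raises the ValueError.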
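To illustrate, here is a small reproduction of the behaviour. The date and time are borrowed from the code comment and only the -04:00 offset comes from the traceback; strip_on_plus is a made-up helper that mirrors the split in getYear, not a function from the tool:

```python
import datetime

FMT = '%Y-%m-%dT%H:%M:%S'

def strip_on_plus(datestring):
    # Mirrors the split used in getYear: only a '+' offset is removed.
    return datestring.split('+', 1)[0]

positive = strip_on_plus('2016-04-02T20:45:12+13:00')
negative = strip_on_plus('2016-04-02T20:45:12-04:00')

# The '+13:00' case is trimmed cleanly and parses.
year = datetime.datetime.strptime(positive, FMT).year

# The '-04:00' case keeps its offset, so strptime rejects it.
try:
    datetime.datetime.strptime(negative, FMT)
    error = None
except ValueError as err:
    error = str(err)
```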

The following seems to work, although it does introduce another dependency (namely, dateutil.parser):

import dateutil.parser  # third-party dependency: python-dateutil

def getYear(self, datestring):
   #sf example: 2016-04-02T20:45:12+13:00
   datestring = datestring.replace('Z', '') #TODO: Handle 'Z' (Nato: Zulu) time (ZIPs only?)
   dt = dateutil.parser.parse(datestring)
   return int(dt.year)

ross-spencer commented 8 years ago

Hi Tim,

Thanks for this! It's great! And thanks for going the extra step of isolating the issue.

Rather than put in another dependency, I've opted to create a unit test for two date handling functions, and to return 'NULL' when there's a problem retrieving the date. At least the script should now output something useful without failing.

I've placed a try/except in the code as well as a final backup of treating the date as a string.
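As a rough sketch of that approach, and not the code that was actually committed, a standard-library-only version might look something like the following. The names get_year and OFFSET are hypothetical, and the regular expression assumes offsets shaped like Z, +13:00 or -0400:

```python
import datetime
import re

# Hypothetical pattern: a trailing 'Z' or a +hh:mm / -hh:mm style offset.
OFFSET = re.compile(r'(Z|[+-]\d{2}:?\d{2})$')

def get_year(datestring):
    """Return the year as an int, or 'NULL' if it cannot be determined."""
    try:
        # Drop the offset whichever sign it carries, then parse strictly.
        trimmed = OFFSET.sub('', datestring.strip())
        dt = datetime.datetime.strptime(trimmed, '%Y-%m-%dT%H:%M:%S')
        return dt.year
    except (ValueError, AttributeError):
        pass
    # Final backup: treat the date as a string and read a leading 4-digit year.
    match = re.match(r'(\d{4})-', str(datestring))
    return int(match.group(1)) if match else 'NULL'
```

With this shape, a timestamp carrying a negative offset or fractional seconds still yields a year, and anything unrecognisable comes back as 'NULL' instead of stopping the load.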

It feels quite hacky, and I wish Python's native support was stronger, but I hope these look okay to you.

Please let me know what you think of the overall output, and performance once you've got it up and running okay.

(I've also spotted an error with the --export flag in droidsqliteanalysis.py at the same time. You can run droidsqliteanalysis.py --export to avoid having to run droid2sqlite.py as well)

Cheers,

Ross