PRImA-Research-Lab / prima-page-viewer

Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR.
Apache License 2.0

ALTO import not working #1

Closed: kirkhess closed this issue 9 years ago

kirkhess commented 9 years ago

I tried importing an ALTO document from another project and I receive a null pointer exception. Here's the one that didn't work: https://gist.github.com/kirkhess/fb775b9ad26e04f1514c

Stack Trace

java.lang.NullPointerException
    at org.primaresearch.dla.page.io.xml.sax.SaxPageHandler_2013_07_15.handlePageElement(SaxPageHandler_2013_07_15.java:477)
    at org.primaresearch.dla.page.io.xml.sax.SaxPageHandler_2013_07_15.startElement(SaxPageHandler_2013_07_15.java:122)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:378)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2778)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:221)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:135)
    at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:201)
    at org.primaresearch.page.viewer.dla.XmlDocumentLayoutLoader.doRun(XmlDocumentLayoutLoader.java:35)
    at org.primaresearch.page.viewer.extra.Task.run(Task.java:37)
    at org.primaresearch.page.viewer.extra.Task$TaskThread.run(Task.java:93)

java.lang.NullPointerException
    at org.primaresearch.page.viewer.ui.views.DocumentImageView.refresh(DocumentImageView.java:109)
    at org.primaresearch.page.viewer.EventListener.toggleDisplayMode(EventListener.java:294)
    at org.primaresearch.page.viewer.EventListener.widgetSelected(EventListener.java:100)
    at org.eclipse.swt.widgets.TypedListener.handleEvent(Unknown Source)
    at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Display.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
    at org.eclipse.swt.widgets.Widget.notifyListeners(Unknown Source)
    at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
    at org.primaresearch.page.viewer.PageViewer.<init>(PageViewer.java:86)
    at org.primaresearch.page.viewer.PageViewer.main(PageViewer.java:50)

I had better luck testing with some ALTO v2.0 data downloaded from ChroniclingAmerica (http://chroniclingamerica.loc.gov/ocr/). Working one: https://gist.githubusercontent.com/kirkhess/09f9f381780a50fe62d2

chris1010010 commented 9 years ago

Hi,

Thanks for letting us know. I get the same exception. The file uses a very old ALTO version that is currently not supported. I will check whether supporting it is straightforward and, if so, implement it.

Kind regards,

Christian

Christian Clausner
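For anyone hitting a similar failure: one general way to report an unsupported format version up front, rather than failing later inside a SAX handler, is to read the namespace of the root element before choosing a handler. The following is only a minimal sketch using the standard StAX API. It is not how prima-core-libs detects versions; the class name, the method names, and the `urn:example:...` namespaces are placeholders invented for illustration.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class RootNamespaceCheck {

    // Hypothetical list of supported namespaces. These are placeholder URNs,
    // not the real ALTO or PAGE namespace URIs.
    static final Set<String> SUPPORTED = Set.of("urn:example:format-v2");

    /** Returns the namespace URI of the root element, or "" if it has none. */
    static String rootNamespace(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTD processing is not needed just to look at the root element.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The first START_ELEMENT event is the root element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String ns = reader.getNamespaceURI();
                        return ns == null ? "" : ns;
                    }
                }
                return "";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Throws a descriptive exception when the root namespace is not in SUPPORTED. */
    static void requireSupported(String xml) {
        String ns = rootNamespace(xml);
        if (!SUPPORTED.contains(ns)) {
            throw new IllegalArgumentException("Unsupported schema namespace: '" + ns + "'");
        }
    }

    public static void main(String[] args) {
        String oldVersion = "<doc xmlns=\"urn:example:format-v1\"><Page/></doc>";
        try {
            requireSupported(oldVersion);
        } catch (IllegalArgumentException e) {
            // A readable message replaces the NullPointerException seen by the user.
            System.out.println(e.getMessage());
        }
    }
}
```

Checking only the root element keeps the test cheap, since the rest of the file is never parsed when the version is rejected.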



chris1010010 commented 9 years ago

Simply adding ALTO 1.1 to the list of supported versions didn't work because the XML schema could not be loaded. This is apparently caused by a namespace import (xlink) within the schema.
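A common general-purpose workaround when a schema fails to load because of an `xs:import` is to serve the imported schema from a local copy through an `LSResourceResolver`. The sketch below shows that technique with the standard `javax.xml.validation` API. It is an illustration under assumptions, not the change made in prima-core-libs: the class and method names are hypothetical, and the two tiny in-memory schemas with `urn:example:...` namespaces stand in for the ALTO schema and its xlink import.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;
import org.xml.sax.SAXException;

final class BundledImportSchemaLoader {

    // Placeholder main schema that imports a second namespace (made up for this sketch).
    static final String MAIN_XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\""
        + " xmlns:lk=\"urn:example:links\" targetNamespace=\"urn:example:doc\">"
        + "<xs:import namespace=\"urn:example:links\" schemaLocation=\"links.xsd\"/>"
        + "<xs:element name=\"doc\"><xs:complexType>"
        + "<xs:attribute ref=\"lk:href\"/>"
        + "</xs:complexType></xs:element></xs:schema>";

    // Placeholder imported schema, normally a file shipped alongside the main one.
    static final String LINKS_XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\""
        + " targetNamespace=\"urn:example:links\">"
        + "<xs:attribute name=\"href\" type=\"xs:anyURI\"/></xs:schema>";

    /** Compiles mainXsd, answering imports from bundled (namespace URI to XSD text). */
    static Schema load(String mainXsd, Map<String, String> bundled) throws SAXException {
        final DOMImplementationLS ls;
        try {
            ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM Load/Save implementation available", e);
        }
        LSResourceResolver resolver = (type, namespaceURI, publicId, systemId, baseURI) -> {
            String xsd = namespaceURI == null ? null : bundled.get(namespaceURI);
            if (xsd == null) {
                return null; // not bundled: let the parser use its default lookup
            }
            LSInput input = ls.createLSInput();
            input.setStringData(xsd);
            input.setSystemId(systemId);
            return input;
        };
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setResourceResolver(resolver);
        return factory.newSchema(new StreamSource(new StringReader(mainXsd)));
    }

    /** Convenience wrapper: true if the schema compiles, false if loading fails. */
    static boolean loads(String mainXsd, Map<String, String> bundled) {
        try {
            load(mainXsd, bundled);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("with bundled import: "
            + loads(MAIN_XSD, Map.of("urn:example:links", LINKS_XSD)));
        System.out.println("without bundled import: " + loads(MAIN_XSD, Map.of()));
    }
}
```

Keying the lookup on the namespace URI rather than on `schemaLocation` means the bundled copy is used regardless of which location hint the importing schema happens to carry.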

chris1010010 commented 9 years ago

Added limited support for ALTO 1.1 (no warranty). Get the latest source from prima-core-libs or wait for release 1.2.