calvez / xcoaitoolkit

Automatically exported from code.google.com/p/xcoaitoolkit

OAI Toolkit crashes for bad xml data #20

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. During the load process of the OAI Toolkit, supply an XML file that is not well formed and therefore fails to parse.
2. The OAI Toolkit crashes and terminates unexpectedly.
3. The remaining XML files are not loaded after this termination.

What is the expected output? What do you see instead?
The OAI Toolkit is expected to handle the bad file gracefully: show an appropriate error and move on to the next MARCXML file. Instead, it crashes as described in the steps above.

What version of the product are you using? On what operating system?
The problem has been present from the start and was reproduced with versions 0.5 and 0.6 of the OAI Toolkit, on both Linux and Windows servers.

Please provide any additional information below:

Original issue reported on code.google.com by sva...@library.rochester.edu on 10 Jul 2009 at 5:49

GoogleCodeExporter commented 9 years ago
The approach is to first investigate and test the problem to find the affected area inside the OAI Toolkit.

The next step is to catch the error so that it no longer causes a crash. The OAI Toolkit could then continue processing the other XML files.

The files that might be modified are Importer.java and XMLUtil.java.
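
As a rough illustration of the catch-and-continue idea, here is a minimal sketch that uses only the JDK's SAX parser. The class name `TolerantLoader` and the method `loadAll` are hypothetical and are not taken from Importer.java or XMLUtil.java; the plain SAX parse is only a stand-in for the real import step, which goes through marc4j.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the OAI Toolkit source.
class TolerantLoader {

    /**
     * Processes each file in turn. A parse or I/O failure in one file is
     * reported and recorded, and the loop moves on to the next file.
     * Returns the list of files that failed.
     */
    static List<File> loadAll(List<File> files) throws ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        List<File> failed = new ArrayList<>();
        for (File file : files) {
            try {
                SAXParser parser = factory.newSAXParser();
                // Stand-in for the real import step (assumption).
                // DefaultHandler rethrows fatal parse errors as SAXParseException.
                parser.parse(file, new DefaultHandler());
            } catch (SAXException | IOException e) {
                // Report the problem and keep going instead of terminating.
                System.err.println("Skipping " + file.getName() + ": " + e.getMessage());
                failed.add(file);
            }
        }
        return failed;
    }
}
```

Calling `TolerantLoader.loadAll(files)` on a batch that contains one malformed file should return a list holding just that file, while the other files are still parsed.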

Right now, the log output and stack trace when the OAI Toolkit crashes look something like:

........2009-07-01 18:11:46,906 [main] (Importer.java:492) INFO  - [PRG] Modify statistics for 0_uiu_bibs_2_70000.xml: converted: 10000, invalid: 0 records. It took 00:00:10.084
2009-07-01 18:11:47,897 [main] (Importer.java:744) INFO  - [PRG] This is a valid MARCXML file.
2009-07-01 18:11:47,897 [main] (Importer.java:435) INFO  - [PRG] Modifying records...
.......... (10%)
.......... (20%)
.....[Fatal Error] :172386:6: The content of elements must consist of well-formed character data or markup.
org.marc4j.MarcException: Unable to parse input
    at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:95)
    at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:64)
    at org.marc4j.MarcXmlParserThread.run(MarcXmlParserThread.java:115)
Caused by: org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1231)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at org.marc4j.MarcXmlParser.parse(MarcXmlParser.java:93)
    ... 2 more

Original comment by sva...@library.rochester.edu on 10 Jul 2009 at 5:52

GoogleCodeExporter commented 9 years ago
The bug was that the MARCXML file was not validated before being loaded into the OAI Toolkit, which caused marc4j to crash. The code has been changed so that each file is now validated in two steps:
1. It is checked for well-formedness.
2. It is then validated against the schema.
If the file passes both checks, it is loaded into the OAI Toolkit.
Otherwise, an appropriate error is written to the logs for the user.
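
The two-step check could look roughly like the following sketch, written against the standard `javax.xml.parsers` and `javax.xml.validation` APIs. The class `TwoStepValidator` and its methods are hypothetical names chosen for illustration; the actual change in the 0.6.3 code may be structured differently, and the schema file to use is left to the caller as an assumption.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not the toolkit's actual code.
class TwoStepValidator {

    /** Compiles an XSD file into a reusable Schema object. */
    static Schema loadSchema(File xsd) throws SAXException {
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI).newSchema(xsd);
    }

    /**
     * Returns null if the file passes both checks, otherwise a short
     * message suitable for the log.
     */
    static String check(File xml, Schema schema)
            throws IOException, ParserConfigurationException {
        // Step 1: well-formedness. A non-validating SAX parse fails with a
        // SAXParseException on malformed markup.
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(xml, new DefaultHandler());
        } catch (SAXException e) {
            return "not well-formed: " + e.getMessage();
        }
        // Step 2: schema validation. With no ErrorHandler set, the validator
        // throws SAXException on the first validation error.
        try {
            schema.newValidator().validate(new StreamSource(xml));
        } catch (SAXException e) {
            return "schema-invalid: " + e.getMessage();
        }
        return null;
    }
}
```

A caller would load the schema once with `loadSchema`, run `check` on each file before handing it to the loader, and log the returned message and skip the file whenever the result is not null.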

Original comment by sva...@library.rochester.edu on 17 Nov 2009 at 2:58

GoogleCodeExporter commented 9 years ago
Incorporated in the 0.6.3 version of the software.

Original comment by sva...@library.rochester.edu on 17 Nov 2009 at 4:16