abrin / foresite-toolkit

Automatically exported from code.google.com/p/foresite-toolkit

Improve error reporting when parsing RDF #7

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Certain errors are swallowed silently when Foresite parses RDF/XML.  It
appears that some syntax errors (e.g. illegal or not-well-formed URIs) in
ORE RDF/XML are not detected eagerly; they surface only when Jena
attempts to read that portion of the graph.

For example, calling JenaOREParser.parse(InputStream is) with the attached
ReM returns null, with no errors reported.  If
JenaOREParser.parse(InputStream is) is modified to write out the model
prior to creating the ReM, the error is reported.  It would be nice if
Foresite/Jena could eagerly parse a ReM and report errors without silently
swallowing them.

After modifying JenaOREParser.parse(InputStream is) to write out the model,
I was able to see the error:
ERROR [main]: datapub.mapping.ForesiteOreRemMapper@104 2009-12-14 11:08:11,587 ORE ReM parsing failed: Only well-formed absolute URIrefs can be included in RDF/XML output: <info://figure:figure1> Code: 0/ILLEGAL_CHARACTER in PORT: The character violates the grammar rules for URIs/IRIs.
com.hp.hpl.jena.shared.BadURIException: Only well-formed absolute URIrefs can be included in RDF/XML output: <info://figure:figure1> Code: 0/ILLEGAL_CHARACTER in PORT: The character violates the grammar rules for URIs/IRIs.
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.checkURI(BaseXMLWriter.java:768)
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.relativize(BaseXMLWriter.java:745)
    at com.hp.hpl.jena.xmloutput.impl.Basic.writeResourceReference(Basic.java:154)
    at com.hp.hpl.jena.xmloutput.impl.Basic.writePredicate(Basic.java:101)
    at com.hp.hpl.jena.xmloutput.impl.Basic.writeRDFStatements(Basic.java:77)
    at com.hp.hpl.jena.xmloutput.impl.Basic.writeRDFStatements(Basic.java:66)
    at com.hp.hpl.jena.xmloutput.impl.Basic.writeBody(Basic.java:40)
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.writeXMLBody(BaseXMLWriter.java:452)
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:424)
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:410)
    at com.hp.hpl.jena.rdf.model.impl.ModelCom.write(ModelCom.java:270)
    at org.dspace.foresite.jena.JenaOREParser.parse(JenaOREParser.java:70)
    at edu.jhu.library.datapub.mapping.ForesiteOreRemMapper.fromPublisherRem(ForesiteOreRemMapper.java:97)
    at edu.jhu.library.datapub.mapping.MapperTest.testSimplePublisherMapping(MapperTest.java:78)

Original issue reported on code.google.com by emets...@gmail.com on 14 Dec 2009 at 4:20

GoogleCodeExporter commented 9 years ago
Attaching correct file.

Original comment by emets...@gmail.com on 14 Dec 2009 at 4:24

GoogleCodeExporter commented 9 years ago

Original comment by azarot...@gmail.com on 15 Dec 2009 at 3:50

GoogleCodeExporter commented 9 years ago
Test case added to the repository. Will accept a patch if one is available.

Original comment by azarot...@gmail.com on 15 Dec 2009 at 4:05

GoogleCodeExporter commented 9 years ago
Thanks for accepting!  I'm not sure yet what a good solution for this issue is.
I suspect that a complete solution would require a bit of coding.  It would be
nice to take a ReM and check to see that it conforms to the assertions stated in
http://www.openarchives.org/ore/1.0/datamodel (e.g. the MUSTs in sections 3, 4).
I know that some are covered explicitly by Foresite (e.g. the protocol
requirement in URIs) but it seems that others are not (they are implicitly
covered by Jena).
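
As a rough illustration of what an eager check could look like, here is a minimal sketch that validates a list of URI strings up front and collects every problem instead of stopping at the first. It uses only java.net.URI from the JDK, not Jena or Foresite; the class and method names (EagerUriCheck, findProblems) are hypothetical. java.net.URI applies different rules from Jena's IRI checker, so this sketch will not necessarily flag the same URIs that Jena rejects, and it covers only a scheme/syntax check rather than the full set of MUSTs in the ORE data model.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Foresite or Jena.
class EagerUriCheck {

    /** Returns one message per URI that fails the checks; an empty list means all passed. */
    static List<String> findProblems(List<String> uris) {
        List<String> problems = new ArrayList<>();
        for (String candidate : uris) {
            try {
                URI uri = new URI(candidate);
                // A URI without a scheme is relative, so it has no protocol.
                if (!uri.isAbsolute()) {
                    problems.add(candidate + ": no scheme (protocol)");
                }
            } catch (URISyntaxException e) {
                // Record the syntax error instead of swallowing it.
                problems.add(candidate + ": " + e.getMessage());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Example inputs are made up for illustration.
        List<String> problems = findProblems(Arrays.asList(
                "http://example.org/rem",
                "aggregation/1",
                "http://example.org/a b"));
        for (String problem : problems) {
            System.out.println(problem);
        }
    }
}
```

Collecting the messages in a list, rather than throwing on the first failure, lets a caller report every problem in a ReM in one pass.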

That said, what I did as a hack was to add the following to JenaOREParser's
ResourceMap parse(InputStream is) method:
Model model = this.parseToModel( is );
// Serialize the model to a null output stream, so that errors Jena only
// detects while writing are reported here
model.write( new OutputStream()
{
    @Override
    public void write( int b ) throws IOException
    {
        // do nothing
    }
} );
...
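
The same idea can be sketched without Jena, to show the shape of the workaround in isolation. The Writable interface below is a hypothetical stand-in for a model that validates only while it is being written; it is not a Jena or Foresite type, and firstError is an invented helper name. One detail worth noting: the default OutputStream.write(byte[], int, int) calls write(int) once per byte, so a discarding stream that also overrides the bulk method avoids that per-byte loop. On Java 11 or later, OutputStream.nullOutputStream() provides a ready-made discarding sink.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration, not part of Foresite or Jena.
class EagerSerializeCheck {

    /** Stand-in for a model whose errors only appear during serialization. */
    interface Writable {
        void write(OutputStream out) throws IOException;
    }

    /** Sink that discards every byte it is given. */
    static final class DiscardingStream extends OutputStream {
        @Override
        public void write(int b) {
            // do nothing
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            // do nothing; overriding this skips the default per-byte loop
        }
    }

    /** Runs a throwaway serialization; returns null on success, else the error message. */
    static String firstError(Writable model) {
        try (OutputStream sink = new DiscardingStream()) {
            model.write(sink);
            return null;
        } catch (IOException | RuntimeException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A well-behaved model and one that fails while writing.
        System.out.println(firstError(out -> out.write(new byte[] {1, 2, 3})));
        System.out.println(firstError(out -> {
            throw new IllegalArgumentException("bad URI");
        }));
    }
}
```

Returning the message is only for brevity here; in a parser it would probably be more useful to rethrow the exception wrapped in the parser's own exception type so the caller sees the cause.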

I don't think this is the best solution, and I'm not sure what the performance
implications are.  I'll think on this some more.  I'm wondering if there are
some options to pass to the Jena ARP parser which may help, but I haven't
explored them in detail:
http://jena.sourceforge.net/ARP/standalone.html
http://jena.sourceforge.net/IO/iohowto.html#input

Original comment by emets...@gmail.com on 15 Dec 2009 at 4:58