apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

RDFDataMgr can't read back what it writes #1879

Closed ebremer closed 1 year ago

ebremer commented 1 year ago

Version

4.8.0

What happened?

I have a URL with a space in it, which originally came from the ORCID website. The following code shows the error raised when trying to read it back:

        String bad = "https://dial.uclouvain.be/pr/boreal/search/site/sm_creator:\"Van de Ven, Annelies\"";
        System.out.println(bad);
        Resource r = ResourceFactory.createResource(bad);
        System.out.println(bad);
        Model m = ModelFactory.createDefaultModel();
        m.add(r,RDF.type,FOAF.Person);
        RDFDataMgr.write(System.out,m,Lang.TURTLE);
        RDFDataMgr.write(new FileOutputStream("data.ttl"),m,Lang.TURTLE);
        Model x = ModelFactory.createDefaultModel();
        RDFDataMgr.read(x, new FileInputStream("data.ttl"), Lang.TURTLE);

All works fine if the spaces are swapped with %20. Shouldn't createResource() throw an error if the created URI would be bad?
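Until the library rejects such values itself, a caller can guard the string before handing it to createResource(). The following is a minimal sketch, not part of Jena: the class and method names (`IriCheck`, `firstIllegalChar`, `requireWritableIri`) are hypothetical. It only tests for the characters that the Turtle `IRIREF` grammar rule excludes (U+0000 to U+0020 and `<>"{}|^` plus backtick and backslash), so passing the check does not prove the string is a fully valid IRI.

```java
final class IriCheck {
    // Characters excluded by the Turtle IRIREF production, besides U+0000..U+0020.
    private static final String FORBIDDEN = "<>\"{}|^`\\";

    /** Returns the index of the first character a Turtle IRIREF cannot contain, or -1 if there is none. */
    static int firstIllegalChar(String iri) {
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= 0x20 || FORBIDDEN.indexOf(c) >= 0) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the input unchanged, or throws if it could not be written as a Turtle IRI. */
    static String requireWritableIri(String iri) {
        int at = firstIllegalChar(iri);
        if (at >= 0) {
            throw new IllegalArgumentException(
                "Bad character in IRI at index " + at + ": '" + iri.charAt(at) + "'");
        }
        return iri;
    }
}
```

With this in place, the example above would become `ResourceFactory.createResource(IriCheck.requireWritableIri(bad))`, which fails at creation time instead of when the file is read back.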

Relevant output and stacktrace

https://dial.uclouvain.be/pr/boreal/search/site/sm_creator:"Van de Ven, Annelies"
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
https://dial.uclouvain.be/pr/boreal/search/site/sm_creator:"Van de Ven, Annelies"
<https://dial.uclouvain.be/pr/boreal/search/site/sm_creator:"Van de Ven, Annelies">
        a       <http://xmlns.com/foaf/0.1/Person> .
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 1, col: 66] Bad character in IRI (space): <https://dial.uclouvain.be/pr/boreal/search/site/sm_creator:"Van[space]...>
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:160)
    at org.apache.jena.riot.tokens.TokenizerText.error(TokenizerText.java:1336)
    at org.apache.jena.riot.tokens.TokenizerText.readIRI(TokenizerText.java:537)
    at org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:197)
    at org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:93)
    at org.apache.jena.atlas.iterator.PeekIterator.fill(PeekIterator.java:50)
    at org.apache.jena.atlas.iterator.PeekIterator.<init>(PeekIterator.java:44)
    at org.apache.jena.riot.lang.LangEngine.<init>(LangEngine.java:48)
    at org.apache.jena.riot.lang.LangBase.<init>(LangBase.java:31)
    at org.apache.jena.riot.lang.LangTurtleBase.<init>(LangTurtleBase.java:59)
    at org.apache.jena.riot.lang.LangTurtle.<init>(LangTurtle.java:35)
    at org.apache.jena.riot.lang.RiotParsers.createParserTurtle(RiotParsers.java:99)
    at org.apache.jena.riot.lang.RiotParsers.createParser(RiotParsers.java:57)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:202)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:416)
    at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:406)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:356)
    at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:570)
    at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:718)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:253)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:220)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:206)
    at com.ebremer.orcid.NewClass.main(NewClass.java:35)
Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal (DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute (DefaultExecutor.java:166)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947)
    at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:370)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:351)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:171)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:163)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:298)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:77)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:568)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)

Are you interested in making a pull request?

None

afs commented 1 year ago

See also https://issues.apache.org/jira/browse/JENA-2351

afs commented 1 year ago

" is illegal as well. Jena just happens to be a bit more fussy about spaces because of all the problems they cause.

The trouble with generating reports on write is that streaming can either warn, and keep going, or error, leaving partial output. Both can go unnoticed.
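To make the trade-off concrete, here is a toy sketch in plain Java. It is not Jena code: `StreamingWriteDemo`, `OnBadIri`, and the one-IRI-per-line output are all invented for illustration. It shows why both policies are awkward for a streaming writer: with WARN the bad IRI still ends up in the output, and with ERROR everything written before the bad IRI has already been emitted.

```java
import java.util.List;

final class StreamingWriteDemo {
    enum OnBadIri { WARN, ERROR }

    /**
     * Appends one "<iri>" line per entry to out, the way a streaming writer would.
     * WARN: records a warning and still writes the bad IRI.
     * ERROR: throws immediately; lines already appended remain in out.
     */
    static void write(List<String> iris, OnBadIri policy,
                      StringBuilder out, List<String> warnings) {
        for (String iri : iris) {
            if (iri.indexOf(' ') >= 0) {
                String message = "Bad character in IRI (space): " + iri;
                if (policy == OnBadIri.ERROR) {
                    throw new IllegalStateException(message);
                }
                warnings.add(message);
            }
            out.append('<').append(iri).append(">\n");
        }
    }
}
```

Given three IRIs where the second contains a space, WARN produces all three lines plus one warning that nothing forces the caller to read, while ERROR leaves only the first line in `out`.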

ebremer commented 1 year ago

Thanks Andy

afs commented 1 year ago

I'm not saying "do nothing" - it is that the choice of what to do is not straightforward.

Normally the data is rejected when it is read in - how did the data originally get in, in this case?

ebremer commented 1 year ago

It was pulled from https://orcid.org/0000-0003-3039-2116 using the Accept header "text/turtle" and then saved to a file. The ORCID people appear to be exporting it via Jena (4.4.0, I think); see https://github.com/ORCID/ORCID-Source. When I was reading this ttl file (with many others) back into an in-memory Model with RDFDataMgr, the error was thrown. To deal with it for now, I made a custom ErrorHandler and used the code below (instead of RDFDataMgr):

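        // ERR is my custom ErrorHandler implementation (not shown here).
        ERR err = new ERR();
        // fis is the input stream over the saved Turtle file.
        RDFParser parser = RDFParserBuilder.create()
            .checking(true)
            .strict(true)
            .forceLang(Lang.TURTLE)
            .source(fis)
            .errorHandler(err)
            .build();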

If ERR detects an error, I use it to trigger another method that checks the object URIs in that graph import for bad characters and swaps them for corrected versions. That is sufficient for now, as I am only studying ORCID's API at the moment.
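The repair method itself is not shown in the thread. One way such a step could look is sketched below, on the assumption that "corrected" means percent-encoding the offending characters, as with the %20 replacement mentioned in the original report. `IriRepair` and `percentEncodeIllegal` are hypothetical names, and the set of characters is the one the Turtle `IRIREF` rule excludes. Note that the encoded string is a different IRI from the original, so this only suits cases where that is acceptable.

```java
final class IriRepair {
    // Characters excluded by the Turtle IRIREF production, besides U+0000..U+0020.
    private static final String FORBIDDEN = "<>\"{}|^`\\";

    /** Percent-encodes every character that a Turtle IRIREF cannot contain; all others are copied as-is. */
    static String percentEncodeIllegal(String iri) {
        StringBuilder sb = new StringBuilder(iri.length() + 16);
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= 0x20 || FORBIDDEN.indexOf(c) >= 0) {
                // Every character handled here is ASCII, so one %XX escape is enough.
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Applied to the URL from the report, this yields `https://dial.uclouvain.be/pr/boreal/search/site/sm_creator:%22Van%20de%20Ven,%20Annelies%22`. Because `%` itself is left untouched, running the method on an already repaired string changes nothing.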