apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

rdfdiff finds too many differences #2285

Open justin2004 opened 5 months ago

justin2004 commented 5 months ago

Version

5.0.0-rc1

What happened?

rdfdiff reports more differences than actually exist between two files. In the case below, the two files are isomorphic except that one has 1 additional triple, yet rdfdiff reports several other differences between them.

$ cat issue.ttl
@prefix ex: <http://example.com/> .

ex:region_10
        a ex:Region ;
        ex:related [
                ex:memberList (
                        ex:region_11
                ) ;
        ] ;
        .

$ cat issue1.ttl
@prefix ex: <http://example.com/> .

ex:region_10
        a ex:Region ;
        ex:related [
                ex:memberList (
                        ex:region_11
                ) ;
        ] ;
        .

$ ~/Downloads/apache-jena-5.0.0-rc1/bin/rdfdiff issue.ttl issue1.ttl ttl ttl
models are equal

# That was as expected.
# But now, when I add a single triple to one of the files:

$ cat issue1.ttl 
@prefix ex: <http://example.com/> .

ex:region_16 a ex:Region .

ex:region_10
        a ex:Region ;
        ex:related [
                ex:memberList (
                        ex:region_11
                ) ;
        ] ;
        .
$ ~/Downloads/apache-jena-5.0.0-rc1/bin/rdfdiff issue.ttl issue1.ttl ttl ttl
models are unequal

< 5 triples
> 6 triples
< [http://example.com/region_10, http://example.com/related, _:2f46e8a27dca00569d3d04d42c3f3c53]
< [_:2f46e8a27dca00569d3d04d42c3f3c53, http://example.com/memberList, _:f0c1b4df8bd882e0dc926aee91d36315]
< [_:f0c1b4df8bd882e0dc926aee91d36315, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil]
< [_:f0c1b4df8bd882e0dc926aee91d36315, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://example.com/region_11]
> [_:d310f6311ac18f96e2c81f63007497e6, http://example.com/memberList, _:fe69945f1dff1961864044d0b5a4c756]
> [http://example.com/region_16, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://example.com/Region]
> [http://example.com/region_10, http://example.com/related, _:d310f6311ac18f96e2c81f63007497e6]
> [_:fe69945f1dff1961864044d0b5a4c756, http://www.w3.org/1999/02/22-rdf-syntax-ns#rest, http://www.w3.org/1999/02/22-rdf-syntax-ns#nil]
> [_:fe69945f1dff1961864044d0b5a4c756, http://www.w3.org/1999/02/22-rdf-syntax-ns#first, http://example.com/region_11]

# But I expected only a single-triple difference between the two files.

I expected output something like this:

models are unequal

< 0 triples
> 1 triples
> [http://example.com/region_16, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://example.com/Region]

I got the same results on Jena 4.10.x.

Relevant output and stacktrace

No response

Are you interested in making a pull request?

None

afs commented 5 months ago

See the comment in https://github.com/apache/jena/commit/a56fa1f
Code: https://github.com/apache/jena/blob/main/jena-cmds/src/main/java/arq/rdfdiff.java

The code has not changed in quite some time.

It's printing the two files.
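
For illustration only, here is a rough Python sketch of why every blank node triple shows up in the output above. It is not the rdfdiff code. It assumes that each parse assigns fresh blank node labels and that triples are compared by exact term equality; the labels `_:a1`, `_:b1`, etc. and the `graph` helper are made up for the example.

```python
# Hypothetical stand-in for the two parsed files; not Jena code.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.com/"


def graph(b1, b2, extra=()):
    """Build the example graph from the issue, using b1/b2 as blank node labels."""
    triples = {
        (EX + "region_10", RDF + "type", EX + "Region"),
        (EX + "region_10", EX + "related", b1),
        (b1, EX + "memberList", b2),
        (b2, RDF + "first", EX + "region_11"),
        (b2, RDF + "rest", RDF + "nil"),
    }
    return triples | set(extra)


# Assumption: each parse hands out its own blank node labels.
left = graph("_:a1", "_:a2")
right = graph(
    "_:b1", "_:b2",
    extra=[(EX + "region_16", RDF + "type", EX + "Region")],
)

# Label-based set difference: any triple that mentions a blank node
# can never match, because the labels differ between the two parses.
only_left = left - right
only_right = right - left

for t in sorted(only_left):
    print("<", t)
for t in sorted(only_right):
    print(">", t)
```

Under those assumptions the only triple that matches on both sides is the ground triple `ex:region_10 a ex:Region`, which is consistent with the 4 `<` lines and 5 `>` lines in the report.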

FWIW I think finding a minimal difference of two unordered collections with bnode isomorphism is quite a difficult problem. Even a plain text diff can produce non-optimal difference files. Hope to be proved wrong.
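
To give a feel for the cost, a brute-force sketch in Python (not Jena code, and not a proposal for rdfdiff): try every injective renaming of the left graph's blank nodes onto the right graph's blank nodes, or onto spare "unmatched" labels, and keep the renaming with the smallest symmetric difference. The helper names, the `_:` label convention, and the `_:unmatched` labels are assumptions made for this example.

```python
from itertools import permutations

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.com/"


def bnodes(g):
    """Blank node labels in a graph (assumed to start with '_:')."""
    return sorted({t for triple in g for t in triple if t.startswith("_:")})


def rename(g, mapping):
    """Apply a blank node renaming to every triple of the graph."""
    return {tuple(mapping.get(t, t) for t in triple) for triple in g}


def minimal_diff(left, right):
    """Exhaustive search for the renaming of left's blank nodes that
    minimises the symmetric difference with right."""
    lb, rb = bnodes(left), bnodes(right)
    # Spare labels let a left blank node stay unmatched.
    # Assumption: these labels do not occur in either graph.
    spare = [f"_:unmatched{i}" for i in range(len(lb))]
    best = None
    for image in permutations(rb + spare, len(lb)):
        renamed = rename(left, dict(zip(lb, image)))
        diff = (renamed - right, right - renamed)
        if best is None or sum(map(len, diff)) < sum(map(len, best)):
            best = diff
    return best


# The two graphs from the issue, with made-up blank node labels.
left = {
    (EX + "region_10", RDF + "type", EX + "Region"),
    (EX + "region_10", EX + "related", "_:a1"),
    ("_:a1", EX + "memberList", "_:a2"),
    ("_:a2", RDF + "first", EX + "region_11"),
    ("_:a2", RDF + "rest", RDF + "nil"),
}
right = rename(left, {"_:a1": "_:b1", "_:a2": "_:b2"}) | {
    (EX + "region_16", RDF + "type", EX + "Region"),
}

only_left, only_right = minimal_diff(left, right)
print("<", len(only_left), "triples")
print(">", len(only_right), "triples")
for t in sorted(only_right):
    print(">", t)
```

With two blank nodes per file this is only 12 candidate renamings, but the count grows factorially with the number of blank nodes, so an exhaustive search like this would not scale to real data.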