bridgedb / BridgeDb

The BridgeDb Library source code
https://bridgedb.org/
Apache License 2.0
28 stars 21 forks source link

Tolerate loading empty linkset files into IMS #62

Open randykerber opened 7 years ago

randykerber commented 7 years ago

Currently, when attempt to load a linkset file into IMS, if the linkset file is empty an Exception is thrown and the load terminates. emty means there are zero links in the linkset file. Here is contents of such a file:

randykerber commented 7 years ago

file: PDB/LINKSET_EXACT_PDB20171110.ttl

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ops: <http://chemistry.openphacts.org//> .
@prefix : <http://chemistry.openphacts.org/download/20171110/PDB/LINKSET_EXACT_PDB20171110.ttl#> .
<http://chemistry.openphacts.org/download/20171110/PDB/LINKSET_EXACT_PDB20171110.ttl> void:inDataset <http://chemistry.openphacts.org/download/20171110/void_2017-11-10.ttl#pdb_exactMatch> .
randykerber commented 7 years ago

The only triple in the file is a void:inDataset statement. There are no "link" triples.

randykerber commented 7 years ago

I remember running into this issue earlier in the year. When I looked into it I recall the Exception was caused by some sort of pre-check phase performed on each file, where some piece of code attempted to read the first link triple/line of the file and verify that it had the correct link-predicate (and perhaps other checks). However, when there was no "first record", it resulted in an Exception. I could not see a way to fix without possibly wrecking other functionality.

This was probably 6+ months ago, so memory is a bit fuzzy.

randykerber commented 7 years ago

The correct behavior is probably to issue some sort of WARNING message that an empty linkset file was encountered, but to otherwise allow the process to proceed. Basically, an empty linkset file should be allowed and proceed with the load.

AlasdairGray commented 7 years ago

Can you justify why we should allow loading an empty linkset file?

It is a deliberate quality assurance step that the loader performs these test.

egonw commented 7 years ago

We were thinking about automating things more, and in an automated process it's easy enough to have a data set where there are no links... they found three such link sets right now... is an empty link set semantically wrong?

AlasdairGray commented 7 years ago

We took the view that it is wrong. You have metadata saying there are links between these two data sources but then there are no links.

As such, we wanted the automated loading process to fail so that we would have to fix the issue rather than just seeing a bunch of warnings that are easy to ignore and pass over.

stain commented 7 years ago

Agree with @AlasdairGray that a linkset with no links is not really a linkset (and so should not declare itself as that), and a warning would be easily ignored (as IMS has loads of them) - and so you would no longer detect the probably more common case that the void declares the wrong void:linkPredicate.

But perhaps it could be allowed if the void also says void:triples 0?

egonw commented 7 years ago

I'm in favor of allowing empty sets. I don't think the IMS loading is the right place or time to complain about this. The IMS function does not break after loading an empty set (which is still a set). The predicate cannot be wrong either, with zero links, so an exception for that sounds semantically wrong to me.

I see a link set as the result of a calculation (process) to determine the links. In fact, for many of our link sets these are in fact calculated. For others, the process is an analysis of an external data set...

An explicit claim that given that process that there are zero links makes a lot of sense to me. Without any VoID we also loose the observation that there were no links.

AlasdairGray commented 7 years ago

I think your final point is more about the outcome of the linking process rather than the use of the linksets.

I would argue that from the use of linksets, having a set with no links doesn't make sense and will have an impact on the performance of the system, as we could easily generate thousands of linksets with no links, e.g. UniProt:Protein to aers:AdverseEvent.

stain commented 7 years ago

I don't think performance will be an issue, but there might be an issue if the empty linkset causes a false transitive path possibility. Not tested!

But we can just add an --ignore-empty flag then and see how it goes? To the command line or loader XML file?

AlasdairGray commented 7 years ago

I was not referring the performance of the loader, but of the computation of links.

@egonw can you give a concrete use case where you have a linkset with no links that should be loaded into the IMS?

egonw commented 7 years ago

@AlasdairGray, Randy/Ian are trying to finish the OPS 2.2 API release and running into empty link sets. Since many link sets are autogenerated, as part of a update release flow, some end up empty. See the aforementioned example by Randy...

AlasdairGray commented 7 years ago

From the data governance perspective, I think it is better that we weed out the empty linksets as part of the loading process. If they are truely meant to be empty then we don't need to load them, they would generate no query answers. If they are not meant to be empty then we need to find out why they are empty.

egonw commented 6 years ago

@randykerber, what command line call did you make on the above (empty) link set that we can use to reproduce the exception? (And thus, test the --ignore-empty option to be implemented...)

randykerber commented 6 years ago

The load command line would have used a load.xml that loaded all openphacts linksets (more than 100). I'm guessing you want something more concise?

The second entry in this issue has a complete linkset file that triggered this exception. I can fetch the void file that goes along with this. That enough?

randykerber commented 6 years ago

I've attached the void file (below) that references the zero-link linkset listed in the second entry of this 'issues' page.

I had to rename the file to void_2017-11-10.ttl.txt, else GitHub would not accept it. Should be named void_2017-11-10.ttl.

void_2017-11-10.ttl.txt

egonw commented 6 years ago

Stacktrace:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.simontuffs.onejar.Boot.run(Boot.java:340)
        at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: org.bridgedb.utils.BridgeDBException: Error loading 
        at uk.ac.manchester.cs.openphacts.ims.loader.RunLoader.main(RunLoader.java:241)
        ... 6 more
Caused by: org.bridgedb.utils.BridgeDBException: Found 0 where 1 expected
        at uk.ac.manchester.cs.openphacts.ims.loader.LinksetLoader.getObjectURI(LinksetLoader.java:169)
        at uk.ac.manchester.cs.openphacts.ims.loader.LinksetLoader.getParser(LinksetLoader.java:132)
        at uk.ac.manchester.cs.openphacts.ims.loader.LinksetLoader.load(LinksetLoader.java:91)
        at uk.ac.manchester.cs.openphacts.ims.loader.LinksetLoader.load(LinksetLoader.java:69)
        at uk.ac.manchester.cs.openphacts.ims.loader.RunLoader.loadLinkset(RunLoader.java:102)
        at uk.ac.manchester.cs.openphacts.ims.loader.RunLoader.main(RunLoader.java:225)
        ... 6 more
egonw commented 6 years ago

OK, turns out that to test this, I need to bake at least a new .war and possibly a new Docker... I am working on this, but Tina will likely want to make a new PathVisio release and Anders a new pvjs release, so pushing this to the next release... which will, in all fairness, follow really soon... Oh, and Jonathan is waiting for a release too... #rsro