Rothamsted / knetbuilder

KnetBuilder data integration platform for building knowledge graphs. Previously known as ondex.
https://knetminer.com
MIT License

TAB Parser - relation type merging #8

Closed KeywanHP closed 6 years ago

KeywanHP commented 7 years ago

An Ondex graph is restricted to one relation of a given type X between two nodes. What does the TAB parser do when it has already created a relation A-B and a new A-B relation, with different attributes, occurs in the TAB file?

We can discuss options here...

marco-brandizi commented 7 years ago

This is entirely up to the pre-existing parser, which the new tab parser uses. I've tested it (https://goo.gl/ohe7Bu, see testDupedRelation()) and it looks buggy: it merges relations of the same type between two nodes, but it loses attributes when doing so. In the example, 'Note 1' and 'Note 2' are attributes on two lines that share the same pair of concepts, and the final graph has only 'Note 1'. I think it's hard to debug; a quicker fix might be to use multiple columns for multiple attributes and avoid spreading them over multiple rows.

marco-brandizi commented 7 years ago

Another option could be to call PathParser.setProcessingOptions( null ), which would make the parser return a completely redundant graph (i.e., for the rows about Note 1 and Note 2, there would be two relations, pointing to four concepts, with two accessions and two concepts per accession). Then I could write a filter plug-in to perform a relation merge that works better.
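To make the "completely redundant graph" concrete, here is a rough sketch in plain Python. It does not use the ONDEX API; the function name `parse_unmerged`, the dictionary layout and the accessions "A" and "B" are made up for illustration. It only shows the shape of the result: every row creates fresh concepts, so two rows about the same pair give two relations and four concepts.

```python
from itertools import count


def parse_unmerged(rows):
    """Create two new concepts and one relation per row, with no merging.

    Hypothetical helper for illustration only, not part of ONDEX.
    """
    ids = count(1)
    concepts, relations = [], []
    for source_acc, target_acc, attrs in rows:
        # Each row gets its own concept objects, even if the accession
        # has already been seen in an earlier row.
        src = {"id": next(ids), "accession": source_acc}
        dst = {"id": next(ids), "accession": target_acc}
        concepts += [src, dst]
        relations.append({"from": src["id"], "to": dst["id"], "attrs": dict(attrs)})
    return concepts, relations


rows = [
    ("A", "B", {"note": "Note 1"}),
    ("A", "B", {"note": "Note 2"}),
]
concepts, relations = parse_unmerged(rows)
print(len(concepts), "concepts,", len(relations), "relations")
```

Nothing is lost at this stage, which is the point: merging is deferred to a later step that can handle the attributes properly.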

Opinions on this welcome.

marco-brandizi commented 7 years ago

I've added the merge options to the plug-in, so that it's now possible to pass these options to the underlying PathParser.

This is a multi-string parameter and can have the values MERGE_ON_ACCESSIONS, MERGE_NAME, MERGE_GDS (data sources). An empty/null value for this parameter defaults to MERGE_ON_ACCESSIONS, which was previously the value set for PathParser. Mapping empty/null values to this default ensures backward compatibility (i.e., you can load workflow files that don't mention this parameter and get the same behaviour they had in the past).
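The defaulting rule can be sketched as follows. This is plain Python written only to illustrate the rule, not the plug-in's actual code; the function name `resolve_merge_options` is hypothetical, and only the three option names come from the description above.

```python
DEFAULT_MERGE_OPTIONS = ["MERGE_ON_ACCESSIONS"]


def resolve_merge_options(values):
    """Return the merge options to use for a workflow parameter value.

    Hypothetical helper: a missing or empty parameter falls back to the
    old behaviour, anything else is passed through unchanged.
    """
    if not values:  # covers both None and an empty list
        return list(DEFAULT_MERGE_OPTIONS)
    return list(values)


print(resolve_merge_options(None))
print(resolve_merge_options(["MERGE_NAME", "MERGE_GDS"]))
```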

With this parameter available, it is now possible to obtain an 'unmerged' graph from the tabular parser and then apply the mapping plugin 'lowmemoryaccessionbased', followed by the transformer 'relationcollapser'. This will first map equivalent nodes together and then merge them into one and, in doing so, the relations that involved the same nodes will be collapsed, with their attributes merged. In particular, if two collapsed relations have two values for an attribute named X, an additional attribute X_1 will be created for the collapsed relation (the ONDEX data model doesn't support multi-value attributes).

Here is an example of that: a file of gene->encodes->protein rows is imported, in which two pairs are each repeated twice, with two different values for 'Test Relation Attribute'.
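The attribute handling during collapsing can be sketched like this. It is plain Python over tuples and dictionaries, not the 'relationcollapser' code. The names `add_attribute` and `collapse_relations`, the gene/protein identifiers and the attribute values are all made up for illustration. Skipping a repeated identical value is an assumption on my side, since only the case of differing values is described above.

```python
def add_attribute(attrs, name, value):
    """Store value under name, or under name_1, name_2, ... if name is taken."""
    candidate, suffix = name, 0
    while candidate in attrs:
        if attrs[candidate] == value:
            return  # assumption: an identical value is not stored twice
        suffix += 1
        candidate = f"{name}_{suffix}"
    attrs[candidate] = value


def collapse_relations(relations):
    """Merge relations sharing (source, type, target) into one attribute set."""
    collapsed = {}
    for source, rel_type, target, attrs in relations:
        merged = collapsed.setdefault((source, rel_type, target), {})
        for name, value in attrs.items():
            add_attribute(merged, name, value)
    return collapsed


# Made-up rows: two gene->encodes->protein pairs, each repeated twice.
relations = [
    ("gene1", "encodes", "protein1", {"Test Relation Attribute": "value A"}),
    ("gene1", "encodes", "protein1", {"Test Relation Attribute": "value B"}),
    ("gene2", "encodes", "protein2", {"Test Relation Attribute": "value C"}),
    ("gene2", "encodes", "protein2", {"Test Relation Attribute": "value D"}),
]
collapsed = collapse_relations(relations)
for key, attrs in collapsed.items():
    print(key, attrs)
```

Each pair ends up as a single relation carrying both 'Test Relation Attribute' and 'Test Relation Attribute_1'.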

@KeywanHP, @AjitPS , @Monika-Mistry, please let me know if this is fine for the use case at issue.

marco-brandizi commented 7 years ago

@KeywanHP, @AjitPS please confirm if this solution is fine, so that we can close this ticket.

AjitPS commented 7 years ago

@KeywanHP can we close this now?