dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Add possibility to combine several type systems #108

Closed ramonziai closed 4 years ago

ramonziai commented 4 years ago

Is your feature request related to a problem? Please describe.

I was looking for a way to import my own typesystem which builds on top of DKPro types. However, currently there is only the possibility to deserialize one XML file into one TypeSystem object, and references to other type systems in the XML files are ignored.

Describe the solution you'd like

I'd like a way to combine my type system and others (e.g. the DKPro type system).

Describe alternatives you've considered

The long way around would be to load both type systems into separate objects, then traverse all types of one of them and add them to the other. This seems clumsy and error-prone.

Additional context

ramonziai commented 4 years ago

Pull request #107 implements the option to pass an already existing type system to the deserializer. @jcklie rightly objected that the conceptually more correct way of achieving this is to add a merge() function for type systems. However, it seems to me that this involves rather more code than the change I submitted, but I might be wrong. At the end of the day, it's not that important to me how this is implemented, so I leave it to the maintainers to decide :-)

reckart commented 4 years ago

I would agree with @jcklie - instead of chaining one typesystem into the next via the deserializer, it would be better to have a method which takes multiple type system descriptions and merges them.

ramonziai commented 4 years ago

Ok, I can see that this is the favored approach, and I understand why. I'm interested in getting this functionality in there soon, so I'd be willing to put in the work. I'm assuming merge() would basically traverse the types and features of one type system, create clones with identical values and add them to the other type system. Is that correct or is there a better way?

jcklie commented 4 years ago

I think that is the way. One also needs to make sure that redefinitions are identical and that inheritance stays correct. I wanted to have a look at it today and over the weekend if that is fast enough for you.

ramonziai commented 4 years ago

Yes it is, thanks a lot :-)

reckart commented 4 years ago

UIMA has the concept of type merging.

So your input is n type systems, none of which is modified during the merge process.

The output is a new type system.

The merging process needs to enter into every individual type. If a type is defined in two source type systems, then the features of all of these definitions are joined together in the target type system.

If a feature is defined in both and it is not equal in both (e.g. an integer in one and a float in the other), then an error is generated.

Likewise, an error is generated if the inheritance of a type differs across type systems.

Official documentation on type merging is here: https://uima.apache.org/d/uimaj-2.10.4/references.html#ugr.ref.cas.typemerging
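
The merge rules described in this comment can be sketched in plain Python. This is a minimal illustration, not the code of UIMA or dkpro-cassis: `TypeDef`, `TypeMergeError`, `merge_type_systems`, and the `example.*` type names are all hypothetical, and a type system is modeled simply as a dict from type name to definition. The sketch applies the strict rule stated above (any supertype mismatch is an error); the UIMA documentation linked above is the authority on the exact rules.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


class TypeMergeError(Exception):
    """Raised when two definitions of the same type cannot be reconciled."""


@dataclass
class TypeDef:
    # Hypothetical stand-in for a type description:
    # a name, a supertype name, and a map of feature name -> range type name.
    name: str
    supertype: str = "uima.cas.TOP"
    features: Dict[str, str] = field(default_factory=dict)


def merge_type_systems(type_systems: Iterable[Dict[str, TypeDef]]) -> Dict[str, TypeDef]:
    """Return a new type system; the inputs are never modified."""
    merged: Dict[str, TypeDef] = {}
    for type_system in type_systems:
        for type_def in type_system.values():
            target = merged.get(type_def.name)
            if target is None:
                # First time we see this type: store a copy, not the original.
                merged[type_def.name] = TypeDef(
                    type_def.name, type_def.supertype, dict(type_def.features)
                )
                continue
            # Same type defined twice: the inheritance must agree.
            if target.supertype != type_def.supertype:
                raise TypeMergeError(
                    f"{type_def.name}: supertype {target.supertype!r} "
                    f"conflicts with {type_def.supertype!r}"
                )
            # Join the features; a feature present in both must have the same range.
            for feature, range_type in type_def.features.items():
                existing = target.features.setdefault(feature, range_type)
                if existing != range_type:
                    raise TypeMergeError(
                        f"{type_def.name}:{feature}: range {existing!r} "
                        f"conflicts with {range_type!r}"
                    )
    return merged


# Usage with two made-up type systems that both define example.Token.
ts_a = {
    "example.Token": TypeDef(
        "example.Token", "uima.tcas.Annotation", {"pos": "uima.cas.String"}
    ),
}
ts_b = {
    "example.Token": TypeDef(
        "example.Token", "uima.tcas.Annotation", {"lemma": "uima.cas.String"}
    ),
    "example.Sentence": TypeDef("example.Sentence", "uima.tcas.Annotation"),
}
merged = merge_type_systems([ts_a, ts_b])
print(sorted(merged))                            # both type names
print(sorted(merged["example.Token"].features))  # features joined from both inputs
```

Building a fresh dict and copying each definition on first sight is what keeps the inputs unmodified, so the same source type systems can be reused in later merges.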

reckart commented 4 years ago

The relevant method in UIMA is: org.apache.uima.util.CasCreationUtils.mergeTypeSystems(Collection<? extends TypeSystemDescription>)

jcklie commented 4 years ago

@ramonziai Do you have an example type system that should be merged with DKPro?

ramonziai commented 4 years ago

@jcklie Here's a (part of the) type system I use, with some references to DKPro types in it: https://unitc-my.sharepoint.com/:u:/g/personal/nnszi01_cloud_uni-tuebingen_de/EZe4a17Bs-xDpsEyASsOOp4Ba72FKDw8F8g94WNDGRxI6w?e=oAdQAa

jcklie commented 4 years ago

@ramonziai I implemented the merging logic from uimaj. I will try to release a new version this week or next. Until then, you can install the current master via pip: python -m pip install git+https://github.com/dkpro/dkpro-cassis
Please close this issue if it works for you.

ramonziai commented 4 years ago

@jcklie Thanks a lot, merging seems to work just fine. Closing issue.