LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0
15 stars 7 forks source link

allow for multiple foreign metadata nodes in FoLiA, even in 'native' mode. #46

Closed kosloot closed 3 years ago

kosloot commented 3 years ago

Use case: we have a file where we want to add a foreign metadata node:

  <metadata type="foreign">
     <foreign-data>
        <paragraphStyles xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
...
    </foreign-data>
  </metadata>

This works. But we also would like to add:

 <metadata type="native">
   <meta id="abby_file">piroska.abby.xml</meta>
  </metadata>

But the current implementation doesn't allow this. Is this an oversight? I assume submetadata is meant for this? It seems the API doesn't provide an easy way.

proycon commented 3 years ago

There can indeed only one metadata block currently, which holds the main document metadata. There can be many submetadata elements as you want, with ideally an explicitly defined span (elements pointing back using their metadata attribute). In this case I guess it's acceptable to use submetadata, probably pointing back to it from specific paragraphs?

We may at once point have thought about multiple foreign-data elements, allowing people to use multiple schemes at the same time, I'm not sure if that's currently allowed or not, it's not explicitly documented either way. Same for mixing the foreign data with the native one, our specification doesn't say much about that yet, so we could still go that way if deemed needed. (It does make things a bit messy though).

kosloot commented 3 years ago

Still I feel uncomfortable about (ab)-using submetadata to store trivia likt the filename, genre, and language. It also means that you have (a little) more work retrieving such info. Maybe it is cleaner to allow for 1 foreign and 1 native metadata field?

In hindsight having something like this woule have been better?:

<metadata>
  <native_metadata/>
  <foreign_metadata/>
</metadata>

preferring nodes over attributes. Something for FoLiA 3? It might the current implementation a bit cleaner too, i assume

proycon commented 3 years ago

I see your point yes. We could see whether we can combine the two without any extra elements, perhaps introducing a type="mixed", in that case we should also explicitly allow multiple foreign-data elements.

<metadata type="mixed">
  <meta id="foo">bar</meta>
  <foreign-data/>
</metadata>
kosloot commented 3 years ago

Hmm, but whats the point of having a type anyway? Why doesn't just metadata work, with optional foreign-data children? Would it break something essential?

proycon commented 3 years ago

Initially the idea of the type was to make clear what kind of metadata could be expected, if we don't have an explicit type then parsers have to deduce the type by checking what kinds of sub-elements there are, making things a little bit more difficult. If the type is not specified it defaults to "native" in foliapy.

I checked the current implementation in foliapy, I do allow multiple foreign-data elements there, but not mixing the two, perhaps we should indeed just allow foreign-data even if the type is "native".

(I'm transferring this issue to folia instead of libfolia)

proycon commented 3 years ago

(I'm transferring this issue to folia instead of libfolia)

Oh, guess Github doesn't support hat because the repos are not under the same user.. silly..

kosloot commented 3 years ago

Sounds OK. So we allow multiple foreign-data in a metadata node, And the type is in fact ignored. (assuming that a type of "foreign" no longer blocks adding native attribute-value pairs, as the current libfolia does.)

Also be careful: besides 'native' and 'foreign' we have external metadata, identified by a 'src' attribute. (not by a type?. hmm)

kosloot commented 3 years ago

Ok, only a small change in libfolia was needed to allow for mixing off 'native', 'foreign' and 'external' metadata. libfolia already handled multiple foreign-data structures. The type attribute really seems redundant now. @proycon I suggest to start ignoring it.

@proycon I stil have some doubt about the ExternalMetadata. can those be mixed in also? And if so, how many times?

@proycon I couldn't find this issue under https://github.com/proycon/folia/issues/ to comment there

proycon commented 3 years ago

Right now the external metadata is derived from the src attribute on <metadata>, therefore there can only be one. But in principle I guess we should allow multiple ExternalMetadata elements, but then we'll need a different serialisation, perhaps something like:

<metadata>
    <external-metadata src="A"  type="a" />
    <external-metadata src="B" type="b"/>
</metadata>

We could move the type attribute there then.

Of course backward compatibility with the current method should be retained, making the following two excerpts semantically equivalent:

<metadata src="A" type="a">
</metadata>
<metadata>
    <external-metadata src="A" type="a" />
</metadata>

I'm a bit wary of the extra complexity though.

kosloot commented 3 years ago

My first rather simplistic implementation of this enhancement in libfolia still allows for only one ExternalMetaData. Is this really a problem, I wonder? I don't know of any real use-case. So it's hard to ponder about it.

My first reaction is to leave this unchanged, and keep backwards compatibility. There is always the escape of sub-metadata. BTW, these use a type attribute too, and don't allow for "mixing" Foreign, Native and External. But I don't thimk there is problem while there may be any number of them already.

proycon commented 3 years ago

Yeah, I don't have a use case yet either so let's just leave that for now and focus on mixing only Foreign and Native.

kosloot commented 3 years ago

I consider this done. Implemented in libfolia 2.7

pirolen commented 3 years ago

FLAT complains upon upload: (perhaps I need to still update something. I now uodated to FoLiA-Linguistic-Annotation-Tool-0.9.4)

Uploaded file is no valid FoLiA Document: Encountered a foreign-data element but metadata type is native!Traceback (most recent call last): -- File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/foliadocserve/foliadocserve.py", line 940, in upload -- doc = folia.Document(string=data,setdefinitions=self.docstore.setdefinitions, loadsetdefinitions=True, autodeclare=True, allowadhocsets=True, processor=mainprocessor) -- File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/folia/main.py", line 7188, in __init__ -- self.parsexml(self.tree.getroot()) -- File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/folia/main.py", line 8391, in parsexml -- self.parsemetadata(subnode) -- File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/folia/main.py", line 8252, in parsemetadata -- raise MetaDataError("Encountered a foreign-data element but metadata type is native!") -- folia.main.MetaDataError: Encountered a foreign-data element but metadata type is native!

kosloot commented 3 years ago

Doe you have the offensive file at hand? Is is made using FoLiA-abby?

pirolen commented 3 years ago

It's the usual files, from FoLiA-abby.

kosloot commented 3 years ago

Ok, those files are accepted by both folialint AND foliavalidator, so there must be a problem within Flat or your configuration

pirolen commented 3 years ago

I tried with two different FLAT installations, both updated to FoLiA-Linguistic-Annotation-Tool-0.9.4, but might need to update something still?

proycon commented 3 years ago

If it's a LaMachine development installation, then it should have the latest versions. Stable can be a bit behind but this is something that has been released and should work. If you installed FLAT manually then you may need to also do: pip3 upgrade -U folia foliatools. Our public FLAT instance in Nijmegen may also be a little bit behind still.

pirolen commented 3 years ago

Both are manual installations (one with the django dev server, the other with nginx). pip3 install -U, rather? That would upgrade folia, but not foliatools:

pip3 install -U folia foliatools Requirement already satisfied: folia in ./flatenv/lib/python3.7/site-packages (2.4.1) Collecting folia Downloading FoLiA-2.4.6.tar.gz (193 kB) |████████████████████████████████| 193 kB 7.3 MB/s ERROR: Could not find a version that satisfies the requirement foliatools ERROR: No matching distribution found for foliatools

pirolen commented 3 years ago

I mean: ERROR: unknown command "upgrade"

proycon commented 3 years ago

Sorry, my mistake! I was a bit too quick in posting my reply and messed up the syntax, it should indeed be pip3 install -U. It should also be folia-tools (with a hypen) instead of what I said earlier.

pirolen commented 3 years ago

Yay! It worked!

pirolen commented 3 years ago

How can the metadata elements be accessed with foliapy? Calling doc.metadata only returns this: ['abby_file']

Guess it is the NativeMetaData object I'm after. E.g. I would at some point like to remove the foreign-data node (with all the baroque Abbyy style information). If I remove them from Abbyy XML directly, then FoLiA-abby breaks.

proycon commented 3 years ago

in foliapy, doc.metadata should return the NativeMetaData object, which you can just treat like a normal dictionary:

for key, value in doc.metadata.items():
   ...

The metadata object also has a `next attribute in case there is more metadata (like foreign metadata). You can remove it by setting doc.metadata.next = None and then saving.

pirolen commented 3 years ago

Thanks! Removing worked.

doc.metadata should return the NativeMetaData object

Iterating over doc.metadata.items() still only returns the single item that the <meta> element holds, e.g. printing (k,v) returns ('abby_file', 'b1_3_1_mwtext_ostpreuss_pp109_277_001_abpproc.xml')

proycon commented 3 years ago

Yes? Is that not what you'd expect? You say there's only one meta item right?

pirolen commented 3 years ago

Yes, there is one meta item. Just that I thought that one would by .items() access the children of <metadata type="native">, incl. <foreign-data> etc. But in any case, its removal was the goal, and with the next attribute it works like a charm :-)

kosloot commented 3 years ago

@pirolen

E.g. I would at some point like to remove the foreign-data node (with all the baroque Abbyy style information).
If I remove them from Abbyy XML directly, then FoLiA-abby breaks.

Well, those foreign data are added on your special request, but OK. Still FoLiA-abby shouldn't break if you remove them. (although, having NO metadata at all could be a problem) can you send me an example file?

proycon commented 3 years ago

Just that I thought that one would by .items() access the children of , incl.

Nope, you can indeed only access the meta elements that way. If you want to access the foreign data, you have to go through the next attribute which may give you a ForeignData element. That one in turn has a node attribute that simply returns the very raw XML node associated with the foreign data (since it's foreign you're on your own and the library can't do any particular parsing for you here besides aside from the normal XML stuff)

pirolen commented 3 years ago

Well, those foreign data are added on your special request

I know, and thanks very much for it! They might be needed, but when end users get the files, they are somewhat shocked ;-) so it's good to get it cleaned up.

Still FoLiA-abby shouldn't break if you remove them. (although, having NO metadata at all could be a problem)

I tried deleting all or less. One problem could be that when I remove nodes from abbyy (with BeautifulSoup...), this leaves behind empty lines. I have been doing such postprocessing in the body of the Abbyy XML regularly, and FoLiA-abby thankfully did not choke on them. But in the metadata section it does. I attach an example.

vorbemtest_abpproc.xml.txt

kosloot commented 3 years ago

Fixed handling of Abbyy files without real paragraphStyle info

pirolen commented 3 years ago

Thank you very much!