Closed kosloot closed 3 years ago
There can indeed only one metadata block currently, which holds the main document metadata. There can be many submetadata elements as you want, with ideally an explicitly defined span (elements pointing back using their metadata attribute). In this case I guess it's acceptable to use submetadata, probably pointing back to it from specific paragraphs?
We may at once point have thought about multiple foreign-data elements, allowing people to use multiple schemes at the same time, I'm not sure if that's currently allowed or not, it's not explicitly documented either way. Same for mixing the foreign data with the native one, our specification doesn't say much about that yet, so we could still go that way if deemed needed. (It does make things a bit messy though).
Still I feel uncomfortable about (ab)-using submetadata to store trivia likt the filename, genre, and language. It also means that you have (a little) more work retrieving such info. Maybe it is cleaner to allow for 1 foreign and 1 native metadata field?
In hindsight having something like this woule have been better?:
<metadata>
<native_metadata/>
<foreign_metadata/>
</metadata>
preferring nodes over attributes. Something for FoLiA 3? It might the current implementation a bit cleaner too, i assume
I see your point yes. We could see whether we can combine the two without any extra elements, perhaps introducing a type="mixed"
, in that case we should also explicitly allow multiple foreign-data elements.
<metadata type="mixed">
<meta id="foo">bar</meta>
<foreign-data/>
</metadata>
Hmm, but whats the point of having a type anyway? Why doesn't just metadata work, with optional foreign-data children? Would it break something essential?
Initially the idea of the type was to make clear what kind of metadata could be expected, if we don't have an explicit type then parsers have to deduce the type by checking what kinds of sub-elements there are, making things a little bit more difficult. If the type is not specified it defaults to "native" in foliapy.
I checked the current implementation in foliapy, I do allow multiple foreign-data elements there, but not mixing the two, perhaps we should indeed just allow foreign-data even if the type is "native".
(I'm transferring this issue to folia instead of libfolia)
(I'm transferring this issue to folia instead of libfolia)
Oh, guess Github doesn't support hat because the repos are not under the same user.. silly..
Sounds OK. So we allow multiple foreign-data in a metadata node, And the type is in fact ignored. (assuming that a type of "foreign" no longer blocks adding native attribute-value pairs, as the current libfolia does.)
Also be careful: besides 'native' and 'foreign' we have external metadata, identified by a 'src' attribute. (not by a type?. hmm)
Ok, only a small change in libfolia was needed to allow for mixing off 'native', 'foreign' and 'external' metadata. libfolia already handled multiple foreign-data structures. The type attribute really seems redundant now. @proycon I suggest to start ignoring it.
@proycon I stil have some doubt about the ExternalMetadata. can those be mixed in also? And if so, how many times?
@proycon I couldn't find this issue under https://github.com/proycon/folia/issues/ to comment there
Right now the external metadata is derived from the src
attribute on <metadata>
, therefore there can only be one. But in principle I guess we should allow multiple ExternalMetadata
elements, but then we'll need a different serialisation, perhaps something like:
<metadata>
<external-metadata src="A" type="a" />
<external-metadata src="B" type="b"/>
</metadata>
We could move the type
attribute there then.
Of course backward compatibility with the current method should be retained, making the following two excerpts semantically equivalent:
<metadata src="A" type="a">
</metadata>
<metadata>
<external-metadata src="A" type="a" />
</metadata>
I'm a bit wary of the extra complexity though.
My first rather simplistic implementation of this enhancement in libfolia still allows for only one ExternalMetaData. Is this really a problem, I wonder? I don't know of any real use-case. So it's hard to ponder about it.
My first reaction is to leave this unchanged, and keep backwards compatibility. There is always the escape of sub-metadata. BTW, these use a type attribute too, and don't allow for "mixing" Foreign, Native and External. But I don't thimk there is problem while there may be any number of them already.
Yeah, I don't have a use case yet either so let's just leave that for now and focus on mixing only Foreign and Native.
I consider this done. Implemented in libfolia 2.7
FLAT complains upon upload: (perhaps I need to still update something. I now uodated to FoLiA-Linguistic-Annotation-Tool-0.9.4)
Uploaded file is no valid FoLiA Document: Encountered a foreign-data element but metadata type is native!Traceback (most recent call last): -- File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/foliadocserve/foliadocserve.py", line 940, in upload -- doc = folia.Document(string=data,setdefinitions=self.docstore.setdefinitions, loadsetdefinitions=True, autodeclare=True, allowadhocsets=True, processor=mainprocessor) -- File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/folia/main.py", line 7188, in __init__ -- self.parsexml(self.tree.getroot()) -- File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/folia/main.py", line 8391, in parsexml -- self.parsemetadata(subnode) -- File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/folia/main.py", line 8252, in parsemetadata -- raise MetaDataError("Encountered a foreign-data element but metadata type is native!") -- folia.main.MetaDataError: Encountered a foreign-data element but metadata type is native!
Doe you have the offensive file at hand? Is is made using FoLiA-abby?
It's the usual files, from FoLiA-abby.
Ok, those files are accepted by both folialint AND foliavalidator, so there must be a problem within Flat or your configuration
I tried with two different FLAT installations, both updated to FoLiA-Linguistic-Annotation-Tool-0.9.4, but might need to update something still?
If it's a LaMachine development installation, then it should have the latest versions. Stable can be a bit behind but this is something that has been released and should work. If you installed FLAT manually then you may need to also do: pip3 upgrade -U folia foliatools
. Our public FLAT instance in Nijmegen may also be a little bit behind still.
Both are manual installations (one with the django dev server, the other with nginx). pip3 install -U, rather? That would upgrade folia, but not foliatools:
pip3 install -U folia foliatools Requirement already satisfied: folia in ./flatenv/lib/python3.7/site-packages (2.4.1) Collecting folia Downloading FoLiA-2.4.6.tar.gz (193 kB) |████████████████████████████████| 193 kB 7.3 MB/s ERROR: Could not find a version that satisfies the requirement foliatools ERROR: No matching distribution found for foliatools
I mean: ERROR: unknown command "upgrade"
Sorry, my mistake! I was a bit too quick in posting my reply and messed up the syntax, it should indeed be pip3 install -U
. It should also be folia-tools (with a hypen) instead of what I said earlier.
Yay! It worked!
How can the metadata elements be accessed with foliapy?
Calling doc.metadata
only returns this:
['abby_file']
Guess it is the NativeMetaData object I'm after. E.g. I would at some point like to remove the foreign-data node (with all the baroque Abbyy style information). If I remove them from Abbyy XML directly, then FoLiA-abby breaks.
in foliapy, doc.metadata
should return the NativeMetaData object, which you can just treat like a normal dictionary:
for key, value in doc.metadata.items():
...
The metadata
object also has a `next
attribute in case there is more metadata (like foreign metadata). You can remove it by setting doc.metadata.next = None
and then saving.
Thanks! Removing worked.
doc.metadata should return the NativeMetaData object
Iterating over doc.metadata.items() still only returns the single item that the <meta>
element holds, e.g. printing (k,v) returns
('abby_file', 'b1_3_1_mwtext_ostpreuss_pp109_277_001_abpproc.xml')
Yes? Is that not what you'd expect? You say there's only one meta item right?
Yes, there is one meta item.
Just that I thought that one would by .items() access the children of <metadata type="native">
, incl. <foreign-data>
etc.
But in any case, its removal was the goal, and with the next
attribute it works like a charm :-)
@pirolen
E.g. I would at some point like to remove the foreign-data node (with all the baroque Abbyy style information).
If I remove them from Abbyy XML directly, then FoLiA-abby breaks.
Well, those foreign data are added on your special request, but OK. Still FoLiA-abby shouldn't break if you remove them. (although, having NO metadata at all could be a problem) can you send me an example file?
Just that I thought that one would by .items() access the children of
, incl.
Nope, you can indeed only access the meta elements that way. If you want to access the foreign data, you have to go through the next
attribute which may give you a ForeignData element. That one in turn has a node
attribute that simply returns the very raw XML node associated with the foreign data (since it's foreign you're on your own and the library can't do any particular parsing for you here besides aside from the normal XML stuff)
Well, those foreign data are added on your special request
I know, and thanks very much for it! They might be needed, but when end users get the files, they are somewhat shocked ;-) so it's good to get it cleaned up.
Still FoLiA-abby shouldn't break if you remove them. (although, having NO metadata at all could be a problem)
I tried deleting all or less. One problem could be that when I remove nodes from abbyy (with BeautifulSoup...), this leaves behind empty lines. I have been doing such postprocessing in the body of the Abbyy XML regularly, and FoLiA-abby thankfully did not choke on them. But in the metadata section it does. I attach an example.
Fixed handling of Abbyy files without real paragraphStyle info
Thank you very much!
Use case: we have a file where we want to add a foreign metadata node:
This works. But we also would like to add:
But the current implementation doesn't allow this. Is this an oversight? I assume submetadata is meant for this? It seems the API doesn't provide an easy way.