LanguageMachines / libfolia

FoLiA library for C++
https://proycon.github.io/folia
GNU General Public License v3.0

textclass properties on entities not honoured when interpreting wref/@t #21

Open proycon opened 6 years ago

proycon commented 6 years ago

folialint fails on the following document with this error (foliavalidator does not complain):

XML error: WordRefence id=TEI.1.text.1.body.1.div1.1.head.1.s.1.w.3 has another value for  the t attribute them it's reference. (Zuidhollanschen versus Zuydthollanschen)

The check should compare against the text in the right textclass, which is explicitly specified at the entity level.

'Minimal' FoLiA example (http://lst.science.ru.nl/~proycon/issue52.folia.xml):

    <s xml:id="TEI.1.par">
      <w xml:id="TEI.1.text.1.body.1.div1.1.head.1.s.1.w.3" class="WORD" set="tokconfig-nld">
        <t>Zuydthollanschen</t>
        <t class="contemporary">Zuidhollanschen</t>
        <pos class="SPEC(deeleigen)" confidence="1" head="SPEC" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" textclass="contemporary">
          <feat class="deeleigen" subset="spectype"/>
        </pos>
        <lemma class="Zuidhollanschen" set="http://ilk.uvt.nl/folia/sets/frog-mblem-nl" textclass="contemporary"/>
      </w>
      <w xml:id="TEI.1.text.1.body.1.div1.1.head.1.s.1.w.4" class="WORD" set="tokconfig-nld" space="no">
        <t>Synodi</t>
        <t class="contemporary">Sijnodi</t>
        <pos class="SPEC(deeleigen)" confidence="1" head="SPEC" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" textclass="contemporary">
          <feat class="deeleigen" subset="spectype"/>
        </pos>
        <lemma class="Sijnodi" set="http://ilk.uvt.nl/folia/sets/frog-mblem-nl" textclass="contemporary"/>
      </w>
      <entities xml:id="TEI.1.text.1.body.1.div1.1.head.1.s.1.entities.1">
        <entity xml:id="TEI.1.text.1.body.1.div1.1.head.1.s.1.entities.1.entity.1" class="pro" confidence="0.68202" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" textclass="contemporary">
          <wref id="TEI.1.text.1.body.1.div1.1.head.1.s.1.w.3" t="Zuidhollanschen"/>
          <wref id="TEI.1.text.1.body.1.div1.1.head.1.s.1.w.4" t="Sijnodi"/>
        </entity>
      </entities>
    </s>
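
To make the expected behaviour concrete, here is a minimal C++ sketch of the comparison I have in mind. It is not libfolia code: `Word`, `wref_text_matches()` and `example_word()` are hypothetical names invented for this illustration, and the only FoLiA fact it relies on is that the default text class is `current`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a FoLiA <w> element (not a libfolia type):
// it simply stores one text per text class.
struct Word {
    std::string id;
    std::map<std::string, std::string> text_by_class;
};

// Returns true when the t attribute of a wref agrees with the word's text
// in the text class declared by the enclosing annotation (e.g. <entity>).
// A word without text in that class is treated as a mismatch here, which
// is an assumption of this sketch.
bool wref_text_matches(const Word& w, const std::string& t,
                       const std::string& textclass = "current") {
    auto it = w.text_by_class.find(textclass);
    return it != w.text_by_class.end() && it->second == t;
}

// The first word from the example document above.
Word example_word() {
    return Word{"TEI.1.text.1.body.1.div1.1.head.1.s.1.w.3",
                {{"current", "Zuydthollanschen"},
                 {"contemporary", "Zuidhollanschen"}}};
}

// Small self-check mirroring the situation in this issue.
void demo() {
    Word w = example_word();
    // Looking in the entity's text class: the wref is consistent.
    assert(wref_text_matches(w, "Zuidhollanschen", "contemporary"));
    // Looking in the default class reproduces the spurious mismatch.
    assert(!wref_text_matches(w, "Zuidhollanschen"));
}
```

With the entity's `textclass="contemporary"` passed in, the `t` value on the wref matches; comparing against the default class is what produces the reported error.
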
proycon commented 6 years ago

(Resolution needed for completion of INL/nederlab-linguistic-enrichment#12)

kosloot commented 6 years ago

OK, the error is detected while parsing the wref node, before it is appended to the layer, so the textclass of the layer is not yet known. (The check uses the textclass of the referenced Word instead, which is indeed wrong.) The check probably has to be postponed to the post_append() method.
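
A rough sketch of what postponing the check could look like, under heavy assumptions: the `Word`, `WordRef` and `Entity` types below are simplified stand-ins written for this comment, not the actual libfolia classes, and only the name `post_append()` is taken from the existing code. The idea is that parsing merely stores the `t` value, and validation runs once the parent (and therefore its textclass) is known.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a word: one text per text class.
struct Word {
    std::string id;
    std::map<std::string, std::string> text_by_class;
};

struct Entity;  // forward declaration, defined below

// Simplified stand-in for a wref: parsing only records the target and t.
struct WordRef {
    const Word* target;
    std::string t;
    const Entity* parent;

    WordRef(const Word* w, std::string text)
        : target(w), t(std::move(text)), parent(nullptr) {}

    void post_append();  // validation deferred until the parent is set
};

// Simplified stand-in for the enclosing annotation carrying the textclass.
struct Entity {
    std::string textclass = "current";  // FoLiA's default text class
    std::vector<WordRef> refs;

    void append(WordRef ref) {
        ref.parent = this;
        ref.post_append();  // throws on mismatch, before the ref is stored
        refs.push_back(ref);
    }
};

void WordRef::post_append() {
    if (t.empty()) return;  // no t attribute given, nothing to verify
    auto it = target->text_by_class.find(parent->textclass);
    if (it == target->text_by_class.end() || it->second != t) {
        throw std::runtime_error("wref " + target->id + ": t=\"" + t +
                                 "\" does not match the text in class \"" +
                                 parent->textclass + "\"");
    }
}

// Small self-check: the entity from the example accepts its wref.
void demo() {
    Word w{"w.3", {{"current", "Zuydthollanschen"},
                   {"contemporary", "Zuidhollanschen"}}};
    Entity e;
    e.textclass = "contemporary";
    e.append(WordRef(&w, "Zuidhollanschen"));
    assert(e.refs.size() == 1);
}
```

In this sketch a mismatching wref is rejected before it is stored, so a failed `append()` leaves the entity unchanged. Whether the real element hierarchy allows the check to be moved this simply is exactly the open question.
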

kosloot commented 6 years ago

A good solution is not easy. For the moment, this check is disabled.

kosloot commented 4 years ago

The check is still disabled, but ideally it should be performed at some stage, so I'll keep the issue open as an enhancement.