ETCBC / bhsa

Hebrew Bible + Linguistic annotations in text-fabric format. Fixed and ongoing versions.
https://etcbc.github.io/bhsa/
MIT License

nametype feature not always implemented correctly (I assume) #5

Closed oliverglanz closed 4 years ago

oliverglanz commented 4 years ago

Bug/Problem In the bhsa feature description it says that "nametype" is a feature of the object-type "lex". This indeed looks to be the case when checking SHEBANQ. Jer 1:1 shows three nametype values (2x "pers", 1x "topo"): Annotation 2020-09-26 114539

When running an MQL query in SHEBANQ that looks for the value "topo" of the feature "nametype" of the object-type "lex" in Jer 1:1, it should find "Anathot". But it doesn't (https://shebanq.ancient-data.org/hebrew/query?version=2017&id=3479). Instead it finds only 8 words in all of Jeremiah (there should be more than 500 topos in Jer). "Anathot" in Jer 1:1 is not found, even though it has received the value "topo". The same happens when looking for "pers" in Jer: only 20 are found, while there should be more than 1000.

Annotation 2020-09-26 115837

A quick comparison with the bhsa TF app shows the same results: Annotation 2020-09-26 120149

However, in contrast to the feature description (https://etcbc.github.io/bhsa/features/nametype/) it seems that "nametype" is attached to the object-type word in the bhsa TF app where the accurate results can be retrieved:

Annotation 2020-09-26 120404

The linking of "nametype" with the object-type "word" was not done in SHEBANQ, however:

Annotation 2020-09-26 120626

Conclusion Only a very limited number of "nametype" values are linked with the object-type "lex". This is true for both the bhsa TF app and SHEBANQ 2017. However, all "nametype" values are linked with the object-type "word" in the bhsa TF app.

Suggestion Change the official bhsa feature description and make "nametype" a feature of "word". This is already implemented in the bhsa TF app. The same should be done in SHEBANQ 2017.

dirkroorda commented 4 years ago

Thanks Oliver! Version 2017 cannot be changed; it is a fixed version. The conclusion is right: 2017 does not have the feature nametype on word nodes, and that makes things more difficult. But version c has it as you wish.

See also https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/bhsa/cookbook/nametype.ipynb, written in response to Viktor Isaak earlier.
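The difference in hit counts between "lex" and "word" can be illustrated with a small toy model. This is only a sketch under stated assumptions: the lexeme labels, node numbers, and the two helper functions below are hypothetical and are not taken from the BHSA data or from the notebook linked above. It assumes that each lexeme is represented once as a "lex" object, while every occurrence in the text is a separate "word" object that points to its lexeme.

```python
from collections import Counter

# Hypothetical toy data (not real BHSA nodes): one entry per lexeme.
LEX_NAMETYPE = {
    "Jeremiah": "pers",
    "Hilkiah": "pers",
    "Anathoth": "topo",
    "word": None,  # an ordinary noun has no nametype
}

# Hypothetical word nodes: each occurrence points to its lexeme.
WORD_LEX = {
    1: "word",
    2: "Jeremiah",
    3: "Hilkiah",
    4: "Anathoth",
    5: "Jeremiah",
    6: "Anathoth",
}


def count_on_lex(lex_nametype):
    """Count nametype values when the feature lives on lex objects only."""
    return Counter(value for value in lex_nametype.values() if value)


def count_on_words(word_lex, lex_nametype):
    """Count nametype values after looking them up for every word occurrence."""
    return Counter(
        lex_nametype[lexeme]
        for lexeme in word_lex.values()
        if lex_nametype.get(lexeme)
    )


print(count_on_lex(LEX_NAMETYPE))
print(count_on_words(WORD_LEX, LEX_NAMETYPE))
```

In this toy model a query on lex objects can never return more hits than there are distinct lexemes, whereas a query on word objects counts every occurrence. With Text-Fabric, the same lookup could presumably be done on real data by reading `F.nametype.v(lex_node)` and walking down to the occurrences with `L.d(lex_node, otype="word")`, but that is untested here.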

oliverglanz commented 4 years ago

Thank you, Dirk, for the clarification. That is helpful. So the fact that querying (in "bhsa C") "lex nametype" yields fewer results than "word nametype" has to do with the fact that the object-type "lex" overlaps with the object-type "word".

But then there is still a bug in the "C" version of bhsa in SHEBANQ. If version "C" has the "nametype" feature connected to the object-type "word", then SHEBANQ must be running a different version than "C" when I run the MQL query in "C":

Screenshot 2020-09-26 211235

I am told that "nametype" is not a feature of "word"...

If running the same query in TF on bhsa "C" I get what I should get when searching "C":

Screenshot 2020-09-26 211626

dirkroorda commented 4 years ago

You spotted a discrepancy between the version c data as it is on GitHub and as it is used by SHEBANQ. I think I added lex features to words in version c and published it on GitHub without feeding it into the pipeline to SHEBANQ.
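A discrepancy of this kind could in principle be detected by comparing, per object type, which features each copy of the data defines. The following is a minimal sketch, assuming the feature inventories are available as plain mappings. The function `feature_gaps` and the two inventories are hypothetical illustrations, not part of the BHSA pipeline, and the feature lists are deliberately incomplete.

```python
def feature_gaps(source, deployed):
    """Return {object_type: [features in source that deployed lacks]}."""
    gaps = {}
    for object_type, features in source.items():
        missing = set(features) - set(deployed.get(object_type, ()))
        if missing:
            gaps[object_type] = sorted(missing)
    return gaps


# Hypothetical, heavily abbreviated inventories for illustration only.
GITHUB_C = {
    "word": {"lex", "sp", "nametype"},
    "lex": {"lex", "nametype"},
}
SHEBANQ_C = {
    "word": {"lex", "sp"},
    "lex": {"lex", "nametype"},
}

print(feature_gaps(GITHUB_C, SHEBANQ_C))
```

With these toy inventories the check reports that "nametype" is missing on "word" in the deployed copy, which mirrors the situation described in this thread.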

dirkroorda commented 4 years ago

I could try to update the c version of SHEBANQ. It would also be good to prepare a new version, 2020, and add it to GitHub and SHEBANQ.

But I think the ETCBC and DANS should agree on that.

oliverglanz commented 4 years ago

Sounds like a plan. Thank you.

dirkroorda commented 4 years ago

I made the pipeline from the BHSA GitHub repo to SHEBANQ with the intention that it be operated by the ETCBC. I still hold that view.