alpheios-project / tokenizer

Alpheios Tokenizer Service

empty alignment group with TEI.2 tag #29

Closed monzug closed 2 years ago

monzug commented 3 years ago

With the XML tag from Perseus (e.g. www.perseus.tufts.edu/hopper/xmlchunk?doc=Perseus%3Atext%3A2008.01.0594%3Achapter%3D8 ) I do not get any error in tokenization, but the alignment groups are empty. If I replace the TEI.2 tag with the one we have in Alpheios (TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:py="http://codespeak.net/lxml/objectify/pytype" py:pytype="TREE"), it works.


irina060981 commented 3 years ago

It returns an empty result because the tokenization service returns an empty array of tokens. I think I should add an error for that.

monzug commented 3 years ago

Can we support TEI.2?

irina060981 commented 3 years ago

> Can we support TEI.2?

@balmas, I believe this is an issue for the remote tokenizer?

balmas commented 3 years ago

Yes. Transferring.

monzug commented 3 years ago

Another sample with empty text comes from the following XML file (https://texts.alpheios.net/text/urn:cts:latinLit:phi0472.phi001.perseus-lat2/passage/1.1-1.10); see in edit mode:

<?xml version="1.0" encoding="utf-8"?>

Pulsis urbe regibus prima pro libertate arma corripuit. Nam Porsenna rex Etruscorum ingentibus copiis aderat et Tarquinios manu reducebat.

![Screen Shot 2021-01-05 at 4 24 35 PM](https://user-images.githubusercontent.com/41396793/103664422-8f102c80-4f72-11eb-96b0-ef2a571b0bf6.png)
monzug commented 3 years ago

This should also be tested with two target texts (upload the file alignment-lat-lat-eng.tsv).

irina060981 commented 3 years ago

I checked this case with TEI.2. The problem is that the bare tag is not enough: <TEI.2> should be <TEI.2 xmlns="http://www.tei-c.org/ns/1.0">

With this definition tokenization works properly. I will send this question to Bridgit too.
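
The namespace dependence can be reproduced outside the service. This is a minimal sketch using Python's standard `xml.etree.ElementTree`, which is an assumption, since the tokenizer's own parsing code is not shown in this thread. The helper name `find_tei_text` is hypothetical, and it assumes the lookup targets the `<text>` element.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def find_tei_text(xml_string):
    """Hypothetical helper: return the TEI-namespaced <text> element, or None."""
    root = ET.fromstring(xml_string)
    # A namespace-qualified search only matches elements declared in TEI_NS.
    return root.find(f".//{{{TEI_NS}}}text")

plain = "<TEI.2><text>Pulsis urbe regibus</text></TEI.2>"
namespaced = f'<TEI.2 xmlns="{TEI_NS}"><text>Pulsis urbe regibus</text></TEI.2>'

print(find_tei_text(plain))            # None, so there is nothing to tokenize
print(find_tei_text(namespaced).text)  # Pulsis urbe regibus
```

A lookup that returns nothing without raising would be consistent with the symptom reported here: no tokenization error, but an empty token array.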

monzug commented 3 years ago

Yes, that's how I changed it to align successfully.

On Fri, Feb 12, 2021 at 12:55 PM Sklyarova Irina notifications@github.com wrote:

> I checked this case with TEI.2, the problem is that the following tag is not enough <TEI.2> should be <TEI.2 xmlns="http://www.tei-c.org/ns/1.0">
>
> With this definition tokenization works properly. Would send this question to Bridgit too

irina060981 commented 2 years ago

I have updated the tokenizer service to support TEI tags without a namespace. Here is how it checks:

  1. First it searches for the dts:fragment tag (this is for DTS texts).
  2. If that fails, it searches for the tag with the namespace xmlns="http://www.tei-c.org/ns/1.0".

These two steps existed before my update. I have added:

  3. If that fails, it searches for the tag without a namespace and gets the text from it (and from all inner nodes).

So now all of these templates are rendered:

<TEI>
<text>
this is body text
</text>
</TEI>

<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text>
this is body text
</text>
</TEI>

<TEI.2>
<text>
this is body text
</text>
</TEI.2>

<TEI.2 xmlns="http://www.tei-c.org/ns/1.0">
<text>
this is body text
</text>
</TEI.2>
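
The lookup order described above can be sketched roughly as follows. This is an illustration, not the service's actual code. The function name `extract_text` and the DTS namespace URI are assumptions, and so is the choice of `<text>` as the searched tag, which is taken from the templates. It uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DTS_NS = "https://w3id.org/dts/api#"  # assumed DTS namespace URI

def extract_text(xml_string):
    """Hypothetical sketch of the three-step lookup; returns '' if nothing matches."""
    root = ET.fromstring(xml_string)
    candidates = (
        f"{{{DTS_NS}}}fragment",  # step 1: DTS fragment
        f"{{{TEI_NS}}}text",      # step 2: <text> in the TEI namespace (assumed tag)
        "text",                   # step 3: <text> without a namespace
    )
    for tag in candidates:
        # iter() walks the whole tree, including the root element itself
        node = next(root.iter(tag), None)
        if node is not None:
            # itertext() gathers text from the node and all inner nodes
            return "".join(node.itertext()).strip()
    return ""

templates = [
    "<TEI><text>this is body text</text></TEI>",
    f'<TEI xmlns="{TEI_NS}"><text>this is body text</text></TEI>',
    "<TEI.2><text>this is body text</text></TEI.2>",
    f'<TEI.2 xmlns="{TEI_NS}"><text>this is body text</text></TEI.2>',
]
for template in templates:
    print(extract_text(template))
```

Trying the most specific match first means a DTS or namespaced TEI document is not picked up by the looser fallback in step 3.
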
irina060981 commented 2 years ago

I found one edge case: if you try to tokenize XML with a TEI namespace that is not equal to "http://www.tei-c.org/ns/1.0", then it fails.

I didn't find a way to make the namespace match less strict, but I checked and didn't find any other valid namespace for TEI.

If we want to support XML with other namespaces, I will need more time and some examples.
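
For reference, one possible way to loosen the namespace match, offered only as a sketch and not as what the tokenizer does: Python's standard `xml.etree.ElementTree` (3.8 and later) accepts a `{*}` wildcard that matches a tag in any namespace or in none. The helper name and the example namespace `http://example.org/other` are hypothetical.

```python
import xml.etree.ElementTree as ET

def find_text_any_namespace(xml_string):
    """Hypothetical helper: find <text> regardless of its namespace."""
    root = ET.fromstring(xml_string)
    # "{*}text" matches <text> in any namespace, or in none (Python 3.8+)
    return root.find(".//{*}text")

other_ns = '<TEI xmlns="http://example.org/other"><text>this is body text</text></TEI>'
no_ns = "<TEI><text>this is body text</text></TEI>"

print(find_text_any_namespace(other_ns).text)
print(find_text_any_namespace(no_ns).text)
```

The trade-off is that the namespace is no longer checked at all, so any document containing a `<text>` element would match, TEI or not.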

monzug commented 2 years ago

They all work now. This issue is about the TEI.2 tag. If there are issues with other tags, we will open a new GitHub issue.

irina060981 commented 2 years ago

OK, thank you, @monzug. I think it is really difficult to find all variants at once, but I will be able to make fixes based on examples.