dictionaria / pydictionaria

Apache License 2.0
Make cross-reference detection more robust #19

Closed xrotwang closed 5 years ago

xrotwang commented 5 years ago

Currently, it seems cross-references using homonym numbers do not work - e.g. for Palula.

xrotwang commented 5 years ago

The problem seems to be that the IDs in id_index are not updated to the newly generated ones.

xrotwang commented 5 years ago

OK, the ID index is created after new IDs have been assigned - so that's correct. But the original IDs should incorporate homonym numbers. Otherwise they are not unique, so id_index cannot map each original ID to exactly one new ID. E.g. when processing Palula, there are two lexemes bíi. Because they contain non-ASCII characters, a new ID is assigned to each of them (but for both new IDs, bíi is kept as "original ID"):

bíi → LX000404
bíi → LX000405

This results in id_index mapping bíi to the last new ID assigned to this original ID.
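
A minimal sketch of the collision, not the actual pydictionaria code: it assumes id_index behaves like a plain dict keyed by original ID. The function `build_index`, the homonym numbers 1 and 2, and the key format `"bíi 1"` are hypothetical, chosen only for illustration; the lexeme and the two new IDs are the ones from the example.

```python
# Hypothetical records: (original ID, homonym number, newly assigned ID).
# The homonym numbers 1 and 2 are assumed for illustration.
entries = [
    ("bíi", 1, "LX000404"),
    ("bíi", 2, "LX000405"),
]


def build_index(entries, with_homonym=False):
    """Map original IDs to new IDs, optionally qualifying keys by homonym number."""
    index = {}
    for original_id, homonym, new_id in entries:
        # Hypothetical key format: "<original ID> <homonym number>".
        key = f"{original_id} {homonym}" if with_homonym else original_id
        # A repeated key silently overwrites the earlier entry.
        index[key] = new_id
    return index


naive_index = build_index(entries)
qualified_index = build_index(entries, with_homonym=True)

print(naive_index)      # only one entry survives for bíi
print(qualified_index)  # both homonyms keep their own new ID
```

With plain original IDs as keys, the second bíi overwrites the first, so any cross-reference to bíi resolves to LX000405. Once the key includes the homonym number, both lexemes stay addressable.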

Also, variants with homonym numbers are not added to id_index as expected - but I don't know yet why that happens.