Brown-University-Library / OLD-ARCHIVED_iip-production


word segmentation and lemmatization should ignore <orig> when it's alone #119

Open emylonas opened 3 years ago

emylonas commented 3 years ago

When the <orig> element appears alone, and not as a child of <choice>, its content should be ignored. Characters inside that kind of <orig> are not words; they represent something that is not recognizable as a word.

Examples: masa0836 has the string <orig>CB</orig>, which appears in the word list; the same happens in masa0838. caes0062 has the strings <orig>C</orig> and <orig>DO</orig>, which seem not to be in the list.
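A minimal sketch of the rule described above, assuming the inscriptions are TEI XML in the standard TEI namespace and are parsed with Python's standard-library ElementTree. The function name `standalone_origs` and the sample markup are hypothetical and are not taken from the iip-production code; the idea is only that an <orig> whose parent is not <choice> gets collected so the segmenter can leave it out of the word list.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ORIG = f"{{{TEI_NS}}}orig"
CHOICE = f"{{{TEI_NS}}}choice"


def standalone_origs(root):
    """Return the <orig> elements whose parent is not <choice>."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        el
        for el in root.iter(ORIG)
        if parent_of.get(el) is None or parent_of[el].tag != CHOICE
    ]


# Invented sample: one standalone <orig> and one inside <choice>.
sample = (
    f'<ab xmlns="{TEI_NS}">'
    "<orig>CB</orig> "
    "<choice><orig>DO</orig><reg>do</reg></choice>"
    "</ab>"
)
root = ET.fromstring(sample)
skipped = [el.text for el in standalone_origs(root)]
print(skipped)  # ['CB']
```

Only the standalone <orig>CB</orig> is reported; the <orig> inside <choice> is left for the normal word handling.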

atbradley commented 3 years ago

Do we want these to have @xml:ids?

emylonas commented 3 years ago

Yes, the xml:ids help with context for parsing (so orig not useful) and also for creating KWIC views, where the orig will be useful.
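One way to reconcile the two uses could look like the sketch below: every token gets an xml:id so KWIC views can show the <orig> in context, while the list of word ids leaves the standalone <orig> out. This assumes tokens are direct children of the segmented element (so an <orig> found there is not inside <choice>); the function name `number_tokens`, the id pattern, and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumption: standard TEI namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ORIG = f"{{{TEI_NS}}}orig"


def number_tokens(segmented, prefix):
    """Set xml:id on every direct child; return the ids of word tokens only."""
    word_ids = []
    for n, el in enumerate(segmented, start=1):
        token_id = f"{prefix}-{n}"  # hypothetical id pattern
        el.set(XML_ID, token_id)
        # A direct-child <orig> is standalone, so it is not counted as a word.
        if el.tag != ORIG:
            word_ids.append(token_id)
    return word_ids


# Invented sample of a segmented div.
sample = (
    f'<div xmlns="{TEI_NS}" type="segmented">'
    "<w>abc</w><orig>CB</orig><w>def</w>"
    "</div>"
)
div = ET.fromstring(sample)
word_ids = number_tokens(div, "masa0836")
all_ids = [el.get(XML_ID) for el in div]
print(word_ids)  # ['masa0836-1', 'masa0836-3']
print(all_ids)   # ['masa0836-1', 'masa0836-2', 'masa0836-3']
```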

If it's in the segmented div, it should have an xml:id. orig doesn't have an xml:lang, and num probably should, so that will have to be copied as for
