DISSINET / InkVisitor

An open-source, browser-based front-end application for the collection of complex structured data from textual resources in history and the social sciences into a RethinkDB database for further analysis.
BSD 3-Clause "New" or "Revised" License

Change sorting of entity labels: by order of appearance in this full-text #2051

Open davidzbiral opened 6 months ago

davidzbiral commented 6 months ago

Sort the entity labels spanning the active text from the inside outwards, i.e. from the closest to the farthest: for example, from a Location to the Statement it is in, then to the immediate subT, the parent subT, the grandparent subT, etc.

adammertel commented 6 months ago

@davidzbiral This is not a precisely defined rule. What would be the order in case TEXT2 is selected in the example below?

<a>text1 <c> TE<d>XT<b>2 </a> text3 </b> text4 text5 text 6</c>text 7</d>

I think we need to be more specific. What about sorting by entity class + alphabet? I think it might be easier to search for a specific entity in such a case.

davidzbiral commented 6 months ago

@adammertel We are operating on a word (word token) basis, so the example is somewhat artificial (even if allowed by the application). That's just a small note, though, and there are languages where we will need the character level.

I will reformulate completely, as my request was not ideal. Those which start in the selected span should be sorted by order of appearance, i.e. a, c, d, b. Actually, let's make it simpler (perhaps this is how you already do it? Sorry, no time to open the app and check now): always sort by order of appearance in the full-text, both the anchors which start outside the span and those which start within it. I.e. first you will see the whole full-text, then the subT, then the subsubT, then the S, then e.g. a Location within that Statement, etc. In other words, let's follow the order of appearance of the start tag.
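A minimal sketch of the "order of appearance of the start tag" rule, applied to the example from the earlier comment. The `Anchor` shape, the character offsets, and the function name are assumptions made for illustration, not InkVisitor's actual data model. The tie-break for anchors that start at the same position (outer, longer anchor first) is also an assumption.

```typescript
// Hypothetical anchor shape: offsets of the start and end tag in the plain text.
interface Anchor {
  id: string;
  start: number; // offset where the start tag sits
  end: number;   // offset where the end tag sits (exclusive)
}

// Return the anchors spanning any part of the selection [selStart, selEnd),
// sorted by where their start tag appears in the full text.
// Assumed tie-break: for equal starts, the longer (outer) anchor comes first.
function anchorsForSelection(
  anchors: Anchor[],
  selStart: number,
  selEnd: number
): Anchor[] {
  return anchors
    .filter((a) => a.start < selEnd && a.end > selStart)
    .sort((x, y) => x.start - y.start || y.end - x.end);
}

// Offsets for: <a>text1 <c> TE<d>XT<b>2 </a> text3 </b> text4 text5 text 6</c>text 7</d>
const exampleAnchors: Anchor[] = [
  { id: "b", start: 11, end: 20 },
  { id: "d", start: 9, end: 45 },
  { id: "a", start: 0, end: 13 },
  { id: "c", start: 6, end: 39 },
];

// "TEXT2" occupies offsets 7..12 in the plain text.
console.log(anchorsForSelection(exampleAnchors, 7, 12).map((a) => a.id));
// prints [ 'a', 'c', 'd', 'b' ]
```

Because the rule only looks at start offsets, it gives a defined order even for crossing anchors like the ones in the example.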

This means that you should have a rule for whether a new anchor on a span that already has one should be put inside or outside the existing one, if you get what I mean. E.g. if "Lombardia" is already enclosed in the anchors of L Lombardy and I then select it again, should the new anchors go inside or outside? I think inside. Definitely no crossing anchors (which would be, completely unnecessarily, XML-invalid).
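One way to sketch the "new anchor goes inside, never crossing" rule. The `Span` type, the tag ids `L1` and `L2`, and the convention that array order stands for creation order are all assumptions for illustration. The sketch rejects crossing spans and, for anchors on an identical span, nests the later (newer) one inside the earlier one.

```typescript
// Hypothetical span shape; array order is assumed to be creation order.
interface Span {
  id: string;
  start: number;
  end: number; // exclusive
}

// Two spans cross when they overlap but neither contains the other.
function crosses(a: Span, b: Span): boolean {
  const overlap = a.start < b.end && b.start < a.end;
  const aInB = b.start <= a.start && a.end <= b.end;
  const bInA = a.start <= b.start && b.end <= a.end;
  return overlap && !aInB && !bInA;
}

// Serialize anchors as properly nested tags around the plain text.
// Identical spans: older anchor outside, newer anchor inside.
function renderAnchors(text: string, spans: Span[]): string {
  spans.forEach((a, i) =>
    spans.slice(i + 1).forEach((b) => {
      if (crosses(a, b)) throw new Error(`anchors ${a.id} and ${b.id} cross`);
    })
  );
  const indexed = spans.map((s, order) => ({ ...s, order }));
  let out = "";
  for (let pos = 0; pos <= text.length; pos++) {
    // Close inner anchors first: latest start, then newest.
    const closes = indexed
      .filter((s) => s.end === pos)
      .sort((x, y) => y.start - x.start || y.order - x.order);
    // Open outer anchors first: farthest end, then oldest.
    const opens = indexed
      .filter((s) => s.start === pos)
      .sort((x, y) => y.end - x.end || x.order - y.order);
    for (const s of closes) out += `</${s.id}>`;
    for (const s of opens) out += `<${s.id}>`;
    if (pos < text.length) out += text.charAt(pos);
  }
  return out;
}

// "Lombardia" anchored twice on the same span (offsets 3..12).
console.log(
  renderAnchors("in Lombardia", [
    { id: "L1", start: 3, end: 12 },
    { id: "L2", start: 3, end: 12 },
  ])
);
// prints in <L1><L2>Lombardia</L2></L1>
```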

davidzbiral commented 6 months ago

@adammertel So do whatever is appropriate with this issue, but I have appended a new topic: enclosing the same span in several anchors. Oh, and Adam, I think we should prevent anchors from differing by whitespace alone. I.e. it should probably not be possible to have two anchors around "Lombardy" whose spans differ only in whitespace. Generally this should be prevented, I think: if the selected span starts or ends with whitespace, the anchor should stick to the words and exclude the whitespace (i.e. some "trimming" of sorts, of course not changing the text, but sticking the anchors to the actual words or punctuation and excluding the leading and trailing whitespace). Such a structure around "Lombardy" is unnecessarily invalid and confusing if all the user wanted to do was enclose "Lombardy" in two tags and happened to select the following whitespace as well.
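The trimming described above could look roughly like this. It is a sketch only: the function name and the offset-based selection are assumptions, and the text itself is never modified, only the anchor boundaries move.

```typescript
// Shrink a selection [start, end) so that it neither starts nor ends with
// whitespace. Returns the adjusted offsets; an all-whitespace selection
// collapses to an empty range (start === end).
function trimSelection(
  text: string,
  start: number,
  end: number
): [number, number] {
  while (start < end && /\s/.test(text.charAt(start))) start++;
  while (end > start && /\s/.test(text.charAt(end - 1))) end--;
  return [start, end];
}

// The user selected "Lombardy " (with the trailing space) in "in Lombardy and".
console.log(trimSelection("in Lombardy and", 3, 12));
// prints [ 3, 11 ]
```

With the boundaries normalized this way, a second selection of the same word yields an identical span, so two anchors can no longer differ by whitespace alone.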

Should I create a new issue describing these two things, so that you can then decide what needs to be done with the original issue?