Open Ansa211 opened 6 years ago
One more comment: from the user's point of view, it would be great if the next version of SYN could have <g/> as well. People copy-paste examples from their concordances all the time and it looks ugly when their presentations contain all the extra spaces... (Not to mention lexicographers etc.)
Good questions, but I am afraid there are no easy solutions.
<g/> that would be based on the presence of this particular DISPLAYTAG/DISPLAYBEGIN combination in the registry would thus not be general. So far, our approach is that Manatee/KonText does not need to know about what the individual corpus structures mean. OK, they can be configured to be displayed in a specific way in the registry, but the question is whether this should be mentioned to the users.
<g/> is a very useful structure because it makes it possible to get back to the original tokenization. I also agree that concordances with punctuation marks separated from the original words are ugly. On the other hand, I think that displaying the punctuation glued together with the preceding words could actually confuse the users, because it would conceal how the corpus is handled internally and that searching for (e.g.) "tak," would not yield any result, although this is exactly what the users can see. This is why we are not happy with using the DISPLAYTAG/DISPLAYBEGIN mechanism at all, because it is meant to do what we are trying to avoid.
Anyway, please feel free to discuss this further, perhaps you can convince us to change our minds or at least some parts of the code :)
Concerning point 1 - I think it would be useful even if the explanation were present only for the very exact definition that I have shown (possibly with the name of the structure made variable). When DISPLAYBEGIN "_EMPTY_" and DISPLAYTAG 0 are both present and there is no DISPLAYEND, I cannot really imagine how the structure could do anything other than removing the whitespace between the surrounding tokens. Or the <corpus ...> elements in corplist.xml could have an optional parameter glue_structure, just as they have sentence_struct.
My view on point 2 - even if <g/> is present in the corpus, its functionality is switched off by default and has to be applied by ticking it in View > Corpus specific settings > Structures. Now if <g/> is present, I think that it is more likely that someone ticks it and then becomes confused (I certainly did, even though I knew that it should eat the extra spaces in the text!) than that they will misuse the functionality if it is well explained directly in the settings window. The warning to beware of the disappeared spaces when performing searches should be part of the explanation.
I do not share Michal's opinion concerning point 2 either. But that is a general question of exposing the users to raw data, where I do not think it is really as helpful as Michal thinks. (But maybe I am just overestimating our common users?) While we are still forced to keep to a single tokenisation (and that does not seem likely to change in the foreseeable future), I do not think it is the best way to deal with the language. Neither do I appreciate exposing users to our private, raw annotation values and custom labels and abbreviations.
Technically, we are ready to produce our corpora with the glue mark immediately.
Anyway, Anna is right that as long as the default behaviour does not change, there is no reason to avoid an additional feature that we are prepared to include in our data...
To be honest, I don't have a strong opinion on this, but one thing is definitely worth considering: if I understand it correctly, the <g> tag behaves differently from others. Whereas other structures are displayed when selected, the <g> in fact disappears and eats the whitespace. This should probably be explained by a brief note, because that's conceptually different from the other choices in the Corpus specific settings menu.
> Technically, we are ready to produce our corpora with the glue mark immediately.
Just to be on the safe side: has anyone checked with people from ÚTKL if their tools will manage data with structures nested inside sentences? Or do all structures get removed before the data is sent for tagging, and reinserted afterwards?
In general, I think the user interface should help users build a useful mental model of how the corpora are structured and how to query them (like @michkren, I think they need all the help they can get). This means minimizing the mismatches between what the data are and how they're displayed.
Hiding spaces between tokens is one such mismatch, so it's good it's not the default. Making the <g/> structure toggle do something completely different from the other ones, based on some config file setting that the user never gets to see, is another mismatch, so adding some help text is the least we should do.
Ideally though, I think there should really be two separate settings -- ticking the <g/> toggle should simply just show the tag, as it does with every other tag. It's just another piece of annotation that some people might actually want to work with, and currently, it's impossible to export a concordance where these tags are explicitly shown.
Then there should be a second setting with an intuitive label, something like Remove added space around punctuation (better aim for something that is roughly correct but descriptive, rather than something which is technically 100% correct but confusing). This setting would of course leverage information from the <g/> tags (or perhaps any configurable tag, cf. glue_structure in corplist.xml as @Ansa211 mentioned), but wouldn't mention them anywhere and wouldn't require the user to know anything about them, because honestly, if a user just wants to hide these added spaces, how on Earth are they supposed to figure out they should be looking for a setting that displays a structure with magical properties?
Of course, this means that the erasure of spaces would have to be re-implemented KonText-side, but I would argue that's where it belongs in the first place if we truly treat Manatee as a backend. In other words, in the Manatee config file, <g/> would be defined simply as STRUCTURE g.
Depending on which of the two preferences above are toggled, these would be the possible outcomes:
# don't show g, don't remove added space (default)
the cat . The dog
# show g, don't remove space
the cat <g/> . The dog
# show g, remove space
the cat<g/>. The dog
# don't show g, remove space
the cat. The dog
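The four outcomes above can be reproduced with a small sketch. This is only an illustration of the proposed two-toggle behaviour, not KonText or Manatee code: the function name `render`, its parameters `show_g` and `remove_space`, and the flat list-of-strings input are all assumptions made for the example.

```python
GLUE = "<g/>"  # the glue mark as it would appear among the tokens


def render(items, show_g=False, remove_space=False):
    """Join tokens into one line, honouring the two hypothetical toggles.

    items        -- tokens and glue marks in corpus order
    show_g       -- print the glue mark itself, like any other structure
    remove_space -- drop the added space at the position of a glue mark
    """
    out = []
    after_glue = False  # True while the previous item was a glue mark
    for item in items:
        if item == GLUE:
            if show_g:
                # The tag is separated by a space unless spaces are removed.
                if out and not remove_space:
                    out.append(" ")
                out.append(GLUE)
            after_glue = True
            continue
        # An ordinary token gets a leading space unless it is glued.
        if out and not (after_glue and remove_space):
            out.append(" ")
        out.append(item)
        after_glue = False
    return "".join(out)


if __name__ == "__main__":
    line = ["the", "cat", GLUE, ".", "The", "dog"]
    for show_g in (False, True):
        for remove_space in (False, True):
            print(show_g, remove_space, render(line, show_g, remove_space))
```

Keeping the two flags independent is what makes the "show g, remove space" combination possible at all; with a single toggle that both hides the tag and eats the space, that variant cannot be exported.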
This "tag" is not supposed to be inserted into the data before they are tagged. (Anyway, all markup is currently being removed before tagging and reinserted afterwards.) The "missing space" is naturally preserved in all the text all the time until it is converted into the unnatural "vertical", which is unable to distinguish between a "space" and a "missing space" between tokens - not to speak about other possibilities than just "space" - everything beyond the tokens is just being dumped. Therefore, I would not consider the
While I appreciate the simplicity of understanding the "space" as a generic separator of "tokens", I am actually as much concerned about its misleading role, as you are concerned about the dangers of getting rid of it. The "token" is simply NOT something delimited by two spaces! At the Institute for German Language in Mannheim, they do not consider punctuation to constitute tokens either - punctuation is just another part of the source "raw" text beyond the actual linguistic tokens - like spaces, pictures, tables, charts, formulas, etc. (The text is not simply a chain of tokens and nothing else - it just contains tokens and they do not need to be visually separated and may even overlap or mutate: can't, očs, včeras, načs, vamonos, aux, du, degli...) So they do not throw away spaces, like they do not throw away punctuation and all the other "non-textual" components of the text. While I am not completely sure whether Mannheim's interpretation of punctuation is right, I am still afraid that our CQP-centric view of the language material is much more biased and misleading in the opposite direction and it is unhealthy to insist on it permanently. The "spaced view" might be a nice simplification for children, but this should be primarily an academic project, right?
Once more: The "vertical" is a low level implementation detail of an (obsolete type of) indexing engine, it is not "the data" nor "the corpus". It might have been so in the 1980s, but now we are in 2018. Corpus linguistics should process the real data, not rewrite the data nor the linguistics. (That is at least my own humble opinion.)
@wanthalf I completely sympathize with your criticism of how the vertical butchers the data. Still, as long as we're using Manatee, I think users should be made more aware of it, precisely because these things are unnatural and counter-intuitive.
Your perspective is informed by intimate knowledge of how both 1980s and 2018 corpus indexing technology works. The average user knows none of this, so hiding the 1980s warts to make what we provide them with prettier compared to the newer stuff will only result in confusion.
The question is who the target user is. I still consider the problems of tokenisation and annotation basic knowledge and a prerequisite for any serious work with any corpus data (as seen in our DigiLing "introductory course").
Anyway, that is not a reason to refuse a request for an "advanced", non-default feature if we are able to implement it easily. The very same wish was already expressed e.g. by Tilman Berger at the Advisory Board meeting a few years ago. Nobody (but me ;-)) has really expressed the desire to make this the default behaviour, yet.
I've just discussed this with @tomachalek. I personally like @dlukes's solution, but it still requires some special handling of <g/> on the KonText side. Therefore, we decided to put this issue into the queue and get back to you (hopefully) in a few months, when we will discuss the possibilities of rendering the original document more generally. Thanks to all for their contributions!
Some corpora contain a structural element that acts as "NoSpaceAfter" for the previous token. According to custom, it is called g (for "glue") and defined in the registry file as a structure with DISPLAYTAG 0 and DISPLAYBEGIN "_EMPTY_".
Corpora with this element include the Araneum corpus (ÚČNK) and the DGT EU corpora (Lindat).
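To make the "NoSpaceAfter" idea concrete, here is a minimal sketch of how a glue mark could be written into a one-token-per-line vertical wherever the source text has no whitespace between two tokens. The regular-expression tokenizer and the name `to_vertical` are assumptions made for this example only; the real tokenization used for these corpora is certainly more elaborate.

```python
import re

GLUE = "<g/>"
# Naive tokenizer (an assumption): runs of word characters, or single
# punctuation characters. Whitespace is never part of a token.
TOKEN = re.compile(r"\w+|[^\w\s]")


def to_vertical(text):
    """Return tokens in order, with a glue mark between adjacent tokens
    that were not separated by whitespace in the source text."""
    lines = []
    prev_end = None  # end offset of the previous token, if any
    for match in TOKEN.finditer(text):
        if prev_end is not None and match.start() == prev_end:
            lines.append(GLUE)  # no whitespace since the previous token
        lines.append(match.group())
        prev_end = match.end()
    return lines


if __name__ == "__main__":
    print("\n".join(to_vertical("the cat. The dog")))
```

Without the glue line, the vertical for "the cat. The dog" and for "the cat . The dog" would be identical, which is exactly the information the element is there to keep.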
Currently, the default view of the corpus is to insert a space between every two tokens, and the glueing ability of <g/> is triggered by ticking it in the dialogue View -> Corpus specific settings -> Structures. This option has been selected in both of the queries linked above - notice that both in the concordance lines and when you open a wider context window, there is no whitespace between words and following punctuation, except immediately after the KW(IC).
<g/> does not do anything => a word of explanation directly in the settings dialogue would be very helpful (think of users who have no idea what <g/> might be).