Closed foxik closed 7 years ago
Why would you use `\p`? "|" is not a whitespace character, thus it should be a normal token and not a `SpacesAfter` attribute. On the other hand, people may need to encode `TAB`, `CR` and `LF`.
I can imagine that in some circumstances the `SpacesAfter`/`SpacesBefore` might contain non-space characters. Although this is not useful from the point of view of UD treebanks, it is useful when using CoNLL-U as a pipeline format. For example, if the input is some kind of XML file with elements encoding some meta information, it would make sense to consider the XML elements "non-tokens" and save them to `SpacesAfter` and `SpacesBefore`. I.e.,
<h1>Hello</h1>
could be encoded as
Hello\t...\tSpacesAfter=</h1>|SpacesBefore=<h1>
The names of the features can be misleading in this use case and maybe should be different, something like `TextAfter/Before`, `NontokenAfter/Before` or `IgnoredAfter/Before`. But I did not particularly like any of those, and `SpacesAfter` corresponds to `SpaceAfter`, with which it interacts, so I proposed `SpacesAfter/Before`, while assuming it could contain non-space characters.
I am definitely open to different feature names, if you have any suggestions :-)
Fine with me. Then it actually is a "space" from the point of view of treebank annotation. But this should be briefly mentioned in the documentation, too.
Noted, I will mention it if the proposal gets in.
Do you have any thoughts about the escaping scheme?
Not sure whether we also want/have to escape the other space characters in the higher Unicode space. Otherwise, the `\s` etc. sound reasonable to me. First, it is short => readable. Second, it is what I often use elsewhere (but this is quite a subjective argument of course).
Note that the `MISC` attributes are not required to appear in alphabetical order (unlike `FEATS`) and I think that is OK. Thus we may also use the logical order here, i.e. `SpacesBefore=...|SpacesAfter=...`.
As this is something that does not directly interfere with existing UD guidelines, nor with released data, I think we do not even have to wait for the v2 guidelines with it. But let's wait to see whether the other UD'ers have something to say here.
I definitely second this proposal! Based on our recent effort to align the Norwegian treebank with the corresponding raw texts we felt there was important information lost regarding spacing.
Just a minor detail: should there be an order of preference for the `spacesBefore=...|spacesAfter=...`? E.g. in `sentence1 \n sentence2`, should we use `spacesAfter` on the last token of `sentence1` or `spacesBefore` on the first token in `sentence2`?
Regarding the preference -- the current proposal allows `SpacesBefore` on the first token of every sentence. Therefore, both possibilities in the above example are valid.
In theory it would be possible to allow `SpacesBefore` on the first token of the first sentence only (i.e., in the above example, `SpacesAfter` on the last token of `sentence1` would have to be used). However, such a rule is inconvenient in situations where multiple "documents" (from a logical point of view, marked for example using `sent_id`) are encoded in one CoNLL-U file, because in this situation it makes sense to use `SpacesBefore` at the beginning of every (logical) document.
Another example is an NLP pipeline which processes data "on the fly" -- in this situation, every paragraph of the plain text can be processed individually (sentences cannot go over a paragraph boundary). If a paragraph starts with spaces, it is more convenient to store the spaces using `SpacesBefore` on the first token of this paragraph than in `SpacesAfter` on the last token of the previous sentence (because that would mean the previous sentence could not be processed until a following non-space character is found).
So I would say that `SpacesAfter` are preferred if possible, but `SpacesBefore` are allowed on the first token of any sentence.
Yes, I also think that would be a great thing to have (we already have something similar in CoreNLP so that people can reconstruct sentences after tokenization) and I'm also in favor of C-like escaping.
Instead of having the `SpacesBefore` attribute as a token property, we could also consider having an optional sentence-level comment before each sentence. Then we wouldn't have this token-level property that is only allowed to be used on the first token, which might be confusing to some people. Or we could just allow it on every token.
I would not make it a sentence-level comment when the remaining pieces are token-level attributes. Allowing `SpacesBefore` on every token sounds OK to me, in the sense that the validator would not complain, and processing software should be able to deal with it. Generating it on non-first tokens would not be recommended in the normal cases, but e.g. when it is used to hide XML markup, as @foxik suggests above, it may sometimes be even more intuitive, so I would not ban it.
Allowing `SpacesBefore` everywhere sounds reasonable (actually, the same argument I used for allowing `SpacesBefore` in every sentence can be used for allowing it on every token, as Dan does above).
I will prepare the changes as a pull request, so that others can discuss them. (But probably at the beginning of the next week.)
Is SpacesBefore/SpacesAfter covering all cases of interest? Some tokenizers do some small reorderings sometimes (e.g. switching double quotes and punctuation, depending on the language) and normalize characters (e.g. punctuation and some unicode characters). Do we want to support this? The standard approach here is to have the full sentences somewhere (maybe in a separate file) and to encode the character-level start/end positions of each word token.
Hi
Sorry I'm a bit late to this discussion. I think @andre-martins follows the right lead here, with the character offsets. I myself would much rather see that than a complex mixture of SpacesBefore/SpacesAfter and escaping. A single attribute `charoff=3:7`, saying that characters 3,4,5,6 of the original sentence form this token, would imho cover all situations.
Filip
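To make the reading of `charoff=3:7` as "characters 3,4,5,6" concrete, here is a minimal Python sketch. Only the attribute name comes from the comment above; the helper `token_from_charoff` and the sample sentence are hypothetical, and offsets are assumed to be 0-based with an exclusive end.

```python
def token_from_charoff(sentence: str, charoff: str) -> str:
    """Return the substring selected by a 'start:end' pair (end is exclusive)."""
    start, end = (int(part) for part in charoff.split(":"))
    return sentence[start:end]


sentence = "We love dogs."
# Characters 3, 4, 5 and 6 of the sentence form the token.
print(token_from_charoff(sentence, "3:7"))
```

Under these assumptions the offsets map directly onto a Python slice, so no escaping is needed at the token level, but the full sentence has to be available.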
SpacesBefore/After would probably be a rare feature while CharOff would have to be present at every single token in the data (and the full sentence would have to be there in addition, of course).
But I agree that if replacements and reorderings have to be supported then this is the way to go. Not sure whether they have to, though.
I think both use cases are useful at times. Sometimes you want to reconstruct the data at some point (for example when you started with an XML) and you would like all the data to be in the document -- then `SpacesBefore` and `SpacesAfter` seem the straightforward way to do it.
Note that if we want to use something like `TokenRange=3:5`, we should also specify how to encode the original sentence -- the easiest would probably be to encode the sentence in a comment (only the \r and \n would have to be escaped).
When I started with the proposal, I was thinking about suggesting both alternatives, but then I decided to go with just the SpacesBefore/After. But you are right, the SpacesBefore/After do not support reorderings and replaces.
BTW, if we want the TokenRange to be more effective (and also to play together with `SpaceAfter`), we could define a reasonable default -- the length would be the length of the form, and the offset would be 0 for the first token, and either last_token_end+1 or last_token_end (depending on whether the previous token has `SpaceAfter=No`). This way, the `TokenRange` would be needed only when 1) the original token has a different length than the form, or 2) there is more than 1 "space" character between the tokens.
So if we want a more general correspondence with the original text, I would:
`TokenRange` with the defaults described above
Thoughts?
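The defaults described above could be resolved along these lines. This is only a sketch: the `Token` tuple layout and the function `resolve_ranges` are hypothetical, and ranges are assumed to be 0-based with an exclusive end, so that "last_token_end+1" corresponds to exactly one space between tokens.

```python
from typing import List, Optional, Tuple

# (form, has SpaceAfter=No, explicit TokenRange or None) -- a hypothetical in-memory layout
Token = Tuple[str, bool, Optional[Tuple[int, int]]]


def resolve_ranges(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Fill in default (start, end) ranges for tokens without an explicit TokenRange."""
    ranges: List[Tuple[int, int]] = []
    prev_end = 0
    prev_no_space = True  # makes the first token default to offset 0
    for form, no_space, explicit in tokens:
        if explicit is not None:
            start, end = explicit
        else:
            # One space is assumed unless the previous token had SpaceAfter=No.
            start = prev_end if prev_no_space else prev_end + 1
            end = start + len(form)
        ranges.append((start, end))
        prev_end, prev_no_space = end, no_space
    return ranges


text = "Hello,  world"  # two spaces after the comma
tokens: List[Token] = [
    ("Hello", True, None),      # SpaceAfter=No, default range
    (",", False, None),         # default range
    ("world", False, (8, 13)),  # explicit range needed because of the extra space
]
ranges = resolve_ranges(tokens)
print(ranges)
print([text[start:end] for start, end in ranges])
```

In this example only the last token needs an explicit range, which matches the point that the attribute would be rare in typical data.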
I think the defaults are a good idea, so we do not clutter the file. Orig sentence as a comment is I think the way to go (we already do this for Finnish). I know that some treebanks will not be able to provide the orig sentence, but will be able to provide the SpaceAfter=No (legal reasons), so that usecase is also supported.
(BTW, even if you could not provide the original sentence, but could provide the spaces only, it would still be possible to do it -- the sentence would consist of spaces only and the TokenRanges would have zero length :-)
Note that "standardizing" the sentence in the comment interacts with #273 -- especially the format. The last suggestion in #273 is `^# ([a-z_]+) *= *(.*)$`, but that has a serious disadvantage -- the spaces at the beginning of the sentence cannot be represented (because all spaces after the `=` are "eaten up"). Also #273 suggests `sentence` (and some other variants) as the name of the attribute containing the sentence text.
In order to deal with the spacing issue, I suggest we use `^# [a-z_0-9]+=(.*)$` in #273, and also define escaping rules (`\r`, `\n` and `\\`). Then I also suggest using `text` instead of `sentence` to store the text of the sentence (because `sentence` suggests also other meanings like sentence id, while `text` does not).
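The difference between the two patterns can be checked directly. The sketch below uses Python's `re` module with the two expressions quoted above; the sample comment lines and the variable names are made up for illustration.

```python
import re

# Pattern from the last suggestion in #273: spaces around "=" are optional.
LOOSE = re.compile(r"^# ([a-z_]+) *= *(.*)$")
# Pattern suggested here: nothing between "=" and the value is skipped.
STRICT = re.compile(r"^# [a-z_0-9]+=(.*)$")

loose_value = LOOSE.match("# text =   Hi").group(2)
strict_value = STRICT.match("# text=   Hi").group(1)

print(repr(loose_value))   # leading spaces are consumed by " *"
print(repr(strict_value))  # leading spaces survive as part of the value
```

With the loose pattern the greedy ` *` after `=` absorbs every leading space, so a sentence that starts with whitespace cannot be told apart from one that does not.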
BTW, it seems to me that the `SpacesBefore`/`SpacesAfter` are simpler than the sentence_text+ranges+defaults [because they are quite straightforward and do not require sentence-level comments], and slightly more effective when no reorderings and replacements are used [as they do not need to store the original tokens]; however, they do not support reorderings and replacements. One possibility would be to handle the reorderings/replacements differently (replacements by using for example `OriginalToken=...` and reorderings like `ReorderOffset=`).
On the other hand, storing the whole sentence in the comment seems to be a commonly asked-for feature.
@foxik I second that.
I don't see why we should support reordering. To me it is something else than what UD is about, and people who want to do it in CoNLL-U should define the mechanism separately.
Supporting replacement sounds reasonable, it also interacts with handling of typos, see #330, but then something like `OriginalToken=...` is quite OK.
Full text of sentence (the `# text` comment) can still be provided; as long as you do not count inter-sentence whitespace as part of any sentence, it is much easier. That could be covered by the `SpacesBefore/After` attributes, which would be the main device for reconstructing the source text.
For the record: I think the current `SpaceAfter=No` was just fine as it was and I personally do not see CoNLL-U as a universal data communication format. If the original sentence text is given as a comment, the tokens can be found in it, if someone really cares about extra spaces. So my primary vote would go to "include the sentence as a comment, leave the rest as is".
But if this still needs to be redesigned... ...the SpacesBefore/After just seems quite opaque with all the escaping, ambiguities of where SpacesBefore can be used, etc. Character offsets supports all situations I can think of.
I would like to know what are the use cases for
Of course, all these are necessary for reconstructing the original raw text, but my question is: for what real-world use cases do you need to reconstruct it so exactly?
Even with all the suggestions above, we may not be able to reconstruct the original fully: the original may not be in utf8 (not even in Unicode) or it may contain invalid Unicode encoding.
Note also that for preserving the original (XML) markup `SpacesAfter` is not enough. There may be markup inside tokens.
As always there are pros and cons. By preserving these distinctions we complicate more or less the CoNLL-U specification. I can imagine spaces between sentences are important for training segmenters on noisy data (where such spaces are missing), but if this is the only application, we could sacrifice the possibility of using CoNLL-U for such training.
That said, I like @fginter 's suggestion "include the sentence as a comment, leave the rest as is". If someone needs character offsets for the explicit alignment, they can always use it, but I am not sure we should standardize it now.
I think the use case for preserving whitespace has to do with aligning the dependencies to a different format. Some multilayer corpora might have typographical or document-structural annotations that refer to stand-off offsets in an original plain text or HTML file, often with XML + x-pointer (examples include the PAULA XML format and GrAF). NLP tools might also need to return dependency information back to a representation that preserves whitespace (for example, the spacy toolchain keeps it throughout).
Even within a pure treebank, sometimes the reason for an automatic sentence split by a sentence tokenizer is based on multiple white space (think of a date and author line not separated by period, but by multiple white spaces), so that the whitespace lets you reconstruct why sentence splitting was done in a certain way.
Also @dan-zeman - I know you opposed delimiters for original sentence text in #273 but maybe it's worth considering it given some of the problems that SpaceBefore/After and offsets raise. Then offsets could be marked only if they are non-trivial with respect to the original sentence text, and normal spaces before and after could go unannotated.
@amir-zeldes - I would personally prefer saying that characters before the first token and after the last token of a sentence are not part of the sentence. If it is necessary to store such inter-sentence material, it should not go to the `text` attribute, and (on second thought) I would not put it on individual tokens either. I would put it in a sentence-level comment different from `text`. And I now lean towards not standardizing it here.
I agree that having a sentence comment relying on spaces is weird.
Note that there seemed to be people interested in the functionality (preserving multiple spaces). There are also people in companies asking us about this. I think it would be a shame not to deal with the multiple spaces between tokens just because it seems complicated and inelegant.
Personally I am in favor of the original proposal -- the `SpacesAfter` and `SpacesBefore`. The cons that have been brought up:
Pros from my point of view: except for the technical issue of escaping, `SpacesBefore`/`SpacesAfter` seem to be straightforward and obvious. If someone is happy with `SpaceAfter`, they can keep using it, but if they need more functionality, they can extend it easily.
I think it makes little sense to standardize the character offsets without allowing spaces in various places of the original sentence text (i.e., disallowing newlines inside and at the end, which is not possible without newlines) -- it cannot be used to preserve the spaces.
Just a motivation for preserving spaces at the end of a sentence -- in English, some sentence breaks seem to be forced by paragraph ends in the text. If we had the full spacing information, we would know it, and would not train the segmenter on these cases.
I am not sure what the current consensus is. To sum up:
`SpacesAfter/Before` design and suggested to use character offsets, possibly including the sentence in the comment
Personally I am still interested in the proposal (the `SpacesAfter/SpacesBefore` with escaping), for the following reasons:
`SpaceAfter=No`, but they are present in the text to indicate the word boundary)
I understand that I can use the feature without standardizing it, but I believe it is useful to more people, so I think it would be better to standardize it (and not let multiple people reinvent it, each dealing with the tricky corners like escaping and inter-sentence spaces differently). Also note that if someone is not interested in more details than `SpaceAfter=No` provides, they are not affected at all.
Any thoughts, anyone?
Hi. My thoughts:
`# text:` matters and if you catenate the texts with `# text:` removed, you get the original document in a pure text form.
`# text:` field. Tokenizers should not eat or invent characters. This has the happy consequence that finding out the exact offsets of the tokens in the text, reconstructing any whitespace before or after, etc. is pretty much a single for loop and can be coded into whatever libraries we have for dealing with conllu.
`text` metadata item will give us the ability to learn sentence splitters and tokenizers from the conllu data, which is a fine feature to have.
How about?
F
A text like:
I have
a dog.
The dog
is red.
would then look like
#text: I have\na dog.\n\n
1 I
2 have
...
#text: The dog\nis red.
1 The
2 dog
...
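Under the layout in this example, rebuilding the document is a matter of unescaping and concatenating the comment values. The sketch below assumes the `#text: ` prefix exactly as written in the example and an escape set of `\n`, `\r` and `\\`; both the escape set and the helper names `unescape` and `rebuild_document` are assumptions for illustration, not something the thread has agreed on.

```python
import re

# Assumed escape set: \n, \r and \\ (the example above only shows \n).
_UNESCAPE = {"n": "\n", "r": "\r", "\\": "\\"}
PREFIX = "#text: "


def unescape(value: str) -> str:
    """Decode the assumed escapes in a single left-to-right pass."""
    return re.sub(r"\\([nr\\])", lambda m: _UNESCAPE[m.group(1)], value)


def rebuild_document(conllu: str) -> str:
    """Concatenate the unescaped values of all '#text: ' comment lines."""
    parts = []
    for line in conllu.split("\n"):
        if line.startswith(PREFIX):
            parts.append(unescape(line[len(PREFIX):]))
    return "".join(parts)


sample = "\n".join([
    r"#text: I have\na dog.\n\n",
    "1\tI",
    "2\thave",
    "",
    r"#text: The dog\nis red.",
    "1\tThe",
    "2\tdog",
])
document = rebuild_document(sample)
print(repr(document))
```

Note that this only works if trailing whitespace in the comment value is preserved exactly, which is one of the concerns raised in the replies.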
Note:
Thanks @fginter for writing. My thoughts:
`text = Hi` the same as `text =  Hi` (two spaces before Hi)
`\r` if we are escaping `\n`
BTW, note that the "sentence in the comments" approach has the same issues as SpacesBefore/SpacesAfter -- there will be some escaping, and it is ambiguous where to put inter-sentence spaces. Also both play nicely with `SpaceAfter=No`. So I believe the issues are generic for the problem we are trying to solve.
From my point of view, SpacesBefore/SpacesAfter are still the first choice -- the information is directly attached to the tokens (and not in the comments), so you do not need to reconstruct the information if you are interested in knowing "what spaces followed this token". In the "sentence in the comment" approach, even for the "what spaces follow the sentence" question you have to do something like finding the last token in the sentence text first, because it is nontrivial to say which characters at the end of the sentence text do not belong to the last token (some space-like characters are not in the Zs Unicode category -- if you find a character in the Cf category, which may really happen, does it belong to the token or is it a space between sentences?). Also, for common cases (single spaces between tokens), the SpacesBefore/SpacesAfter are more efficient (they would not be present, except probably for SpaceAfter on the last token; while the sentence text has to repeat all tokens and spaces).
PS:
If a tokenizer really finds it important to replace double quotes with something else, why not use the token-word distinction. I.e. the token is whatever is in the sentence, and the word is whatever the tokenizer wants to make it and the word is then part of the tree.
Yes, using the token-word distinction for replacements is an elegant approach I was thinking about in the past. However, it is not obvious whether it is allowed by the current CoNLL-U format, because only "multi-word" tokens seem to be allowed. Specifically, I assume that single-word tokens would use a range like `x-x`, so something like:
1-1 ˝
1 "
If you think it is worth it (I do, so that would be two of us), we could create another issue discussing it.
`text` property is special in the sense that whitespace matters and that's it
`# text: <textgoeshere>#endtext` ... I'd kind of hate to have to do this, but ...well... a hack is a solution too :)
`text`, we'd have a very broad set of things covered.
I think
1-1 ˝
1 "
would be quite okay.
PS:
Thanks @fginter for your answers. I think we both know each other's positions (and that they are different :-)
Any other opinions and preferences, anyone?
`# text:` (or `# sentence-text:`) is already used and there appears to be general support for standardizing it in #273. As this proposal has (partially) overlapping goals and no clear consensus, I would suggest postponing decisions until `# text:` is standardized (or removed, if so happens).
After we know how `# text:` gets defined, we can re-examine whether there are use-cases for `spacesAfter` and `spacesBefore` that it doesn't cover.
I would generally prefer to avoid redundant ways to specify the same information, as this adds unnecessary complexity and ways to have internally inconsistent data.
(FWIW, although I'm not deeply invested in this feature, I'm not thrilled with `SpaceAfter` and would prefer not to extend it; it just doesn't feel like the Right Thing™ to me.)
OK, here is my position:
`# text:`, but as @spyysalo writes, it would then be redundant, which has a downside, too.
`SpacesAfter` and `SpacesBefore` within the UD documentation but I don't feel the urge to do so. If you (@foxik) manage to use it for what I believe you want to use it, then it will become a standard without even being mentioned on the UD website :-)
And here are my opinions:
`# text =` for other purposes (e.g. in pipelines for storing documents after segmentation and before tokenization). I think we have agreed in #273 we want to standardize it (no one objected), we are just discussing details of the format (`=` vs `:`). I would suggest that `# text =` is not required in CoNLL-U, but recommended (and maybe required in future UD treebanks).
`# text =`. The disadvantage would be that in a typical case (one space between sentences), there will be extra \s at the end. In agreement with @dan-zeman, I still consider this much better than relying on editors not deleting trailing whitespaces. Alternatively, we could enclose the whole sentence text in quotes (and escape quotes), but this means that in the typical case the quotes will be needed because of the space after fullstop (`"The dog is red. "`) and we will probably still need \n and \t.
`# text =`, the token-level attributes `SpacesAfter` and `SpacesBefore` and actually also `SpaceAfter=No` are redundant. We still may want to keep them (at least `SpaceAfter=No`) as they are faster in the use cases when we need them at token level. However, if the main use case is to reconstruct the original raw sentence, then it is faster to use directly `# text =`.
In general (probably as everyone here), I would like to keep CoNLL-U simple, uncluttered and intuitive, especially in the typical use case. For this reason, I was originally against `SpacesAfter` and `SpacesBefore` as it complicates the specification and is redundant once we have `# text =`. However, if we want to standardize storing spaces between sentences, the original @foxik's proposal sounds good - there is no need to store \s at the end of each `# text =` because one space after each token is the default case. So `SpacesAfter` will be used only if there are multiple spaces between tokens or other whitespace than space. This is relatively rare, so a typical file will remain uncluttered. Thus my current position is neutral.
I also agree that:
`# text=` and plan to use it
`1-1`
FWIW, I do not like abusing the multi-word token ranges for single-word replacements (`1-1`). These are meant for grammar-conditioned contractions. Mere presence of such lines says something about the language, while this would be just about the particular tokenization/normalization approach. Although it could be distinguished by i being equal to j, I strongly oppose it.
Any modifications relating to just one word should stay at the line dedicated to that word. As said earlier, this relates to my proposal in #330. A MISC attribute containing the modified text (if the original stays in FORM; or the original text, if FORM is modified) is all what is needed.
I might have missed it - this thread is quite long - but did you consider text between sentences, e.g.
#text: I have a dog.
1 I
2 have
...
#text: \n\n
#text: The dog is red.
1 The
2 dog
...
So here would be a segment of text with two line breaks between the two sentences.
Cf. discussion on this over at WebAnno: https://github.com/webanno/webanno/issues/313#issuecomment-232486111
We did not reach a consensus (I think) but we did consider it. If it is represented it will be at least technically part of either the preceding or the following sentence. Whether it is part of the `#text` comment (which I personally do not welcome) or some other type of comment (either sentence-level comment, or token-level `MISC` attribute), is mostly what this thread is about.
Closing as v2 is now being published.
For reference, comments used to specify the original sentence are in v2 with the format `# text =` (http://universaldependencies.org/format.html). These permit but do not standardize escape sequences (e.g. `# text = ...\n\n`). `SpaceAfter` is likewise in and its use is strongly encouraged:
[...] information on original word segmentation should be kept if available. Every token after which there was no space in the original text should contain `SpaceAfter=No` in its MISC field.
`SpacesAfter` and `SpacesBefore` are not part of the v2 standard, but as the MISC field is free-form excepting the minimal constraint that it "has to be formatted as a list that can be split on the bar character (|) without special escaping", particular applications are not prohibited from using these.
CoNLL-U format currently does not specify how to represent all space characters of the original plain text. It only specifies the `SpaceAfter=No` feature denoting that the current token is not followed by a space in the original text.
However, the ability to represent all space characters would be useful both for:
Therefore I propose to add the following two features to the MISC column:
- `SpacesAfter=text`
- `SpacesBefore=text` (required because there can be spaces before the first token)
The new features have a very similar name to the existing `SpaceAfter`, but as they interact with each other, I think it is fine.
The new features can be present in any token (but not word, similarly to `SpaceAfter`) and have the following semantics:
- if a token has the `SpaceAfter=No` feature, it cannot have the `SpacesAfter` feature (it is ignored if it is present)
- if a token does not have `SpaceAfter=No` and has `SpacesAfter=spaces`, it was followed by `spaces` in the original text
- if a token has neither `SpaceAfter=No` nor `SpacesAfter`, it was followed by one space in the original text
- if `SpacesBefore=spaces` is present, the token was preceded by `spaces` in the original text
- `SpacesBefore` can be used only on the first token of every sentence
The content of the `SpacesBefore`/`SpacesAfter` features has to be escaped in such a way that it never contains SPACE, TAB, CR, LF, PIPE characters, and it should encode SPACEs efficiently. Many possibilities come to mind:
- `\u0020` and `\u007c`
- `\s` (space), `\t`, `\r`, `\n`, `\p` (pipe), `\\`
Personally I vote for the last possibility -- C-like escaping with `\s` and `\p`. It is true that some existing escaping system would help with decoding, but 1) the existing systems seem to handle space quite inefficiently, 2) an existing system does not help with encoding, as it most likely would not escape a space, 3) decoding the proposed system is a trivial sequence of replaces, 4) the proposed system has a unique representation of the original text (which is not true for many escaping systems, but I think it is a useful feature in this case).
If a consensus is reached, I will document the new features on the http://universaldependencies.org/format.html site.
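A possible implementation of the C-like scheme and of the `SpaceAfter`/`SpacesAfter` semantics is sketched below. The escape table follows the list above; the function names and the representation of MISC as a plain dictionary are assumptions. Decoding is done here in a single left-to-right pass rather than as a chain of string replaces, because a chain would misread an escaped backslash followed by the letter `s`.

```python
import re
from typing import Dict

# Escape table from the proposal: \s, \t, \r, \n, \p (pipe) and \\.
_ESCAPES = {" ": r"\s", "\t": r"\t", "\r": r"\r", "\n": r"\n", "|": r"\p", "\\": "\\\\"}
_UNESCAPES = {"s": " ", "t": "\t", "r": "\r", "n": "\n", "p": "|", "\\": "\\"}


def escape_spaces(text: str) -> str:
    """Encode text so it contains no SPACE, TAB, CR, LF or PIPE."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def unescape_spaces(value: str) -> str:
    """Decode in one pass so that each backslash pair is consumed exactly once."""
    return re.sub(r"\\([strnp\\])", lambda m: _UNESCAPES[m.group(1)], value)


def spaces_after(misc: Dict[str, str]) -> str:
    """Return the original text following a token, per the proposed semantics."""
    if misc.get("SpaceAfter") == "No":
        return ""  # SpacesAfter is ignored when SpaceAfter=No is present
    if "SpacesAfter" in misc:
        return unescape_spaces(misc["SpacesAfter"])
    return " "  # default: exactly one space


original = " \t\n|\\s"
encoded = escape_spaces(original)
print(encoded)
print(unescape_spaces(encoded) == original)
print(repr(spaces_after({"SpacesAfter": r"\s\s"})))
```

Because every special character has exactly one escaped form and everything else is copied unchanged, the encoding of a given string is unique, which is property 4) above.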