Allow representing all space characters of the original text in the CoNLL-U format.

foxik commented 8 years ago

CoNLL-U format currently does not specify how to represent all space characters of the original plain text. It only specifies the SpaceAfter=No feature denoting that the current token is not followed by a space in the original text.

However, the ability to represent all space characters would be useful both for:

UD treebanks, where the exact (or closer) correspondence with the original corpus could be achieved,
NLP pipelines using CoNLL-U format, so that the original plain text can be reconstructed at any time

Therefore I propose to add the following two features to the MISC column:

SpacesAfter=text
SpacesBefore=text (required because there can be spaces before the first token)

The new features have very similar name to existing SpaceAfter, but as they interact with each other, I think it is fine.

The new features can be present in any token (but not word, similarly to SpaceAfter) and have the following semantics:

if a token has SpaceAfter=No feature, it cannot have SpacesAfter feature (it is ignored if it is present)
if a token does not have SpaceAfter=No and has SpacesAfter=spaces, it was followed by spaces in the original text
if a token does not have SpaceAfter=No nor SpacesAfter, it was followed by one space in the original text
if SpacesBefore=spaces is present, the token was preceded by spaces in the original text
the SpacesBefore can be used only on the first token of every sentence

The content of the SpacesBefore/SpacesAfter features has to be escaped in such a way that it never contains SPACE, TAB, CR, LF, PIPE characters, and it should encode SPACEs efficiently. Many possibilities come to mind:

JSON-like escaping: the space and pipe characters would have to be \u0020 and \u007c
HTML-like escaping using numbered entities like &#32;
custom C-like escaping allowing only the following escapes: \s (space), \t, \r, \n, \p (pipe), \\

Personally I vote for the last possibility -- C-like escaping with \s and \p. It is true that some existing escaping system would help with decoding, but 1) the existing systems seem to handle space quite inefficiently, 2) an existing system does not help with encoding, as it most likely would not escape a space, 3) decoding the proposed system is a trivial sequence of replaces, and 4) the proposed system has a unique representation of the original text (which is not true for many escaping systems, but I think it is a useful feature in this case).
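
A minimal sketch of the C-like scheme, assuming exactly the six escapes listed above (\s, \t, \r, \n, \p, \\). The function names `escape_spaces` and `unescape_spaces` are hypothetical. Decoding is done here in a single left-to-right pass rather than with chained replaces, because chained `str.replace` calls can misread an escaped backslash that happens to be followed by `s`; unknown sequences such as `\x` are left untouched.

```python
import re

# Escape table for the proposed scheme; no other character is escaped.
_ESCAPES = {"\\": "\\\\", " ": "\\s", "\t": "\\t", "\r": "\\r", "\n": "\\n", "|": "\\p"}
# Maps the letter after the backslash back to the raw character.
_UNESCAPES = {esc[1]: raw for raw, esc in _ESCAPES.items()}


def escape_spaces(text):
    """Encode text so that it contains no SPACE, TAB, CR, LF or PIPE."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def unescape_spaces(value):
    """Decode in one pass, so an escaped backslash followed by 's' stays backslash + 's'."""
    return re.sub(r"\\([\\strnp])", lambda m: _UNESCAPES[m.group(1)], value)
```

Because every raw character has exactly one encoding, `unescape_spaces(escape_spaces(x)) == x` holds for any string, which is the "unique representation" property mentioned in point 4.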

If a consensus is reached, I will document the new features on the http://universaldependencies.org/format.html site.

foxik commented 8 years ago

Also if a consensus is reached, it will be implemented in UDPipe.

dan-zeman commented 8 years ago

Why would you use \p? "|" is not a whitespace character, thus it should be a normal token and not a SpacesAfter attribute. On the other hand, people may need to encode TAB, CR and LF.

foxik commented 8 years ago

I can imagine that in some circumstances the SpacesAfter/SpacesBefore might contain non-space characters. Although this is not useful from the point of view of UD treebanks, it is useful when using CoNLL-U as a pipeline format. For example, if the input is some kind of XML file with elements encoding some meta information, it would make sense to consider the XML elements "non-tokens" and save them to SpacesAfter and SpacesBefore. I.e., <h1>Hello</h1> could be encoded as Hello\t...\tSpacesAfter=</h1>|SpacesBefore=<h1>

The names of the features can be misleading in this use case and maybe should be different, something like TextAfter/Before, NontokenAfter/Before or IgnoredAfter/Before. But I did not particularly like any of those, and SpacesAfter corresponds to SpaceAfter, with which it interacts, so I proposed SpacesAfter/Before, while assuming it could contain non-space characters.

I am definitely open to different feature names, if you have any suggestions :-)

dan-zeman commented 8 years ago

Fine with me. Then it actually is a "space" from the point of view of treebank annotation. But this should be briefly mentioned in the documentation, too.

foxik commented 8 years ago

Noted, I will mention it if the proposal gets in.

Do you have any thoughts about the escaping scheme?

dan-zeman commented 8 years ago

Not sure whether we also want/have to escape the other space characters in the higher Unicode space. Otherwise, the \s etc. sound reasonable to me. First, it is short => readable. Second, it is what I often use elsewhere (but this is quite subjective argument of course).

Note that the MISC attributes are not required to appear in alphabetical order (unlike FEATS) and I think it is OK so. Thus we may also use the logical order here, i.e. SpacesBefore=...|SpacesAfter=....

As this is something that does not directly interfere with existing UD guidelines, nor with released data, I think we do not even have to wait for v2 guidelines with it. But let's wait to see whether the other UD'ers have something to say here.

liljao commented 8 years ago

I definitely second this proposal! Based on our recent effort to align the Norwegian treebank with the corresponding raw texts we felt there was important information lost regarding spacing.

Just a minor detail: should there be an order of preference for the spacesBefore=...|spacesAfter=...? E.g. in sentence1 \n sentence2, should we use spacesAfter on the last token of sentence1 or spacesBefore on the first token in sentence2?

foxik commented 8 years ago

Regarding the preference -- the current proposal allows SpacesBefore on the first token of every sentence. Therefore, both possibilities in the above example are valid.

In theory it would be possible to allow SpacesBefore on the first token of the first sentence only (i.e., in the above example, SpacesAfter on last token of sentence1 would have to be used). However, such a rule is inconvenient in situations, when multiple "documents" (from a logical point of view, marked for example using sent_id) are encoded in one CoNLL-U file, because in this situation it makes sense to use SpacesBefore on beginning of every (logical) document.

Another example is an NLP pipeline, which processes data "on the fly" -- in this situation, every paragraph of the plain text can be processed individually (sentences cannot go over paragraph boundary). If a paragraph starts with spaces, it is more convenient to store the spaces using SpacesBefore on the first token of this paragraph, than in SpacesAfter on the last token of the previous sentence (because that would mean the previous sentence could not be processed until a following non-space character is found).

So I would say that SpacesAfter are preferred if possible, but SpacesBefore are allowed on first token of any sentence.

sebschu commented 8 years ago

Yes, I also think that would be a great thing to have (we already have something similar in CoreNLP so that people can reconstruct sentences after tokenization) and I'm also in favor of C-like escaping.

Instead of having the SpacesBefore attribute as a token property, we could also consider having an optional sentence-level comment before each sentence. Then we wouldn't have this token-level property that is only allowed to be used on the first token, which might be confusing to some people. Or we could just allow it on every token.

dan-zeman commented 8 years ago

I would not make it a sentence-level comment when the remaining pieces are token-level attributes. Allowing SpacesBefore on every token sounds OK to me, in the sense that the validator would not complain, and processing software should be able to deal with it. Generating it on non-first tokens would not be recommended in the normal cases, but e.g. when it is used to hide XML markup, as @foxik suggests above, it may be sometimes even more intuitive, so I would not ban it.

foxik commented 8 years ago

Allowing SpacesBefore everywhere sounds reasonable (actually, the same argument I used for allowing SpacesBefore in every sentence can be used for allowing it on every token, as Dan does above).

I will prepare the changes as a pull request, so that others can discuss them. (But probably at the beginning of the next week.)
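
A rough sketch of how the original text could be rebuilt under the rules discussed so far, assuming SpacesBefore is allowed on any token, SpacesAfter is ignored when SpaceAfter=No is present, and a token with neither attribute was followed by one space. The helper names (`unescape`, `parse_misc`, `reconstruct`) are hypothetical, and the input is simplified to (FORM, MISC) pairs of surface tokens.

```python
import re

_UNESCAPES = {"\\": "\\", "s": " ", "t": "\t", "r": "\r", "n": "\n", "p": "|"}


def unescape(value):
    """Decode the C-like escapes (\\s, \\t, \\r, \\n, \\p, \\\\) in one pass."""
    return re.sub(r"\\([\\strnp])", lambda m: _UNESCAPES[m.group(1)], value)


def parse_misc(misc):
    """Split a MISC column such as 'SpacesBefore=\\s|SpaceAfter=No' into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def reconstruct(tokens):
    """Rebuild plain text from (form, misc) pairs of surface tokens."""
    parts = []
    for form, misc in tokens:
        attrs = parse_misc(misc)
        parts.append(unescape(attrs.get("SpacesBefore", "")))
        parts.append(form)
        if attrs.get("SpaceAfter") == "No":
            continue  # SpacesAfter is ignored when SpaceAfter=No is present
        # Default: exactly one space followed the token in the original text.
        parts.append(unescape(attrs["SpacesAfter"]) if "SpacesAfter" in attrs else " ")
    return "".join(parts)


tokens = [
    ("Hello", "SpacesBefore=\\s\\s|SpaceAfter=No"),
    (",", "_"),
    ("world", "SpacesAfter=\\n\\n"),
]
text = reconstruct(tokens)
```

Since escaped values never contain a pipe, splitting MISC on `|` before unescaping is safe.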

andre-martins commented 8 years ago

Is SpacesBefore/SpacesAfter covering all cases of interest? Some tokenizers do some small reorderings sometimes (e.g. switching double quotes and punctuation, depending on the language) and normalize characters (e.g. punctuation and some unicode characters). Do we want to support this? The standard approach here is to have the full sentences somewhere (maybe in a separate file) and to encode the character-level start/end positions of each word token.

fginter commented 8 years ago

Hi

Sorry I'm a bit late to this discussion. I think @andre-martins follows the right lead here, with the character offsets. I myself would much rather see that, than a complex mixture of SpacesBefore/SpacesAfter and escaping. A single attribute charoff=3:7 saying that characters 3,4,5,6 of the original sentence form this token, would imho cover all situations.

Filip
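
To illustrate the offset idea: a small sketch, assuming that charoff=3:7 is zero-based with an exclusive end (the comment only says that it covers characters 3, 4, 5 and 6). The function name `token_from_charoff` is hypothetical.

```python
def token_from_charoff(sentence, charoff):
    """Return the substring of the original sentence addressed by 'start:end'."""
    start, end = (int(n) for n in charoff.split(":"))
    return sentence[start:end]


token = token_from_charoff("My nice dog", "3:7")
```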

dan-zeman commented 8 years ago

SpacesBefore/After would probably be a rare feature while CharOff would have to be present at every single token in the data (and the full sentence would have to be there in addition, of course).

But I agree that if replacements and reorderings have to be supported then this is the way to go. Not sure whether they have to, though.

foxik commented 8 years ago

I think both use cases are useful at times. Sometimes you want to reconstruct the data at some point (for example when you started with a XML) and you would like all the data to be in the document -- then SpacesBefore and SpacesAfter seem the straightforward way to do it.

Note that if we want to use something like TokenRange=3:5, we should also specify how to encode the original sentence -- the easiest would probably be to encode the sentence in a comment (only the \r and \n would have to be escaped).

When I started with the proposal, I was thinking about suggesting both alternatives, but then I decided to go with just the SpacesBefore/After. But you are right, the SpacesBefore/After do not support reorderings and replaces.

BTW, if we want the TokenRange to be more effective (and also playing together with SpaceAfter), we could define a reasonable default -- the length would be the length of the form, and the offset would be 0 for the first token, and either last_token_end+1 or last_token_end (depending on whether previous token has SpaceAfter=No). This way, the TokenRange would be needed only when 1) the original token has different length than the form, or 2) there is more than 1 "space" character between the tokens.
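
The defaults described above could be computed roughly as follows. This is only a sketch: it assumes offsets are zero-based with an exclusive end (so that the next token starts at last_token_end when there is no space), and the function name `default_ranges` is hypothetical.

```python
def default_ranges(tokens):
    """tokens: list of (form, has_space_after_no); returns a list of (start, end)."""
    ranges = []
    end = 0
    glued = True  # the first token starts at offset 0, with nothing before it
    for form, has_space_after_no in tokens:
        # last_token_end if the previous token had SpaceAfter=No, else last_token_end + 1
        start = end if glued else end + 1
        end = start + len(form)
        ranges.append((start, end))
        glued = has_space_after_no
    return ranges


ranges = default_ranges([("Hello", True), (",", False), ("world", False)])
```

An explicit TokenRange would then be needed only for tokens whose real offsets differ from these computed values.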

So if we want a more general correspondence with the original text, I would:

specify how the original sentence can be encoded in the sentence-level comment
define TokenRange with the defaults described above

Thoughts?

fginter commented 8 years ago

I think the defaults are a good idea, so we do not clutter the file. Orig sentence as a comment is I think the way to go (we already do this for Finnish). I know that some treebanks will not be able to provide the orig sentence, but will be able to provide the SpaceAfter=No (legal reasons), so that usecase is also supported.

foxik commented 8 years ago

(BTW, even if you could not provide the original sentence, but could provide the spaces only, it would still be possible to do it -- the sentence would consist of spaces only and the TokenRanges would have zero length :-)

Note that "standardizing" the sentence in the comment interacts with #273 -- especially the format. The last suggestion in #273 is ^# ([a-z_]+) *= *(.*)$, but that has a serious disadvantage -- the spaces at the beginning of the sentence cannot be represented (because all spaces after the = are "eaten up"). Also #273 suggests sentence (and some other variants) as the name of the attribute containing the sentence text.

In order to deal with the spacing issue, I suggest we use ^# [a-z_0-9]+=(.*)$ in #273, and also define escaping rules (\r, \n and \\). Then I also suggest to use text instead of sentence to store the text of the sentence (because sentence suggests also other meanings like sentence id, while text does not).
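
The difference between the two patterns can be checked directly. The sketch below uses both regular expressions exactly as written in this thread; the variable names and sample comment lines are made up for illustration.

```python
import re

LOOSE = re.compile(r"^# ([a-z_]+) *= *(.*)$")  # last suggestion in #273
STRICT = re.compile(r"^# [a-z_0-9]+=(.*)$")    # pattern suggested above

# With the loose pattern, the spaces after "=" are consumed before the value.
loose_value = LOOSE.match("# text =   Hi").group(2)

# With the strict pattern, everything after "=" belongs to the value.
strict_value = STRICT.match("# text=  Hi").group(1)

# The strict pattern does not accept a space before "=".
strict_rejects_spaced = STRICT.match("# text =  Hi") is None
```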

foxik commented 8 years ago

BTW, it seems to me that the SpacesBefore/SpacesAfter are simpler than the sentence_text+ranges+defaults [because they are quite straightforward and do not require sentence-level comments], and slightly more effective when no reorderings and replacements are used [as they do not need to store the original tokens]; however, they do not support reorderings and replacements. One possibility would be to handle the reorderings/replacements differently (replacements by using for example OriginalToken=... and reorderings like ReorderOffset=).

On the other hand, storing the whole sentence in the comment seems to be a commonly asked-for feature.

dan-zeman commented 8 years ago

@foxik I second that.

I don't see why we should support reordering. To me it is something else than what UD is about, and people who want to do it in CoNLL-U should define the mechanism separately.

Supporting replacement sounds reasonable, it also interacts with handling of typos, see #330, but then something like OriginalToken=... is quite OK.

Full text of sentence (the # text comment) can still be provided; as long as you do not count inter-sentence whitespace as part of any sentence, it is much easier. That could be covered by the SpacesBefore/After attributes, which would be the main device for reconstructing the source text.

fginter commented 8 years ago

For the record: I think the current SpaceAfter=No was just fine as it was and I personally do not see CoNLL-U as a universal data communication format. If the original sentence text is given as a comment, the tokens can be found in it, if someone really cares about extra spaces. So my primary vote would go to "include the sentence as a comment, leave the rest as is".

But if this still needs to be redesigned... ...the SpacesBefore/After just seems quite opaque with all the escaping, ambiguities of where SpacesBefore can be used, etc. Character offsets supports all situations I can think of.

martinpopel commented 8 years ago

I would like to know what are the use cases for

preserving the distinction between a single space and multiple spaces
preserving the distinction between a standard space and CR/LF/TAB or other Unicode spaces (e.g. non-breaking space and U+2000...U+200B)
preserving the info about (missing) spaces between sentences

Of course, all these are necessary for reconstructing the original raw text, but my question is for what real-world use cases do you need to reconstruct it so exactly?

Even with all the suggestions above, we may not be able to reconstruct the original fully: the original may not be in utf8 (not even in Unicode) or it may contain invalid Unicode encoding. Note also that for preserving the original (XML) markup SpacesAfter is not enough. There may be markup inside tokens.

As always there are pros and cons. By preserving these distinctions we complicate more or less the CoNLL-U specification. I can imagine spaces between sentences are important for training segmenters on noisy data (where such spaces are missing), but if this is the only application, we could sacrifice the possibility of using CoNLL-U for such training.

That said, I like @fginter 's suggestion "include the sentence as a comment, leave the rest as is". If someone needs character offsets for the explicit alignment, they can always use it, but I am not sure we should standardize it now.

amir-zeldes commented 8 years ago

I think the use case for preserving whitespace has to do with aligning the dependencies to a different format. Some multilayer corpora might have typographical or document-structural annotations that refer to stand-off offsets in an original plain text or HTML file, often with XML + x-pointer (examples include the PAULA XML format and GrAF). NLP tools might also need to return dependency information back to a representation that preserves whitespace (for example, the spacy toolchain keeps it throughout).

Even within a pure treebank, sometimes the reason for an automatic sentence split by a sentence tokenizer is based on multiple white space (think of a date and author line not separated by period, but by multiple white spaces), so that the whitespace lets you reconstruct why sentence splitting was done in a certain way.

Also @dan-zeman - I know you opposed delimiters for original sentence text in #273 but maybe it's worth considering it given some of the problems that SpaceBefore/After and offsets raise. Then offsets could be marked only if they are non-trivial with respect to the original sentence text, and normal spaces before and after could go unannotated.

dan-zeman commented 8 years ago

@amir-zeldes - I would personally prefer saying that characters before and after the last token of a sentence are not part of the sentence. If it is necessary to store such inter-sentence material, it should not go to the text attribute, and (on a second thought) I would not put it to individual tokens either. I would put it to sentence-level comment different from text. And I now lean towards not standardizing it here.

foxik commented 8 years ago

I agree that having a sentence comment relying on spaces is weird.

Note that there seemed to be people interested in the functionality (preserving multiple spaces). There are also people in companies asking us about this. I think it would be a shame not to deal with the multiple spaces between tokens just because it seems complicated and inelegant.

Personally I am in favor of the original proposal -- the SpacesAfter and SpacesBefore. The cons that have been brought up:

escaping: yes, escaping is a hassle, but any space-preserving approach will have to deal with this as spaces and newlines have specific meaning in CoNLL-U; also if you are not interested in multiple spaces, you can just ignore the escaping
where to put them: in the end, we decided the features can be put anywhere (with a recommendation to prefer SpacesAfter)

Pros from my point of view: except for the technical issue of escaping, SpacesBefore/SpacesAfter seem to be straightforward and obvious. If someone is happy with SpaceAfter, they can keep using it, but if they need more functionality, they can extend it easily.

I think it makes little sense to standardize the character offsets without allowing spaces in various places of the original sentence text (i.e., disallowing newlines inside and at the end, which is not possible without newlines) -- it cannot be used to preserve the spaces.

Just a motivation for preserving spaces at the end of a sentence -- in English, some sentence breaks seem to be forced by paragraph ends in the text. If we had the full spacing information, we would know it, and would not train the segmenter on these cases.

foxik commented 8 years ago

I am not sure what the current consensus is. To sum up:

me, @liljao, @sebschu expressed interest in the proposal
@dan-zeman seemed neutral
@andre-martins was concerned if we wanted to also support replacements and reorderings (@dan-zeman and I stated that we probably do not)
@fginter was against the SpacesAfter/Before design and suggested to use character offsets, possibly including the sentence in the comment
@martinpopel liked @fginter's suggestion to "include sentence in the comments and leave the rest as it is"
the "sentence in the comments" would probably be stored without preceding/following spaces

Personally I am still interested in the proposal (the SpacesAfter/SpacesBefore with escaping), for the following reasons:

The inter-sentence spaces sometimes imply sentence splitting (humans also split sentences using visual cues like font size and line breaks). Without this information, it is more complicated to infer sentence breaks in a raw text. The inter-sentence spaces might also suggest paragraph or document breaks. Note that inter-sentence spaces are not covered by the "sentence in the comment" approach.
Ability to reconstruct the original text. Note that not all spaces are equivalent -- it makes sense to keep, for example,
- non-breakable spaces
- word joiners (characters which can be used to split words but which are not visible spaces -- similar to SpaceAfter=No, but they are present in the text to indicate the word boundary)
The reconstruction ability is useful both for UD treebanks, and for NLP pipelines (where not only spaces, but also some metadata might be reconstructed)

I understand that I can use the feature without standardizing it, but I believe it is useful to more people, so I think it would be better to standardize it (and not let multiple people reinvent it, each dealing with the tricky corners like escaping and inter-sentence spaces differently). Also note that if someone is not interested in more details than SpaceAfter=No provides, they are not affected at all.

Any thoughts, anyone?

fginter commented 8 years ago

Hi. My thoughts:

I think @foxik is right that storing the whitespace around the sentence and, in effect, being able to learn sentence splitters from the data is a good property
I still don't think the MISC field is the place for this stuff to go because of the heavy escaping needed
I think it would be fine to say (and trivial to implement) that all whitespace after # text: matters and if you catenate the texts with # text: removed, you get the original document in a pure text form.
The current requirement is (and I think should remain) that concatenating all tokens will give you the exact non-whitespace characters in the # text: field. Tokenizers should not eat or invent characters. This has the happy consequence that finding out the exact offsets of the tokens in the text, reconstructing any whitespace before or after, etc is pretty much a single for loop and can be coded into whatever libraries we have for dealing with conllu.
If a tokenizer really finds it important to replace double quotes with something else, why not use the token-word distinction. I.e. the token is whatever is in the sentence, and the word is whatever the tokenizer wants to make it and the word is then part of the tree.
I don't think reordering, intervening HTML, etc are a thing to add into conll-u. I think this format is primarily meant for communicating UD trees and just won't bend into a universal document encoding format.
I think allowing any extra whitespace as part of the text metadata item will give us the ability to learn sentence splitters and tokenizers from the conllu data, which is a fine feature to have.
Only newline and backslash would need to be escaped, everything else can stay verbatim.
We can keep nospaceafter as a simple approximation for those who just want a basic tokenizer which will still work in the vast majority of the cases.

How about?

F

fginter commented 8 years ago

A text like:

I have
a dog.

The dog
is red.

would then look like

#text: I have\na dog.\n\n
1 I
2 have
...

#text: The dog\nis red.
1 The
2 dog
...

Note:

Of course the newlines between the sentences can be split any which way between them, but I don't think it really matters. People will simply have to deal with it. Or we say that except for the first sentence, whitespace is only at the end. Or something such.
Most treebanks won't have this data, seeing they don't have the original texts either. :)
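
A sketch of the concatenation idea, assuming that only newline and backslash are escaped (as suggested in the list above) and that the text values have already been extracted from the comment lines. The function names `unescape_text` and `join_document` are hypothetical.

```python
import re


def unescape_text(value):
    """Undo the two escapes: backslash-n becomes a newline, double backslash a backslash."""
    return re.sub(r"\\([\\n])", lambda m: "\n" if m.group(1) == "n" else "\\", value)


def join_document(text_values):
    """Concatenate the unescaped sentence texts into the original document."""
    return "".join(unescape_text(value) for value in text_values)


document = join_document([r"I have\na dog.\n\n", r"The dog\nis red."])
```

With the two example sentences above, `document` is the four-line text again, including the empty line between the two sentences.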

foxik commented 8 years ago

Thanks @fginter for writing. My thoughts:

if we have the whole sentence including inter-sentence spaces, that would give us all features for UD treebanks
note that spaces at end of sentences are still problematic -- preserving spaces at end of lines is not very reliable (some people's editors might automatically remove them); a similar issue arises at the beginning of the sentence (if we use the syntax from #273, is text = Hi the same as text =  Hi, with two spaces before Hi?)
the inter-sentence space issues could be solved by escaping spaces and tabulators (\s and \t; only inter-sentence are really needed, spaces inside sentences could stay normal, even if it may be a bit surprising)
we should probably also escape \r if we are escaping \n
personally I think you are wrong about CoNLL-U being used only as UD treebanks format -- we are already using it in NLP pipelines and I believe it will become even more common

BTW, note that the "sentence in the comments" approach has the same issues as SpacesBefore/SpacesAfter -- there will be some escaping, and it is ambiguous where to put inter-sentence spaces. Also both play nice with SpaceAfter=No. So I believe the issues are generic for the problem we are trying to solve.

From my point of view, SpacesBefore/SpacesAfter are still the first choice -- the information is directly attached to the tokens (and not in the comments), so you do not need to reconstruct the information if you are interested in knowing "what spaces followed this token". In the "sentence in the comment" approach, even for the "what spaces follow the sentence" question you have to do something like finding the last token in the sentence text first, because it is nontrivial to say which characters at the end of the sentence text do not belong to the last token (some space-like characters are not in the Zs Unicode category -- if you find a character in the Cf category, which may really happen, does it belong to the token or is it a space between sentences?). Also, for common cases (single spaces between tokens), the SpacesBefore/SpacesAfter are more efficient (they would not be present, except probably for SpaceAfter on the last token; while sentence text has to repeat all tokens and spaces).

PS:

If a tokenizer really finds it important to replace double quotes with something else, why not use the token-word distinction. I.e. the token is whatever is in the sentence, and the word is whatever the tokenizer wants to make it and the word is then part of the tree.

Yes, using token-word distinction for replacements is an elegant approach I was thinking about in the past. However, it is not obvious whether it is allowed by the current CoNLL-U format, because only "multi-word" tokens seem to be allowed. Specifically, I assume that single-word tokens would use a range like x-x, so something like:

1-1 ˝
1 "

If you think it is worth it (I do, so that would be two of us), we could create another issue discussing it.

fginter commented 8 years ago

I would simply say that the text property is special in the sense that whitespace matters and that's it
There's so many whitespace and invisible characters in Unicode that I would simply only escape the newlines and nothing else - that is the only thing that must be escaped as far as I can see
I don't have a solution for "editors silently eat line-final whitespace" - do we really have to care? If yes, we'd need to do something like # text: <textgoeshere>#endtext ... I'd kind of hate to have to do this, but ...well... a hack is a solution too :)
Saying "only use initial whitespace on the first sentence and all other is trailing" resolves ambiguity on where to put it
I'm sure people use conllu in pipelines - I know I do - but I think we need to draw a line on what we standardize and what we leave to peoples' creativity. I think right now, with the text, we'd have a very broad set of things covered.

I think

1-1 ˝
1 "

would be quite okay.

fginter commented 8 years ago

PS:

yes, you can reconstruct where the tokens start and end because they are repeated in the FORM fields of the syntax annotation, so you can loop over characters in the FORM fields and keep a counter in the text line. I've done that many times, it's just a few lines of code and could be in a library too. I don't think we need to assume all information is pre-generated and present in the conllu files.
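
Such a loop might look like the sketch below. It assumes that the forms contain no whitespace, that concatenating them gives exactly the non-whitespace characters of the text, and that Python's `str.isspace()` is an acceptable definition of whitespace (it does not cover every invisible Unicode character). The function name `align_forms` is hypothetical.

```python
def align_forms(text, forms):
    """Return (start, end) offsets of each form in text, skipping whitespace between them."""
    offsets = []
    pos = 0
    for form in forms:
        # Advance the counter over any whitespace before the next token.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if not text.startswith(form, pos):
            raise ValueError(f"form {form!r} not found at offset {pos}")
        offsets.append((pos, pos + len(form)))
        pos += len(form)
    return offsets


offsets = align_forms("I have\na dog.", ["I", "have", "a", "dog", "."])
```

The whitespace before or after a token is then simply the text between two consecutive ranges.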

foxik commented 8 years ago

Thanks @fginter for your answers. I think we both know each other's positions (and that they are different :-)

Any other opinions and preferences, anyone?

spyysalo commented 8 years ago

# text: (or # sentence-text:) is already used and there appears to be general support for standardizing it in #273. As this proposal has (partially) overlapping goals and no clear consensus, I would suggest to postpone decisions until # text: is standardized (or removed, if so happens).

After we know how # text: gets defined, we can re-examine whether there are use-cases for spacesAfter and spacesBefore that it doesn't cover.

I would generally prefer to avoid redundant ways to specify the same information, as this adds unnecessary complexity and ways to have internally inconsistent data.

(FWIW, although I'm not deeply invested in this feature, I'm not thrilled with SpaceAfter and would prefer not to extend it; it just doesn't feel like the Right Thing™ to me.)

dan-zeman commented 8 years ago

OK, here is my position:

I prefer to have all this information at token level and not in sentence comments. So that I can collect the original raw text while iterating over tokens, and ignoring the comments.
No objection to # text:, but as @spyysalo writes, it would then be redundant, which has a downside, too.
However, I still object to the trailing whitespace being significant anywhere (this is not about invisibility in the first place, but about unreliability of editors, so I am speaking only about \s\t\r\n but not about control or non-ASCII characters).
I don't object to standardizing SpacesAfter and SpacesBefore within the UD documentation but I don't feel the urge to do so. If you (@foxik) manage to use it for what I believe you want to use it, then it will become a standard without even being mentioned on the UD website :-)

martinpopel commented 8 years ago

And here are my opinions:

We still need # text = for other purposes (e.g. in pipelines for storing documents after segmentation and before tokenization). I think we have agreed in #273 we want to standardize it (no one objected), we are just discussing details of the format (= vs :). I would suggest that # text = is not required in CoNLL-U, but recommended (and maybe required in future UD treebanks).
We can allow \n\r\s\t in # text =. The disadvantage would be that in a typical case (one space between sentences), there will be extra \s at the end. In agreement with @dan-zeman, I still consider this much better than relying on editors not deleting trailing whitespaces. Alternatively, we could enclose the whole sentence text in quotes (and escape quotes), but this means that in the typical case the quotes will be needed because of the space after fullstop ("The dog is red. ") and we will probably still need \n and \t.
With such sentence-level attribute # text =, the token-level attributes SpacesAfter and SpacesBefore and actually also SpaceAfter=No are redundant. We still may want to keep them (at least SpaceAfter=No) as they are faster in the use cases when we need them at token level. However, if the main use case is to reconstruct the original raw sentence, then it is faster to use directly # text =.
I agree with @foxik and @fginter we should introduce one-word tokens (e.g. for 1-1 ˝ -> 1 ") and discuss it in a new issue.
I think we cannot reliably store HTML (or other markup) in CoNLL-U even with all the suggestions in this thread (e.g. markup inside words).

In general (probably as everyone here), I would like to keep CoNLL-U simple, uncluttered and intuitive, especially in the typical use case. For this reason, I was originally against SpacesAfter and SpacesBefore as it complicates the specification and is redundant once we have # text =. However, if we want to standardize storing spaces between sentences, the original @foxik's proposal sounds good - there is no need to store \s at the end of each # text = because one space after each token is the default case. So SpacesAfter will be used only if there are multiple spaces between tokens or other whitespace than space. This is relatively rare, so a typical file will remain uncluttered. Thus my current position is neutral.
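To make the token-level semantics concrete, here is a minimal Python sketch of rebuilding a sentence's raw text from FORM and MISC values. It assumes the rules proposed at the top of this thread (SpaceAfter=No wins, SpacesAfter gives the exact whitespace, otherwise one space) and the C-like escapes \s, \t, \r, \n, \p, \\. The function names are hypothetical, and none of this is part of any agreed specification.

```python
# Hypothetical helpers; the escape set (\s, \t, \r, \n, \p, \\) follows the
# C-like escaping proposed at the top of this thread, not any agreed rule.
_ESCAPES = {"s": " ", "t": "\t", "r": "\r", "n": "\n", "p": "|", "\\": "\\"}


def unescape_spaces(value):
    """Decode the C-like escapes in a single left-to-right pass."""
    out = []
    i = 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value) and value[i + 1] in _ESCAPES:
            out.append(_ESCAPES[value[i + 1]])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


def parse_misc(misc):
    """Split a MISC column on '|' into a dict; '_' means the column is empty."""
    if misc == "_":
        return {}
    return dict(item.partition("=")[::2] for item in misc.split("|"))


def reconstruct(tokens):
    """Rebuild raw text from a list of (FORM, MISC) pairs for one sentence."""
    parts = []
    for index, (form, misc) in enumerate(tokens):
        attrs = parse_misc(misc)
        # SpacesBefore is only meaningful on the first token of a sentence.
        if index == 0 and "SpacesBefore" in attrs:
            parts.append(unescape_spaces(attrs["SpacesBefore"]))
        parts.append(form)
        if attrs.get("SpaceAfter") == "No":
            continue  # SpaceAfter=No wins; any SpacesAfter is ignored
        if "SpacesAfter" in attrs:
            parts.append(unescape_spaces(attrs["SpacesAfter"]))
        else:
            parts.append(" ")  # default: exactly one space
    return "".join(parts)
```

In the typical case no token carries SpacesAfter at all, so the default branch does all the work and the file stays uncluttered.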

amir-zeldes commented 8 years ago

I also agree that:

SpaceBefore / SpaceAfter can be useful and should be allowed
I personally prefer # text= and plan to use it
I support the suggestion to allow one word tokens à la 1-1
Just because we're all reiterating positions :) much like @martinpopel also mentioned, I prefer delimiting with quotes to trailing \s, but this may be an aesthetic preference (# text="This is some sentence. "). It's pretty easy to just chop off the leading and trailing quotations, and editors won't truncate anything.
Note that delimiters also solve the leading space problem, and technically a trailing \s could be any whitespace (I thought the standard for escaping the actual space symbol is '\ ' like in *NIX). With delimiters you only need to escape the ", \n, \r since tabs will be allowed inside the delimiters, spaces are fine too, and any type of Unicode space-like character as well.
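As an illustration of the quote-delimited variant, here is a minimal Python sketch that writes and reads a # text="..." comment. It escapes the double quote, \n and \r as suggested above, and additionally escapes the backslash itself, which is an assumption added here so that decoding is unambiguous. Both function names are hypothetical.

```python
# Hypothetical sketch of the quote-delimited "# text=" comment.
# Escaping the backslash is an added assumption, needed for a lossless round trip.
_UNESCAPE = {"n": "\n", "r": "\r", '"': '"', "\\": "\\"}
_PREFIX = '# text="'


def encode_text_comment(sentence):
    """Wrap the raw sentence text in quotes, escaping \\, ", LF and CR."""
    escaped = (
        sentence.replace("\\", "\\\\")  # must come first
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return _PREFIX + escaped + '"'


def decode_text_comment(line):
    """Strip the delimiters and undo the escapes in one left-to-right pass."""
    if not (line.startswith(_PREFIX) and line.endswith('"') and len(line) > len(_PREFIX)):
        raise ValueError("not a quoted text comment: %r" % line)
    body = line[len(_PREFIX):-1]
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body) and body[i + 1] in _UNESCAPE:
            out.append(_UNESCAPE[body[i + 1]])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```

Leading and trailing spaces, tabs and other Unicode space characters pass through unchanged because they sit inside the delimiters.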

dan-zeman commented 8 years ago

FWIW, I do not like abusing the multi-word token ranges for single-word replacements (1-1). These are meant for grammar-conditioned contractions. Mere presence of such lines says something about the language, while this would be just about the particular tokenization/normalization approach. Although it could be distinguished by i being equal to j, I strongly oppose it.

Any modifications relating to just one word should stay on the line dedicated to that word. As said earlier, this relates to my proposal in #330. A MISC attribute containing the modified text (if the original stays in FORM; or the original text, if FORM is modified) is all that is needed.

reckart commented 7 years ago

I might have missed it - this thread is quite long - but did you consider text between sentences, e.g.

#text: I have a dog.
1 I
2 have
...

#text: \n\n  

#text: The dog is red.
1 The
2 dog
...

So here would be a segment of text with two line breaks between the two sentences.

Cf. discussion on this over at WebAnno: https://github.com/webanno/webanno/issues/313#issuecomment-232486111

dan-zeman commented 7 years ago

We did not reach a consensus (I think) but we did consider it. If it is represented it will be at least technically part of either the preceding or the following sentence. Whether it is part of the #text comment (which I personally do not welcome) or some other type of comment (either sentence-level comment, or token-level MISC attribute), is mostly what this thread is about.

spyysalo commented 7 years ago

Closing as v2 is now being published.

For reference, comments used to specify the original sentence are in v2 with the format # text = (http://universaldependencies.org/format.html). These permit but do not standardize escape sequences (e.g. # text = ...\n\n). SpaceAfter is likewise in and its use is strongly encouraged:

[...] information on original word segmentation should be kept if available. Every token after which there was no space in the original text should contain SpaceAfter=No in its MISC field.

SpacesAfter and SpacesBefore are not part of the v2 standard, but as the MISC field is free-form excepting the minimal constraint that it "has to be formatted as a list that can be split on the bar character (|) without special escaping", particular applications are not prohibited from using these.
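For anyone checking files against these v2 conventions, here is a rough Python sketch that compares the # text = comment of one sentence with the text implied by FORM and SpaceAfter=No. It assumes ten tab-separated columns, the exact prefix "# text = ", that a multiword-token range line carries the surface form, and that empty nodes (IDs such as 1.1) are not part of the text. The function name is hypothetical, and trailing spaces are ignored on both sides.

```python
def sentence_text_matches(block):
    """block: the lines of one CoNLL-U sentence (comments plus token lines)."""
    prefix = "# text = "
    declared = None
    parts = []
    skip_until = 0  # last word ID covered by the current multiword token
    for line in block.splitlines():
        if line.startswith(prefix):
            declared = line[len(prefix):]
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        token_id, form, misc = cols[0], cols[1], cols[9]
        if "." in token_id:
            continue  # empty node, not part of the surface text
        if "-" in token_id:
            skip_until = int(token_id.split("-")[1])
        elif int(token_id) <= skip_until:
            continue  # word inside a multiword token; the range line has the form
        parts.append(form)
        # MISC can be split on '|' without special escaping.
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    if declared is None:
        return False
    return "".join(parts).rstrip(" ") == declared.rstrip(" ")
```

Escape sequences inside # text = are not interpreted here, since v2 permits but does not standardize them.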

Allow representing all space characters of the original text in the CoNLL-U format. #332