trantor closed this issue 6 years ago
I've just fixed the alignment for lines without a trailing newline (EOF) and pushed it in https://github.com/Tux/App-ccdiff/commit/3def309330a1ac780fd9f625bc75a87d7ff530b9. But as I am currently restructuring the file to be able to write tests, I cannot guarantee that I did not break anything.
The BOM (Byte Order Mark) and ZERO WIDTH NO-BREAK SPACE share the same Unicode codepoint, U+FEFF (http://unicode.org/faq/utf_bom.html#bom1). If it is the first character of a stream, it is a BYTE ORDER MARK; otherwise it is the ZERO WIDTH NO-BREAK SPACE. I learned something today :) As I use the charnames functionality, I use the name that Perl gives me for the codepoint, which is that long name.
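To illustrate the position-dependent reading described above, here is a minimal sketch in Python rather than Perl. It is not how ccdiff does it; the helper name `describe_feff` is made up for this example. Like Perl's charnames, Python's `unicodedata.name` reports the long name for U+FEFF regardless of position, so the "BYTE ORDER MARK" label has to be applied by hand for index 0.

```python
import unicodedata

FEFF = "\ufeff"  # BOM at start of stream, ZERO WIDTH NO-BREAK SPACE elsewhere


def describe_feff(text):
    """Return (index, codepoint, label) for every U+FEFF found in text.

    Hypothetical helper: labels the character by its position in the stream.
    """
    found = []
    for index, char in enumerate(text):
        if char == FEFF:
            # The Unicode database only knows the long name; the BOM
            # interpretation applies only to the very first character.
            label = "BYTE ORDER MARK" if index == 0 else unicodedata.name(char)
            found.append((index, f"U+{ord(char):04X}", label))
    return found


print(describe_feff("\ufeffabc\ufeff"))
# [(0, 'U+FEFF', 'BYTE ORDER MARK'), (4, 'U+FEFF', 'ZERO WIDTH NO-BREAK SPACE')]
```

Printing the codepoint next to the name makes it obvious that both labels refer to the same character.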
Now that's... bizarre. Anyway, you've got both U+00A0 and U+FEFF in the same line. At this point, wouldn't it be helpful to repeat ZERO WIDTH NO-BREAK SPACE, since there are two occurrences, and/or specify the codepoint as well (maybe as a -v3 output)?
It is not bizarre if you think about it: this enables concatenating streams without breaking them. If I fetch both files, I see no A0 in either. Can you put a .tgz or .zip with both files here?
On a practical level you are right. I just find context-dependent interpretation of a character something fraught with potential misunderstandings. And you are right about the absence of U+00A0. I got confused with a few sample files I tried out and actually got the bug description wrong. Scratch that. Silly me. Consider printing the codepoint for a possible v3 as a feature request. I'll edit the original bug description to correct my blunder.
No harm done. -v3 now shows codepoints (under -U only). Pull to verify.
Hello Tux. The man with the Zero-Width Space question at the conference here.
I was looking at cases such as the presence of Zero-Width Spaces and also the presence of a BOM character.
I took a couple of sample files from here and modified them a bit.
One of the files starts with a BOM character and ends with a ZERO WIDTH NO-BREAK SPACE. No trailing newline anywhere. Apparently the BOM character is displayed without being named in the verbose output. Also, the markers appear to have some alignment problem. From what I've seen, the latter phenomenon doesn't seem to occur with a much shorter string. Sample files attached below.
TEST_B.txt TEST_A.txt
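For anyone wanting to check a sample file for the three properties mentioned in the report (leading BOM, trailing ZERO WIDTH NO-BREAK SPACE, no final newline), the following is a rough sketch. It assumes the files are UTF-8 encoded; the function name `inspect_sample` and the inline sample string are invented for illustration and are not the contents of the attached files.

```python
import codecs


def inspect_sample(data: bytes) -> dict:
    """Report BOM / ZWNBSP / trailing-newline status of UTF-8 encoded bytes.

    Hypothetical helper; assumes the input decodes cleanly as UTF-8.
    """
    text = data.decode("utf-8")
    return {
        # U+FEFF as the first character of the stream acts as a BOM
        "starts_with_bom": data.startswith(codecs.BOM_UTF8),
        # U+FEFF anywhere else is a ZERO WIDTH NO-BREAK SPACE
        "ends_with_zwnbsp": len(text) > 1 and text.endswith("\ufeff"),
        "has_trailing_newline": text.endswith("\n"),
    }


# Made-up sample: BOM first, ZWNBSP last, no newline at EOF
sample = "\ufeffsome text\ufeff".encode("utf-8")
print(inspect_sample(sample))
# {'starts_with_bom': True, 'ends_with_zwnbsp': True, 'has_trailing_newline': False}
```

To run it against a real file, read the file in binary mode and pass the bytes in.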