Tux / App-ccdiff

A colored diff that also colors inside changed lines
Artistic License 2.0

Alignment issues in long strings without trailing newlines #14

Closed: trantor closed this issue 6 years ago

trantor commented 6 years ago

Hello Tux. The man with the Zero-Width Space question at the conference here.

I was looking at cases such as the presence of Zero-Width Spaces and also the presence of a BOM character.

I took a couple of sample files from here and modified them a bit.

One of the files starts with a BOM character and ends with a ZERO WIDTH NO-BREAK SPACE. No trailing newline anywhere.

Apparently the BOM character is displayed without being named in the verbose output. The markers also appear to have an alignment problem. From what I've seen, the latter phenomenon does not occur with a much shorter string.

ccdiff -v2 -U -m  TEST_A.txt TEST_B.txt

1,1c1,1
< ↱↰||2.0|A|1|2009-08-16T16:30:00|1|2009|ingreso|Una sola exhibición|350.00|5.25|397.25|ISP900909Q88|Industrias del Sur Poniente, S.A. de C.V.|Alvaro Obregón|37|3|Col. Roma Norte|México|Cuauhtémoc|Distrito Federal|México|06700|Pino Suarez|23|Centro|Monterrey|Monterrey|Nuevo Léon|México|95460|CAUR390312S87|Rosa María Calderón Uriegas|Topochico|52|Jardines del Valle|Monterrey|Monterrey|Nuevo León|México|95465|10|Caja|Vasos decorados|20.00|200|1|pieza|Charola metálica|150.00|150|IVA|15.00|52.50||<  ▼                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               <  -- verbose : ZERO WIDTH NO-BREAK SPACE
---
> ||2.0|A|1|2009-08-16T16:30:00|1|2009|ingreso|Una sola exhibición|350.00|5.25|397.25|ISP900909Q88|Industrias del Sur Poniente, S.A. de C.V.|Alvaro Obregón|37|3|Col. Roma Norte|México|Cuauhtémoc|Distrito Federal|México|06700|Pino Suarez|23|Centro|Monterrey|Monterrey|Nuevo Léon|México|95460|CAUR390312S87|Rosa María Calderón Uriegas|Topochico|52|Jardines del Valle|Monterrey|Monterrey|Nuevo León|México|95465|10|Caja|Vasos decorados|20.00|200|1|pieza|Charola metálica|150.00|150|IVA|15.00|52.50||

Sample files attached below.

TEST_B.txt TEST_A.txt

Tux commented 6 years ago

I've just fixed the alignment for lines without a trailing newline (EOF) and pushed the fix: https://github.com/Tux/App-ccdiff/commit/3def309330a1ac780fd9f625bc75a87d7ff530b9 However, as I am currently restructuring the file to be able to write tests, I cannot guarantee that I did not break anything.

Tux commented 6 years ago

The BOM (Byte Order Mark) and ZERO WIDTH NO-BREAK SPACE share the same Unicode codepoint, U+FEFF (http://unicode.org/faq/utf_bom.html#bom1). If it is the first character of a stream, it is a BYTE ORDER MARK; otherwise it is the ZERO WIDTH NO-BREAK SPACE. I learned something today :) As I use the charnames functionality, I use the name that perl gives me for the codepoint, which is that long name.
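To illustrate the point about one codepoint carrying two roles, here is a minimal sketch in Python rather than Perl (ccdiff itself uses Perl's charnames; this is not its code). The helper `label_feff` and the sample string are hypothetical and only mirror the rule described above: the character database returns a single name for the codepoint, so any BOM label has to be derived from position.

```python
import unicodedata

FEFF = "\ufeff"  # the single codepoint shared by BOM and ZERO WIDTH NO-BREAK SPACE


def label_feff(text):
    """Return (index, label) pairs for every U+FEFF found in text.

    Hypothetical helper: the label is chosen by position, because the
    Unicode name lookup alone cannot tell the two roles apart.
    """
    labels = []
    for index, ch in enumerate(text):
        if ch != FEFF:
            continue
        if index == 0:
            # First character of the stream: treat it as a byte order mark.
            labels.append((index, "BYTE ORDER MARK"))
        else:
            # Anywhere else: use the name the character database reports.
            labels.append((index, unicodedata.name(ch)))
    return labels


# Shortened, made-up stand-in for the line in TEST_A.txt.
sample = FEFF + "||2.0|A|" + FEFF

print(unicodedata.name(FEFF))
print(label_feff(sample))
```

The name lookup reports the long name regardless of where the character sits, which matches what the verbose output shows for the trailing character.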

trantor commented 6 years ago

Now that's... bizarre. Anyway, you've got both U+00A0 and U+FEFF in the same line. At this point, wouldn't it be helpful to repeat ZERO WIDTH NO-BREAK SPACE since there are two occurrences and/or specify the codepoint as well (maybe as a -v3 output)?

Tux commented 6 years ago

It is not bizarre if you think about it: this enables concatenating streams without breaking them. If I fetch both files, I see no A0 in either. Can you put a .tgz or .zip with both files here?

trantor commented 6 years ago

On a practical level you are right. I just find context-dependent interpretation of a character something fraught with potential misunderstandings. And you are right about the absence of U+00A0. I got confused with a few sample files I tried out and actually got the bug description wrong. Scratch that. Silly me. Consider printing the codepoint for a possible v3 as a feature request. I'll edit the original bug description to correct my blunder.

Tux commented 6 years ago

No harm done. -v3 now shows codepoints (under -U only). Pull to verify.
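For readers who want to see what pairing a codepoint with its name can look like, the following is a rough Python sketch. It is not ccdiff's implementation and does not reproduce its -v3 output format; the function names `describe` and `invisible_chars`, and the choice of which characters count as "invisible", are assumptions made for illustration.

```python
import unicodedata


def describe(ch):
    """Format one character as 'U+XXXX NAME' (format chosen for this sketch)."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name}"


def invisible_chars(line):
    """List descriptions of hard-to-see characters in a line.

    Assumption for this sketch: 'hard to see' means general category Cf
    (format characters such as U+FEFF or U+200B) or a space separator
    (category Zs) other than the ordinary ASCII space.
    """
    found = []
    for ch in line:
        category = unicodedata.category(ch)
        if category == "Cf" or (category == "Zs" and ch != " "):
            found.append(describe(ch))
    return found


# Made-up line containing a no-break space and a trailing U+FEFF.
line = "350.00\u00a0MXN|52.50||\ufeff"
for entry in invisible_chars(line):
    print(entry)
```

Showing the codepoint next to the name removes the ambiguity discussed earlier in the thread, since two different invisible characters can otherwise look identical in a terminal.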