`--check` doesn't work with `--end-of-line=crlf`

Cube707 commented 2 years ago

Describe the bug

context running mdformat --check --end-of-line=crlf on a correctly formated markdonw file result in error.

expectation I expected mdformat to succede and nothing to happen

bug But instead an error is rased that the file is not formated:

$ mdformat --check --end-of-line=crlf test.md
Error: File "D:\Downloads\temp\mdformat\test.md" is not formatted.

related code

I belive the problem thems from here:

https://github.com/executablebooks/mdformat/blob/24f9d25960d28efc75cbfa051881bac5fc46a842/src/mdformat/_cli.py#L85-L87

debugging reveald the content of formatted_str and original_str to have different lineendings:

original_str: '# Test file\r\n\r\nHello world\r\n'
formatted_str: '# Test file\n\nHello world\n'

I assume that the formating process internally does a universal newline conversion, resulting in all newline types beeing converted to \n

Reproduce the bug

create a well formated mardkown file test.md with dos-style line endings (\r\n):
```
# Test file

Hello world
```
run mdformat on it with the --check and --end-of-line=crlf options:
```
mdformat --check --end-of-line=crlf test.md
```
get error

List your environment

OS: windows 10 Python: v3.10.6 mdformat: 0.7.15

welcome[bot] commented 2 years ago

Thanks for opening your first issue here! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out EBP's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
If your issue is a feature request, others may react to it, to raise its prominence (see Feature Voting).
Welcome to the EBP community! :tada:

hukkin commented 2 years ago

Thanks for the issue!

Yeah this seems to need better tests and maybe rethink the whole if opts["check"]: block.

Cube707 commented 2 years ago

I belive this could benifit from rethingking the whole line-endings proccedure. We should probaly do something like this:

read the markdown-file with no lineending conversion applyed (open(file, "r", newline=""))
figure out and store the first newline character
apply universal newline conversion by wrapping the text in a second layer of StringIO
pass the content on, now that it matches python conventions and should be what markdown_it expects and returns
write out the formated file, providing the newline character as per the options

proof of concept:

from io import StringIO

with open("test.md", "r", newline="") as fp:
    firstline = fp.readline()
    if firstline[-2:]=="\r\n":
        newlinechar = "\r\n"
    elif firstline[-1:]=="\r":
        newlinechar = "\r"
    else:
        newlinechar = "\n"

    # show the unconverted file contents:
    fp.seek(0)
    print(fp.read().encode())

    fp.seek(0)
    mdbuffer = StringIO(fp.read(), newline=None)  # apply universal newline conversion

original_str = mdbuffer.read()  # we dont actually need to save the buffer, but maybe its usefull??

# proof that we have the correct newline char and how the string looks now
print(f"newline={repr(newlinechar)}")
print(original_str.encode())

# ---
# here we would do the actuall formating (so basicly the existing code)
# ---

# write out the file, applying the original newlines (or change per options):
with open(1, "w", newline=newlinechar) as stdout:
    stdout.write(original_str)

results:

Unix-Newlines: ```console b'# Test file\n\nHello world\n' newline='\n' b'# Test file\n\nHello world\n' # Test file Hello world ```

DOS-Newlines: ```console b'# Test file\r\n\r\nHello world\r\n' newline='\r\n' b'# Test file\n\nHello world\n' # Test file Hello world ```

Mac-Newlines: (note how the output gets overriden because the `\r` is making the cursor return) ```console b'# Test file\r\rHello world\r' newline='\r' b'# Test file\n\nHello world\n' Hello world ```

this would however require a lot of rewriting and changes to the existing code, but could assist with the unification of the api and cli codebases

Cube707 commented 2 years ago

after thinking a bit more about this I realised that we require the ability to detect when files where changed by the newline-conversion procces (for example on mixed newlines or when a option overrides the used newline character). So this version allowes for that:

from io import StringIO

with open("test.md", "r", newline="") as fp:
    firstline = fp.readline()
    if firstline[-2:]=="\r\n":
        newlinechar = "\r\n"
    elif firstline[-1:]=="\r":
        newlinechar = "\r"
    else:
        newlinechar = "\n"

    fp.seek(0)
    original_filecontent = fp.read()

# show the unconverted file contents:
print(original_filecontent.encode())

original_str = StringIO(original_filecontent, newline=None).read()  # apply universal newline conversion

# proof that we have the correct newline char and how the string looks now
print(f"newline={repr(newlinechar)}")
print(original_str.encode())

# ---
# here we would do the actuall formating (so basicly the existing code)
formated_str = original_str
# ---

formated_filecontents = StringIO(formated_str, newline=newlinechar).read()
print(formated_filecontents.encode())

print( formated_filecontents == original_filecontent )

Cube707 commented 2 years ago

and there is also the IncrementalNewlineDecoder class, which is what is used internally on universal newline mode, so it might be usefull to use and allow for better performance

hukkin commented 2 years ago

It's not that clear to me what benefit the use of StringIO etc provides for us here. I think https://github.com/executablebooks/mdformat/pull/348 is sufficient to fix this issue.

Cube707 commented 2 years ago

The benifit would be to rely on python existing universal newline magick instead of handeling it with search and replace opterations ourself (and mostlikle missing edgecases in the procces)

We could handle newline stuff only in the two places they are actually relevant, when reading and writing a file, and keep the rest of the codebase free of this problem and rely on the python convention that all strings use \n.

Also this would make it easyer to handle stdin, as the same logic could be applyed. Its just another buffer after all

Cube707 commented 2 years ago

But jea. #348 is a good fix for now. This is more of an idea for if the arcetecture gets a big generall overhaul

executablebooks / mdformat