SynBioDex / pySBOL3

Native python implementation of SBOL 3.0 specification
MIT License
37 stars 16 forks source link

Windows writes SORTED_NTRIPLES with CRLF #398

Closed jakebeal closed 2 years ago

jakebeal commented 2 years ago

@kenza12 has created a pull request on SBOL-factory patching a test where Windows is writing files with a CRLF rather than just a newline: https://github.com/SynBioDex/sbol_factory/pull/70

It seems to me that this patch is actually indicating a bug in pySBOL3, which shouldn't be writing CRLF in the first place.

jakebeal commented 2 years ago

I think this is the spot: https://github.com/SynBioDex/pySBOL3/blob/5ae15b4f171991b3cd216b7548ffde7902f41c12/sbol3/document.py#L464

And following the template in @kenza12's pull request, I think we can fix by adding newline='\n' to the arguments.

Note: I am considering this a bug, rather than an appropriate difference between Windows and Mac/Linux, because it will generates differences in version control where there should be none.

jakebeal commented 2 years ago

Now I am confused, because my attempt to replicate this failure is not working: the Windows test that I set up to fail if CRLF is being written is instead passing: https://github.com/SynBioDex/pySBOL3/runs/6369161798?check_suite_focus=true

@kenza12: can you reproduce a minimal example of CRLF being written in your Windows environment?

tcmitchell commented 2 years ago

@jakebeal I think this is working for you because you have set up your test to use universal newlines. Read the documentation for the newline option to open.

The test in sbol_factory uses filecmp.cmp which seems to care about the size of the file in bytes, which would be different when there are different line endings.

On to this issue:

Why are different line endings on different platforms considered a bug? My read of the spec is that a valid N-Triples file can have any combination of CR and LF line endings. I think we should focus on the files carrying the same semantic payload, and not on the details of what line endings appear.

I think the sbol_factory test cases should not use filecmp.cmp for checking the results of round tripping. pySBOL3 loads the RDF files in RDFlib and diffs the triples that got loaded, rather than worrying about a character by character difference.

jakebeal commented 2 years ago

My key question is this: what will git do? If universal newlines means that git won't see a change in newlines, then I'm fine with it.

kenza12 commented 2 years ago

Hi, Thank you for your feedback. I didn't encounter any bug with pysbol3 tests as you can see here : https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=491021&view=logs&jobId=240f1fee-52bc-5498-a14a-8361bde76ba0&j=240f1fee-52bc-5498-a14a-8361bde76ba0&t=06d62fd2-ab49-5add-9ee5-83b6882d7d2b

All the tests passed, even for test_write function in test_Document.py : https://github.com/SynBioDex/pySBOL3/blob/5ae15b4f171991b3cd216b7548ffde7902f41c12/test/test_document.py.

The bug in sbol_factory tests is appearing because of this file comparison where the assertion failed in https://github.com/SynBioDex/sbol_factory/blob/master/test/test_ontology.py as @tcmitchell mentionned. assert filecmp.cmp(temp_name, original_file), "Files are not identical"

I think you can either change the comparison way or specify the newline.

tcmitchell commented 2 years ago

@jakebeal said:

My key question is this: what will git do? If universal newlines means that git won't see a change in newlines, then I'm fine with it.

I don't understand what git has to do with it. git would definitely see a change in newlines if git were involved. That's the nature of things on different platforms. But git is not involved. We shouldn't be worried about generating byte-for-byte identical files, we should be worried about generating files that conform to their specification and that carry the same content.

There are flags in git to handle line endings across platforms, and GitHub has a good page about it. To me that's orthogonal to what this issue is about.

In short, I believe that N-Triples files, whether sorted or not, should have CRLF on Windows.

jakebeal commented 2 years ago

Per discussion, this isn't actually a bug, but is the correct behavior. I do think that we should document the potential challenge and solution in an easy place to find, and to that end have added a documentation issue on SBOL-utilities.