cmap / cmapPy

Assorted tools for interacting with .gct, .gctx files and other Connectivity Map (Broad Institute) data/tools
https://clue.io/cmapPy/index.html
BSD 3-Clause "New" or "Revised" License

Bug .GCT written by cmapPy on Windows have inconsistent line endings #77

Open KarlClauser opened 2 years ago

KarlClauser commented 2 years ago

Hi Lev,

Bug: .GCT files written with cmapPy on Windows show alternating blank lines after the top 3 lines when opened in Excel, though they look fine in the code editor Spyder v5.12.3.

Fix: The line below writes the first 2 lines of a .GCT file and would otherwise default to the OS line_terminator of \r\n, which conflicts with all the other lines, which are terminated by \n.

The inconsistent line endings probably trick Excel's automatic line-ending detection.

C:\ProgramData\Anaconda3\Lib\site-packages\cmapPy\pandasGEXpress\write_gct.py #line 102

# Write top_half_df to file

#top_half_df.to_csv(f, header=False, index=False, sep="\t")
top_half_df.to_csv(f, header=False, index=False, sep="\t", line_terminator='\n')
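To check whether a given file really has mixed line endings, a minimal sketch like the following could be used. It relies only on the standard library; the helper names `count_line_endings` and `has_mixed_line_endings` and the demo file contents are made up for illustration and are not part of cmapPy.

```python
import os
import tempfile


def count_line_endings(path):
    """Return (crlf, bare_lf, bare_cr) counts for the file at `path`."""
    with open(path, "rb") as handle:  # binary mode: no newline translation
        data = handle.read()
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf  # "\n" not preceded by "\r"
    bare_cr = data.count(b"\r") - crlf  # "\r" not followed by "\n"
    return crlf, bare_lf, bare_cr


def has_mixed_line_endings(path):
    """True if more than one kind of line ending appears in the file."""
    return sum(1 for n in count_line_endings(path) if n > 0) > 1


# Hypothetical demo file with deliberately mixed endings:
# two lines end in "\n", two lines end in "\r\n".
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gct")
with open(demo_path, "wb") as handle:
    handle.write(b"#1.3\n2\t2\t0\t0\nid\tA\tB\r\nr1\t1\t2\r\n")

print(count_line_endings(demo_path))
print(has_mixed_line_endings(demo_path))
```

Reading in binary mode matters here, because text mode would silently normalize the endings before they could be counted.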

Please incorporate into next version. Screenshots attached.

Thanks,

--Karl

cmapPybug_inconsistentLineEndings.docx

levlitichev commented 2 years ago

Good catch. I think the better change would be to replace \n with os.linesep in write_version_and_dims:

https://github.com/cmap/cmapPy/blob/d1652c3223e49e68e3a71634909342b4a6dbf361/cmapPy/pandasGEXpress/write_gct.py#L64-L65

KarlClauser commented 2 years ago

Won't that lead to \r\n line endings on Windows? We should be striving to get the entire file to use \n line endings. I want a file that is identical no matter whether it is written on Linux or Windows. That is how I encountered this bug.
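To illustrate the difference, here is a minimal sketch using only the standard library. The file names and header values are made up for illustration, and this is not how cmapPy writes files. It writes the same two lines once with a fixed "\n" and once with os.linesep, then compares the raw bytes.

```python
import os
import tempfile

lines = ["#1.3", "2\t2\t0\t0"]  # illustrative header lines only
out_dir = tempfile.mkdtemp()

# newline="" turns off newline translation, so "\n" stays "\n" on every OS.
fixed_path = os.path.join(out_dir, "fixed.txt")
with open(fixed_path, "w", newline="") as handle:
    for line in lines:
        handle.write(line + "\n")

# os.linesep is "\n" on Linux/macOS and "\r\n" on Windows.
native_path = os.path.join(out_dir, "native.txt")
with open(native_path, "w", newline="") as handle:
    for line in lines:
        handle.write(line + os.linesep)

with open(fixed_path, "rb") as handle:
    fixed_bytes = handle.read()
with open(native_path, "rb") as handle:
    native_bytes = handle.read()

print(fixed_bytes)
print(native_bytes == fixed_bytes)  # True on Linux/macOS, False on Windows
```

Only the fixed "\n" variant gives the same bytes on both systems.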

--Karl

levlitichev commented 2 years ago

I understand your point, but I think it would be wise to follow the pandas convention of using system-specific line terminators. I confirmed (on my Mac) that the file looks the same when opened in Excel as long as the terminators are consistent: either all \n or all \r\n.

# newline="" disables newline translation, so the bytes written match the strings exactly
with open("A.txt", "w", newline="") as f:
    f.write("A\n")
    f.write("B\n")

with open("B.txt", "w", newline="") as g:
    g.write("A\r\n")
    g.write("B\r\n")

KarlClauser commented 2 years ago

Hi Lev,

I'm a Windows guy, and nowadays Windows programs can routinely handle '\n' line terminators. The people upstream and downstream of me who read and write .GCT files use Macs. Consequently, it is important to be able to read and write files and get the same result. If you force Windows-generated files to use '\r\n', then I'm going to have to fix every one I produce with cmapPy to get the desired '\n'.

Would you please at least provide an option to specify the line terminator to be written? So long as I have a means to get '\n' then I don't care which you choose as a default.
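As a rough sketch of what such an option might look like, the hypothetical helper below takes a `line_terminator` argument and uses it for every line. The function `write_table` and its parameter name are assumptions for illustration only. It is not part of cmapPy and uses the standard-library csv module rather than pandas.

```python
import csv
import os
import tempfile


def write_table(path, rows, line_terminator="\n"):
    """Write rows as tab-separated text with one terminator throughout.

    Hypothetical helper for illustration; not part of cmapPy.
    """
    # newline="" prevents Python from translating the terminator on Windows.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator=line_terminator)
        writer.writerows(rows)


rows = [["id", "A", "B"], ["r1", 1, 2]]
path = os.path.join(tempfile.mkdtemp(), "table.txt")

write_table(path, rows)  # default: "\n" on every OS
with open(path, "rb") as handle:
    lf_bytes = handle.read()

write_table(path, rows, line_terminator="\r\n")  # opt in to "\r\n"
with open(path, "rb") as handle:
    crlf_bytes = handle.read()

print(lf_bytes)
print(crlf_bytes)
```

With a parameter like this, whichever default is chosen, the caller can still request '\n' explicitly.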

Thanks,

--Karl