cmap / cmapPy

Assorted tools for interacting with .gct, .gctx files and other Connectivity Map (Broad Institute) data/tools
https://clue.io/cmapPy/index.html
BSD 3-Clause "New" or "Revised" License

pandasGEXpress.write_gctx id in gctx is null #55

ghost closed 5 years ago

ghost commented 5 years ago

In a Python 3 environment (not Python 2), if I create a GCToo object where e.g. the index of data_df holds integers (not strings or floats) and then call write_gctx, it produces a gctx file where all of the row ID entries (/0/META/ROW/id) are the empty string ''.

The problem appears to be this line of code: https://github.com/cmap/cmapPy/blob/59d833b64fd2c3a494cdf67fe1eb11fc8008bf76/cmapPy/pandasGEXpress/write_gctx.py#L164

numpy.string_(x) returns b'' when x is an integer. Note that it works fine if x is a str or a float.

For example, in Python 3 numpy.string_(3) returns b'', whereas in Python 2 it returns '3'.

I've submitted an issue to numpy about this behavior (https://github.com/numpy/numpy/issues/13427); it might make sense to wait to hear back from them before taking any action here.

ksunden commented 5 years ago

Note that it is not None; it is a length-n bytes array (it prints as an empty bytes string, but if you check its length you will find it to be n).
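The same distinction shows up with the built-in bytes type, which numpy.string_ subclasses on Python 3. A minimal sketch, assuming numpy.string_ passes an integer argument to the bytes constructor in the same way (this is an assumption and is not checked against any particular NumPy version):

```python
# Python 3: bytes(n) builds n zero bytes; it does not encode the digits of n.
row_id = 3
as_bytes = bytes(row_id)

print(len(as_bytes))                # 3
print(as_bytes == b"\x00\x00\x00")  # True

# Converting to str first keeps the digits of the ID.
as_text = str(row_id).encode("ascii")
print(as_text)                      # b'3'
```

Anything that strips trailing null bytes when displaying or storing such a value would make it look like an empty string.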

ghost commented 5 years ago

Good point, edited to correct

saksham219 commented 5 years ago

A workaround could be to convert the integers to floats before calling write_gctx. We could also explicitly check for integer row indexes inside write_gctx and convert them to floats before writing.

ghost commented 5 years ago

For my workaround I converted int --> str to avoid any ambiguity.

As for the fix, I'm thinking that line of code could become [numpy.string_(str(x)) for x in metadata_df.index], since it sounds like numpy is not going to fix it.
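A rough sketch of that idea using only the standard library. Here ids_to_bytes is a hypothetical helper (not part of cmapPy), and str(x).encode(...) stands in for numpy.string_(str(x)):

```python
def ids_to_bytes(index):
    """Convert each ID to text first, then to bytes, so integer IDs keep their digits."""
    return [str(x).encode("utf-8") for x in index]


# Hypothetical index mixing the three ID types discussed in this thread.
mixed_index = [3, "gene_a", 2.5]
encoded = ids_to_bytes(mixed_index)
print(encoded)  # [b'3', b'gene_a', b'2.5']
```

Going through str first treats int, float and str IDs the same way, so no per-type check is needed.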

oena commented 5 years ago

Fixed in PR#56 thanks to @saksham219!