cytoscape / py4cytoscape

Python library for calling Cytoscape Automation via CyREST
https://Py4Cytoscape.readthedocs.io
Other
69 stars 15 forks source link

List with float throws exception when sending pandas datatable #107

Closed ObT1337 closed 1 year ago

ObT1337 commented 1 year ago

These lines in "tables.py" make some trouble when sending a pandas dataframe:

483 for col in data_subset.columns:
484        col_val = [','.join(val) if isinstance(val, list) else val for val in data_subset[col]]
485        data_subset[col] = col_val

Probably cause to "NA" floats in the column. Cast to string as a suggestion how to fix this:

483 for col in data_subset.columns:
484        col_val = [','.join(str(val)) if isinstance(val, list) else val for val in data_subset[col]]
485        data_subset[col] = col_val
bdemchak commented 1 year ago

Thanks for reporting this ... I'll take a look at the test suite and code today. I suspect you're right on all counts. If so, I'll be releasing a new version soon. Stay tuned ...

bdemchak commented 1 year ago

Hi ... Would you mind providing the exact case that produces this failure?

I'm confused by your proposed solution as compared to the actual code. Consider the original code:

col_val = [','.join(val) if isinstance(val, list) else val for val in data_subset[col]]

The .join() is executed only if the 'val' is a list ... meaning [] or [1] or [1,2] etc. It would not be executed if 'val' is NA. Executing .join(str(val)) when val is a list (e.g., [1,2]) would produce wrong results (e.g., '[,1,,, ,2,]').

If 'val' is NA, then the .join() would never be called ... the NA would be used directly.

There are other cases, but none seem to fit your report.

There is a lot to say about py4cytoscape/RCy and Cytoscape and list types, but I'll say more once I know more about your failure case.

Thanks!

ObT1337 commented 1 year ago

Hey Barry,

Thank you for your prompt response. I agree with you; my solution doesn't really make sense. It was a quick and hacky fix that temporarily helped me with my problem.

I'm still unsure about the root cause of the issue, but it seems to occur when my data frame contains a row with entries as lists and lists themselves were filled with floats. This resulted in the following TypeError:

TypeError: sequence item 0: expected str instance, float found

However, now that I have removed my fix, I'm unable to reproduce this error...

I apologize for any inconvenience caused. It may have been an artifact or an unusual occurrence.

bdemchak commented 1 year ago

Thanks for getting back to me ...

Your latest report makes perfect sense. I can see that the current code works only for type int and generates a type error for other types.

I have a fix that doesn't cause any regressions, and I'll publish it later today ... after adding more specific tests.

Please stand by.

Thanks!

bdemchak commented 1 year ago

Hi --

I think I've fixed the immediate problem in branch 1.8.0 ... are you able to get this branch and try it out?

Note that there are some things about Cytoscape and lists ... I've written them up ... please let me know if there's anything you don't understand or causes you real problems:

Generally, Cytoscape doesn't provide a rich interface for importing or exporting list columns, and there are a number of ways to improperly transfer list columns between Cytoscape and py4cytoscape/R.

For example, while a dataframe has a concept of a 'list' column, the Cytoscape concept is 'list of ', where could be Integer, String, Float, Boolean, etc. Worse ... since a dataframe 'list' can contain base datatypes in any combination (e.g., [1,2,'blue'] or [5.0,'red',true]), there's no good way to infer the Cytoscape type that matches a dataframe 'list'.

So, py4cytoscape does its best by converting a dataframe 'list' into a comma-separated string, and allowing Cytoscape to infer the column type (which will often be String).

Another issue: for the comma-separated string to form properly, the dataframe 'list' column must contain only lists of string values (e.g., ['1','2','3']), or an exception occurs. This is bad, and py4cytoscape should be fixed so it accepts all scalar list values (e.g., [1,2,3]). This fix should allow for list values containing np.nan, too. My fix in branch 1.8.0 today does that.

Another issue: when transferring a network from Cytoscape to a dataframe, Cytoscape does return a list for a list column. BUT, a list column cannot contain any empty values. If it does, Cytoscape transfers only non-empty values without indicating which SUIDs they belong to. The result is impossible for py4cytoscape to interpret, and the entire list column is omitted from the resulting dataframe. For Notebook users, a warning is generated to the console, but for non-Notebook users, the omission is silent. It's unclear how this should really be handled.

The ultimate consequence is that a round trip between p4cytoscape and Cytoscape can produce unexpected results when columns are lists and there are missing lists.

Given Cytoscape's ambiguous and sometimes lossy handling of lists, I recommend care in when combining dataframe lists and Cytoscape list types.

bdemchak commented 1 year ago

@AlexanderPico @yihangx You might be interested in this thread, especially my post today. Please comment if you can.