chanzuckerberg / ExpressionMatrix2

Software for exploration of gene expression data from single-cell RNA sequencing.
MIT License
28 stars 4 forks source link

Need to do a UX review of the API #2

Closed bkmartinjr closed 6 years ago

mckinsel commented 6 years ago

Just to get the ball rolling, here are a couple quick comments. @paoloczi, any thoughts?

  1. Exposing the explicit container types is a little weird in python. There's no real notion of, say, a UintFloatPairListList in the language. So instead of sticking c++ vectors into python, it's better from a usability perspective to work with plain python lists, do the PyObject casts, and fail if it doesn't work. Similarly, python doesn't have c++ pairs, but it does of tuples of length 2.

  2. The methods with incremented names like findSimilarPairs0, findSimilarPairs1, findSimilarPairs2 are a little tough to figure out.

  3. Most of the methods that do any heavy lifting don't return anything. You would expect something like createCellGraph to return a CellGraph or findSimilarPairs to return some similar pairs. Keeping everything within an opaque object means you have to externally manage a catalog of cell set names, gene set names, similar pairs names, cell graph names, and cluster graph names.

  4. There are some common tasks that are hard to do. For example, when I've used this I wanted to iterate over all the cells and get each cell's cluster assignment from a particular clustering. Ideally this could be something like

    [(cell, exp_mat.get_cluster_assignment(cell, "my_clustering")) for cell in exp_mat.cells]

    But it's not clear how to do that with this API.

paoloczi commented 6 years ago

Exposing the explicit container types is a little weird in python. There's no real notion of, say, a UintFloatPairListList in the language. So instead of sticking c++ vectors into python, it's better from a usability perspective to work with plain python lists, do the PyObject casts, and fail if it doesn't work. Similarly, python doesn't have c++ pairs, but it does of tuples of length 2.

I want to minimize the extent to which Python entities are propagated into C++ code. In other words, I want the majority of my C++ code to know nothing about the Python API. If I don't do that, my C++ code will become hard to develop and maintain. This means that the Python code has to work with C++ objects. If I keep it this way, it will also be possible to interface the same C++ code with other environments (for example, R, or other C++ code -for example, the current .so not only works as a Python package, but can also be dynamically linked, unmodified, into a C++ executable that knows nothing about Python).

My container types, however (types with name ending in List) have semantics very similar to Python lists, so they can be used in a similar way, including the standard Python iteration patterns.

The methods with incremented names like findSimilarPairs0, findSimilarPairs1, findSimilarPairs2 are a little tough to figure out.

This is were my current development based on LSH is, so the names reflect the fact that in this area I am not sure where things are going to end up.

Most of the methods that do any heavy lifting don't return anything. You would expect something like createCellGraph to return a CellGraph or findSimilarPairs to return some similar pairs. Keeping everything within an opaque object means you have to externally manage a catalog of cell set names, gene set names, similar pairs names, cell graph names, and cluster graph names.

There are various reasons for this. Most of those objects are persistent (they live in memory-mapped files), and are not copiable, so it would be hard to make them accessible in Python. Others (for example, cell graphs) have a very rich API that would be hard and unnecessary to export to Python, so they are best exposed as completely opaque. And in the end, in Python, keeping a dictionary of object names is not too different from using regular Python variables, which are themselves entries in a dictionary (right? or not?).

There are some common tasks that are hard to do. For example, when I've used this I wanted to iterate over all the cells and get each cell's cluster assignment from a particular clustering. Ideally this could be something like

[(cell, exp_mat.get_cluster_assignment(cell, "my_clustering")) for cell in exp_mat.cells]

The containers I expose (type name ends in List) support this type of iteration. However, making a low level call into C++ at every iteration of the loop would be very inefficient.

In summary, it all boils down to the fact that is C++ code exposed to Python, not a Python library implemented in C++. The purpose of the Python API is to give high-level commands to the C++ code, leaving most of the logic in C++. Think of it as scripting, not really programming. Changing this would require a rewrite from scratch, and would make the C++ code usable exclusively from Python.

Paolo

On Tue, Oct 10, 2017 at 11:03 AM, Marcus Kinsella notifications@github.com wrote:

Just to get the ball rolling, here are a couple quick comments. @paoloczi https://github.com/paoloczi, any thoughts?

1.

Exposing the explicit container types is a little weird in python. There's no real notion of, say, a UintFloatPairListList in the language. So instead of sticking c++ vectors into python, it's better from a usability perspective to work with plain python lists, do the PyObject casts, and fail if it doesn't work. Similarly, python doesn't have c++ pairs, but it does of tuples of length 2. 2.

The methods with incremented names like findSimilarPairs0, findSimilarPairs1, findSimilarPairs2 are a little tough to figure out. 3.

Most of the methods that do any heavy lifting don't return anything. You would expect something like createCellGraph to return a CellGraph or findSimilarPairs to return some similar pairs. Keeping everything within an opaque object means you have to externally manage a catalog of cell set names, gene set names, similar pairs names, cell graph names, and cluster graph names. 4.

There are some common tasks that are hard to do. For example, when I've used this I wanted to iterate over all the cells and get each cell's cluster assignment from a particular clustering. Ideally this could be something like

[(cell, exp_mat.get_cluster_assignment(cell, "my_clustering")) for cell in exp_mat.cells]

But it's not clear how to do that with this API.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/chanzuckerberg/ExpressionMatrix2/issues/2#issuecomment-335558145, or mute the thread https://github.com/notifications/unsubscribe-auth/AYNxTdu6ETr8LBXHLyCNRNXZl3NFMVA3ks5sq7FjgaJpZM4P0Qkt .

freeman-lab commented 6 years ago

Great discussion! Chatted a bit with @mckinsel offline, here are some thoughts...

It seems like there's potential value in a friendly and functional Pythonic interface, with clean inputs and outputs for core operations, designed for Python end users. To that end I like all of @mckinsel 's API suggestions, and might have a few others. But also totally get @paoloczi that we don't want to completely refactor the logic of the C++ library! And we want to keep it easily usable from C++.

So, I think somewhere we need some extra, probably messy glue code, to translate from the C++ semantics to something more Pythonic. We could do that in a separate codebase, so it doesn't muck up what's here. In other words, this repo just exposes a clean C++ API. The Python translation code becomes its own little project, as would an R translation. The only downside is the time to write and maintain that glue code, but it could be worthwhile it if we really want to enable Python users.

The only caveat is whether the memory management underlying the ExpressionMatrix object makes certain semantics in Python strictly impossible — this is the the "hard to make them accessible" part of @paoloczi 's response :) But @mckinsel suggested he can play with a small test case to see what's possible there, which seems super useful to me!

paoloczi commented 6 years ago

For experimentation of this kind, test run ExpressionMatrix2/tests/ToyTest1 may be useful. It has 3 cells and 3 genes and is documented in the README file for the repository.

On Tue, Oct 10, 2017 at 2:03 PM, Jeremy Freeman notifications@github.com wrote:

Great discussion! Chatted a bit with @mckinsel https://github.com/mckinsel offline, here are some thoughts...

It seems like there's potential value in a friendly and functional Pythonic interface, with clean inputs and outputs for core operations, designed for Python end users. To that end I like all of @mckinsel https://github.com/mckinsel 's API suggestions, and might have a few others. But also totally get @paoloczi https://github.com/paoloczi that we don't want to completely refactor the logic of the C++ library! And we want to keep it easily usable from C++.

So, I think somewhere we need some extra, probably messy glue code, to translate from the C++ semantics to something more Pythonic. We could do that in a separate codebase, so it doesn't muck up what's here. In other words, this repo just exposes a clean C++ API. The Python translation code becomes its own little project, as would an R translation. The only downside is the time to write and maintain that glue code, but it could be worthwhile it if we really want to enable Python users.

The only caveat is whether the memory management underlying the ExpressionMatrix object makes certain semantics in Python strictly impossible — this is the the "hard to make them accessible" part of @paoloczi https://github.com/paoloczi 's response :) But @mckinsel https://github.com/mckinsel suggested he can play with a small test case to see what's possible there, which seems super useful to me!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/chanzuckerberg/ExpressionMatrix2/issues/2#issuecomment-335607458, or mute the thread https://github.com/notifications/unsubscribe-auth/AYNxTeswH2W8_NDDWq-8MGYRYTHEfiJ5ks5sq9u_gaJpZM4P0Qkt .

paoloczi commented 6 years ago

Some thoughts regarding a more natural Python API, triggered by in part by work in this repo https://github.com/chanzuckerberg/pyEM2 and in part by some discussions today:

paoloczi commented 6 years ago

Regarding keyword arguments, there seems to be consensus that they are a good thing to do.

paoloczi commented 6 years ago

The pyEM2 repository fills some of these needs. There are good reasons to keep both a "lean and mean" lower level API like the one provided by ExpressionMatrix2 and a higher level, more convenient and "Pythonic" API like the one provided by pyEM2. To avoid confusing the users we concluded that it is best to keep the two repositories and their documentations separate.