Closed jairideout closed 10 years ago
Should follow this guide for subclassing numpy arrays.
After doing a bit of research, I have some design and implementation decisions I'd like to get input on.
There are a number of routes we could take:

1. Subclass numpy.ndarray and have an extra SampleIds member, similar to biom.table.Table. Pros: can be used like any other numpy array, and will have great performance. Cons: I'm not sure how to handle things like slicing or view casting, since these could end up with a non-square matrix (in the slicing case) or a DistanceMatrix without sample IDs (in the view casting case). We could have explicit checks that disallow view casting, slicing, and any other operations that result in a non-square matrix, or maybe return an ndarray from these ops instead of a DistanceMatrix?
2. Subclass numpy.recarray (or create a structured array), which will allow us to attach sample IDs to the columns. We'd then be able to access columns by sample ID and rows by index. There's still the problem of slicing creating an invalid distance matrix, and I'm not sure how view casting will work. As far as I know, there isn't a built-in way to attach row and column labels.
3. Create a class with SampleIds and Data members, where Data is a numpy array. Users of the class can access the numpy array via the Data member for fast array ops as necessary (or for passing a pure numpy array to some pycogent function like PCoA, Mantel test, etc.). This would be closest to what qiime.util.DistanceMatrix is now, and similar to our use of the (labels, dm) tuple. This is also the simplest approach to implement and is similar to biom.table.Table. The con is that if you want the actual numpy array, you have to access it via my_dm.Data. This doesn't seem like a huge deal to me, though.
4. Use a pandas DataFrame, which allows for data tables with row and column labels. This would require a new QIIME dependency and requires numpy 1.6.1+. I don't really like this approach...

I'm leaning toward the third option because it is simple and will easily integrate with the QIIME codebase. However, I think the first option may also be viable, though I'm not sure how best to handle view casting and "new from template" (e.g. slicing) operations. Thoughts?
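To make the third option concrete, here is a rough sketch of what such a container could look like. This is only an illustration, not the class that exists in QIIME or bipy: the validation rules and the ID-pair lookup are my own assumptions, and only the SampleIds and Data member names come from the description above.

```python
import numpy as np


class DistanceMatrix:
    """Hypothetical container: a square ndarray plus matching sample IDs."""

    def __init__(self, data, sample_ids):
        data = np.asarray(data, dtype=float)
        sample_ids = tuple(sample_ids)
        # Assumed validation rules; the real class may check more or less.
        if data.ndim != 2 or data.shape[0] != data.shape[1]:
            raise ValueError("data must be a square 2-D matrix")
        if len(sample_ids) != data.shape[0]:
            raise ValueError("need exactly one sample ID per row/column")
        if len(set(sample_ids)) != len(sample_ids):
            raise ValueError("sample IDs must be unique")
        self.Data = data
        self.SampleIds = sample_ids

    def __getitem__(self, pair):
        """Look up one distance by a (sample_id, sample_id) pair."""
        i = self.SampleIds.index(pair[0])
        j = self.SampleIds.index(pair[1])
        return self.Data[i, j]


dm = DistanceMatrix([[0.0, 0.5], [0.5, 0.0]], ["s1", "s2"])
raw = dm.Data                        # plain ndarray for fast array ops
labels_dm = (dm.SampleIds, dm.Data)  # the familiar (labels, dm) tuple
```

Because the array is never handed out as a DistanceMatrix, slicing dm.Data just produces an ordinary ndarray, so the "invalid distance matrix" question from the first two options doesn't come up.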
The new DistanceMatrix class was just merged into bipy (https://github.com/biocore/bipy/pull/42), so once bipy becomes a QIIME dependency, we can start updating the code to use it for 1.9.0.
As part of QIIME's stable API, qiime.util.DistanceMatrix needs to be replaced with something better. It should subclass a numpy array (need to determine which one) in order to easily replace the existing DistanceMatrix class and the more common (labels, dm) tuple used throughout QIIME. This will be a dense matrix.