PySCeS / pysces

The official PySCeS project source code repository.
https://pysces.github.io
Other
34 stars 10 forks source link

make Scan1 results available as numpy record array #16

Closed jmrohwer closed 8 years ago

jmrohwer commented 8 years ago

This is a small addition to the Scan1() method to make the results of the scan available as a record array. The fields of the record array are created automatically from a concatenated list of [mod.scan_in]+mod.scan_out and populated with the numerical results from mod.scan_res. Existing functionality of mod.scan_res is unchanged. This convenience function will make it very easy to plot selective scan results when exploring models interactively, e.g.

plt.plot(mod.scan.Vf1, mod.scan.J_R1)

It is also easy to do post-operations on scan results, e.g. calculate concentration ratios. Comments?

bgoli commented 8 years ago

Hi Johann

In principle this is a good idea the only minor drawback with the current proposal is that the scan results are stored twice ... which is not so good for very large scans. What do you think of the following type thing that uses a function to generate the new array.

def getScanResAsRecArray(): "check that there are scan results" "create the record array" return "new record array"

Then all a user would have to do is: mod.Scan1(rng) scan_rec_arr = mod.getScanResAsRecArray()

This way we only store the data in one place. b.

jmrohwer commented 8 years ago

Hi Brett

Your point is well taken, and indeed had occurred to me as well. While your solution gets rid of the problem, it is almost as "bad" in terms of required typing as what I've been using in scripts up to now: scan_rec_arr = np.rec.fromrecords(mod.scan_res, names=[mod.scan_in]+mod.scan_out) Your solution is still a lot of typing. I'm thinking of a use case where one is interactively exploring a model by scans and then wants to plot or otherwise work with the data. Most of the times these are smaller models (up to say 20 rxns) with scans of say 100 points, so the size of the scan_res array is not an issue in terms of memory usage.

I agree that it can become an issue for very large scans or a very large number of scan outputs. What about a flag you can change in mod.__settings__? We can discuss whether to create the recarrary by default or not :-) The point is that your proposed solution still creates a duplicate of the results when you call your getScanResAsRecArray() method. And I don't want to call that method after each scan.

Again, this stems from my personal use case. I agree that there needs to be a way to switch it off.

J.

bgoli commented 8 years ago

Hi Johann

My editor has text completion, but point taken :-) The advantage of a get method is that it returns a copy, but doesn't store it as part of the model object itself, setting a flag is a good idea:

boolean _create_scan_resrecarray (default=True) Flag sets whether a duplicate recarray is created or not (current plan)

or

boolean _scan_res_isrecarray (default=False) Flag sets scan_res is either of type ndarray or recarray (and we update PyscesModel to deal with both)

Cheers Brett

On Wed, Nov 25, 2015 at 9:21 AM, Johann Rohwer notifications@github.com wrote:

Hi Brett

Your point is well taken, and indeed had occurred to me as well. While your solution gets rid of the problem, it is almost as "bad" in terms of required typing as what I've been using in scripts up to now: scan_rec_arr = np.rec.fromrecords(mod.scan_res, names=[mod.scan_in]+mod.scan_out) Your solution is still a lot of typing. I'm thinking of a use case where one is interactively exploring a model by scans and then wants to plot or otherwise work with the data. Most of the times these are smaller models (up to say 20 rxns) with scans of say 100 points, so the size of the scan_res array is not an issue in terms of memory usage.

I agree that it can become an issue for very large scans or a very large number of scan outputs. What about a flag you can change in mod.settings? We can discuss whether to create the recarrary by default or not :-) The point is that your proposed solution still creates a duplicate of the results when you call your getScanResAsRecArray() method. And I don't want to call that method after each scan.

Again, this stems from my personal use case. I agree that there needs to be a way to switch it off.

J.

— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159531867.

Brett G. Olivier

jmrohwer commented 8 years ago

If you update PySCeSModel to deal with a recarray I can see no reason as to why one should not turn scan_res into a recarray finish an klaar. However, this will involve some work as Scan1() as well as everything that depends on scan_res will have to be re-worked. Dunno how much work this is and if it is worth the effort. In any event it is trivial to retrieve the underlying ndarray from a recarray using .view('<f8'), but it always creates a new object. Just looking at the code, I think writing Scan1Plot() to use a recarray as input will be much simpler... and scan_res is used nowhere else in PyscesModel. I would therefore vouch for just changing scan_res into a recarray. I'd be willing to code this and update the PR, just need to confirm if you'd agree to the change... J.

exe0cdc commented 8 years ago

I have a suggestion:

A possible solution could be leave scan_res in order to preserve the API and then make scan a property of the model that is "lazy loaded". An implementation could look like this:

In the PysMod init add a self._scan = None

def __init__(*args):
    ....
    self._scan = None

and then set the property up like this:

@property
def scan(self):
    if not self._scan:
        self._scan = np.rec.fromrecords(self.scan_res,
                                        names = [self.scan_in] + self.scan_out)
    return self._scan

Now no extra setting will be needed and scan (actually _scan) will not be populated unless a user actually tries to access it. Even though scan is set up like a method it is accessed like a normal variable field because of the property decorator.

I use this pattern all the time in PySCeSToolbox in order to prevent CPU intensive calculations from taking place during object instantiation (they are only done when a user actually wants to access the data).

EDIT: I realise that duplication will still take place, but at least it won't take place without active user intervention. I can't see any alternative where both array and recarray functionality is provided without duplication occurring somewhere.

bgoli commented 8 years ago

Interesting idea. Running a new scan would simply have to set self._scan = None otherwise it wouldn't update.

On Wed, Nov 25, 2015 at 10:57 AM, Carl Christensen <notifications@github.com

wrote:

I have a suggestion:

A possible solution could be leave scan_res in order to preserve the API and then make scan a property of the model that is "lazy loaded". An implementation could look like this:

In the PysMod init add a self._scan = None

def init(*args): .... self._scan = None

and then set the property up like this:

@propertydef scan(self): if not self._scan: self._scan = np.rec.fromrecords(self.scan_res, names = [self.scan_in] + self.scan_out) return self._scan

Now no extra setting will be needed and scan (actually _scan) will not be populated unless a user actually tries to access it. Even though scan is set up like a method it is accessed like a normal variable field because of the property decorator.

I use this pattern all the time in PySCeSToolbox in order to prevent CPU intensive calculations from taking place during object instantiation (they are only done when a user actually wants to access the data).

— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159555596.

Brett G. Olivier

exe0cdc commented 8 years ago

oops.. forgot about that, but yes, that's how you would do it.

jmrohwer commented 8 years ago

OK to sum up the discussion I see two options for the way forward:

  1. Implement the scan recarray with lazy loading and stick with the current implementation of scan_res. This would leave us with the issue of duplicated data.
  2. Implement scan_res as a recarray from scratch. This is my preferred solution, but could break existing scripts using scan_res in terms of backward compatibility. I'm not too sure how to handle this or how serious it is. One option is to do the recarray implementation but give the array a new name (e.g. scan_data), and then also use this in Scan1Plot(). One could then implement a lazy loading of scan_res that would create the (data duplicated) ndarray, but only when required by legacy scripts (which would be the minority cases). One could even issue deprecation warnings. This configuration is thus the "other way round" compared to the suggestion by Carl.

Ideas? Thoughts? I actually have some time on hand to do some coding, so can get stuck in once we agree :-)

bgoli commented 8 years ago

1) this would be on-demand so I could live with it i.e. not generating both arrays by default all the time 2) whatever we do we should try to keep backwards compatibility for at least a few releases or 3) implement scan_res_r as the default holder of scan results (recarray, used internally) and scan_res as a lazy load (duplicated on demand so ok) ndarray

On Wed, Nov 25, 2015 at 12:50 PM, Johann Rohwer notifications@github.com wrote:

OK to sum up the discussion I see two options for the way forward:

  1. Implement the scan recarray with lazy loading and stick with the current implementation of scan_res. This would leave us with the issue of duplicated data.
  2. Implement scan_res as a recarray from scratch. This is my preferred solution, but could break existing scripts using scan_res in terms of backward compatibility. I'm not too sure how to handle this or how serious it is. One option is to do the recarray implementation but give the array a new name (e.g. scan_data), and then also use this in Scan1Plot(). One could then implement a lazy loading of scan_res that would create the (data duplicated) ndarray, but only when required by legacy scripts (which would be the minority cases). One could even issue deprecation warnings. This configuration is thus the "other way round" compared to the suggestion by Carl.

Ideas? Thoughts? I actually have some time on hand to do some coding, so can get stuck in once we agree :-)

— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159584950.

Brett G. Olivier

exe0cdc commented 8 years ago

The lazy loading of scan_res and a new scan_res_r (or scan_data) recarray as the default scan results container sounds like the best idea to me.

Sorry to bring up another option, but it should be possible to create a new object that behaves like a recarray and like a ndarray (allowing a user to access data using traditional indexing and column names). This would obviously solve both problems, however I don't know exactly how you would implement it.

jmrohwer commented 8 years ago

Carl I am not sure if this is possible. An ndarray and recarray have different shapes. A recarray sees every record as an item. To convert a recarray to an ndarray you have to get a view on it and then reshape according to the original dimensions of the ndarray:

In [1]: a = np.arange(10.0) 

In [2]: a = a.reshape((5,2))

In [3]: a
Out[3]: 
array([[ 0.,  1.],
       [ 2.,  3.],
       [ 4.,  5.],
       [ 6.,  7.],
       [ 8.,  9.]])

In [4]: b = np.rec.fromrecords(a, names=['first','second'])

In [5]: b
Out[5]: 
rec.array([(0.0, 1.0), (2.0, 3.0), (4.0, 5.0), (6.0, 7.0), (8.0, 9.0)], 
      dtype=[('first', '<f8'), ('second', '<f8')])

In [6]: b.shape
Out[6]: (5,)

In [7]: a.shape
Out[7]: (5, 2)

In [8]: b.view()
Out[8]: rec.array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

In [15]: b.view('<f8').reshape(5,2)
Out[15]: 
array([[ 0.,  1.],
       [ 2.,  3.],
       [ 4.,  5.],
       [ 6.,  7.],
       [ 8.,  9.]])

If you do come up with something, let me know. But I suspect that such an object would have additional computational overhead since it can behave in "both ways". Not sure this is something we desire...

exe0cdc commented 8 years ago

What I have in mind will mostly behave like an ndarray but will also allow you to access results like you would in a recarray (rec.col_name or rec['col_name'])

Such an object would inherit from ndarray or uses an array to store data internally. The only difference would be that it keeps track of column names and the 'getitem' and 'setitem' methods (which allows you to access the data stored in the array with square brackets) would be overwritten. You would also create fields for each of the columns which would contain a view of that column. This is not a full implementation but here's an idea:

class TestArr(np.ndarray):
    def __init__(self,*args,**kwargs):
        super(TestArr, self).__init__(*args, **kwargs)
        self.column_names = ('a','b')
    def __getitem__(self,z):
        if type(z) == str:
            col = self.column_names.index(z)
            return self[:,col]
        else:
            return np.ndarray.__getitem__(self,z)

t = TestArr(shape=(2,2))

t[0,0] = 1.0
t[0,1] = 2.0
t[1,0] = 1.0
t[1,1] = 2.0

t
output: TestArr([[ 1.,  2.],
                [ 1.,  2.]])

t[:,0]
output : TestArr([ 1.,  1.])
t['a']
output : TestArr([ 1.,  1.])
bgoli commented 8 years ago

Sorry but I don't like this idea at all. Subclassing stuff in this case is overkill, never mind performance, serialization and maintenance. I think in the end it's either lazy import of one or the other arrays or we just live with the duplication until it becomes a problem ... then you can always manually delete one of the objects.

After consideration most high density scans are going to probably be done with the n-dimensional scanner, so, the use case for Scan1 here is more like Johann's, let's just duplicate the results to "scan_res_r" for now.

On Wed, Nov 25, 2015 at 2:45 PM, Carl Christensen notifications@github.com wrote:

What I have in mind will mostly behave like an ndarray but will also allow you to access results like you would in a recarray (rec.col_name or rec['col_name'])

Such an object would inherit from ndarray or (2) uses an array to store data internally. The only difference would be keep track of column names and the 'getitem' and 'setitem' methods (which allows you to access the data stored in the array with square brackets) would be overwritten. You would also create fields for each of the columns which would contain a view of that column. This is not a full implementation but here's an idea:

class TestArr(np.ndarray): def init(self,_args,__kwargs): super(TestArr, self).init(_args, kwargs) self.column_names = ('a','b') def getitem(self,z): if type(z) == str: col = self.column_names.index(z) return self[:,col] else: return np.ndarray.getitem**(self,z)

t = TestArr(shape=(2,2))

t[0,0] = 1.0 t[0,1] = 2.0 t[1,0] = 1.0 t[1,1] = 2.0

t output: TestArr([[ 1., 2.], [ 1., 2.]])

t[:,0] output : TestArr([ 1., 1.]) t['a'] output : TestArr([ 1., 1.])

— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159611687.

Brett G. Olivier

jmrohwer commented 8 years ago

OK I've added a commit that uses lazy loading for the record array.

I trust this is now ready for merging :-)

bgoli commented 8 years ago

Looks good to me ... go for it!

On a related note. If there is any other stuff that needs to be fixed/added now would also be a good time as we should roll out a new release soon. The matplotlib/numpy compatibility fixes are needed for the latest Anaconda (and probably Ubuntu) releases :-)

cheers b,

On Thu, Nov 26, 2015 at 12:57 PM, Johann Rohwer notifications@github.com wrote:

OK I've added a commit that uses lazy loading for the record array.

  • I've opted to keep the ndarray for mod.scan_res as default (i.e. the current situation), and calculate the record array on demand using lazy loading. The reason is that the whole PyscesPlot2 framework (which is also used by Scan1Plot() is written with ndarrays in mind. While the individual plotting functions of Scan1Plot() might be simpler to implement using record arrays, this is not the case for the whole plotting framework, and we don't want to re-write that.
  • I've opted to keep the name m.scan for the record array in favour of m.scan_res_r or m.scan_data for the following reasons:
    • I want to keep typing as short as possible (as already mentioned above). The main use cases I see for this are to interactively plot flux or concentration ratios, for example, and then the extra typing makes a huge difference. Compare mod.scan.J_R1/mod.scan.J_R2 with mod.scan_res_r.J_R1/mod.scan_res_r.J_R2 for example.
    • There are no naming conflicts or clashes.
    • Backward compatibility is maintained.

I trust this is now ready for merging :-)

— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159894717.

Brett G. Olivier