Closed jmrohwer closed 8 years ago
Hi Johann
In principle this is a good idea the only minor drawback with the current proposal is that the scan results are stored twice ... which is not so good for very large scans. What do you think of the following type thing that uses a function to generate the new array.
def getScanResAsRecArray(): "check that there are scan results" "create the record array" return "new record array"
Then all a user would have to do is: mod.Scan1(rng) scan_rec_arr = mod.getScanResAsRecArray()
This way we only store the data in one place. b.
Hi Brett
Your point is well taken, and indeed had occurred to me as well. While your solution gets rid of the problem, it is almost as "bad" in terms of required typing as what I've been using in scripts up to now:
scan_rec_arr = np.rec.fromrecords(mod.scan_res, names=[mod.scan_in]+mod.scan_out)
Your solution is still a lot of typing. I'm thinking of a use case where one is interactively exploring a model by scans and then wants to plot or otherwise work with the data. Most of the times these are smaller models (up to say 20 rxns) with scans of say 100 points, so the size of the scan_res array is not an issue in terms of memory usage.
I agree that it can become an issue for very large scans or a very large number of scan outputs. What about a flag you can change in mod.__settings__
? We can discuss whether to create the recarrary by default or not :-) The point is that your proposed solution still creates a duplicate of the results when you call your getScanResAsRecArray()
method. And I don't want to call that method after each scan.
Again, this stems from my personal use case. I agree that there needs to be a way to switch it off.
J.
Hi Johann
My editor has text completion, but point taken :-) The advantage of a get method is that it returns a copy, but doesn't store it as part of the model object itself, setting a flag is a good idea:
boolean _create_scan_resrecarray (default=True) Flag sets whether a duplicate recarray is created or not (current plan)
or
boolean _scan_res_isrecarray (default=False) Flag sets scan_res is either of type ndarray or recarray (and we update PyscesModel to deal with both)
Cheers Brett
On Wed, Nov 25, 2015 at 9:21 AM, Johann Rohwer notifications@github.com wrote:
Hi Brett
Your point is well taken, and indeed had occurred to me as well. While your solution gets rid of the problem, it is almost as "bad" in terms of required typing as what I've been using in scripts up to now: scan_rec_arr = np.rec.fromrecords(mod.scan_res, names=[mod.scan_in]+mod.scan_out) Your solution is still a lot of typing. I'm thinking of a use case where one is interactively exploring a model by scans and then wants to plot or otherwise work with the data. Most of the times these are smaller models (up to say 20 rxns) with scans of say 100 points, so the size of the scan_res array is not an issue in terms of memory usage.
I agree that it can become an issue for very large scans or a very large number of scan outputs. What about a flag you can change in mod.settings? We can discuss whether to create the recarrary by default or not :-) The point is that your proposed solution still creates a duplicate of the results when you call your getScanResAsRecArray() method. And I don't want to call that method after each scan.
Again, this stems from my personal use case. I agree that there needs to be a way to switch it off.
J.
— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159531867.
Brett G. Olivier
If you update PySCeSModel to deal with a recarray I can see no reason as to why one should not turn scan_res into a recarray finish an klaar. However, this will involve some work as Scan1()
as well as everything that depends on scan_res will have to be re-worked. Dunno how much work this is and if it is worth the effort. In any event it is trivial to retrieve the underlying ndarray from a recarray using .view('<f8')
, but it always creates a new object.
Just looking at the code, I think writing Scan1Plot()
to use a recarray as input will be much simpler... and scan_res
is used nowhere else in PyscesModel. I would therefore vouch for just changing scan_res
into a recarray. I'd be willing to code this and update the PR, just need to confirm if you'd agree to the change...
J.
I have a suggestion:
A possible solution could be leave scan_res
in order to preserve the API and then make scan
a property of the model that is "lazy loaded". An implementation could look like this:
In the PysMod init add a self._scan = None
def __init__(*args):
....
self._scan = None
and then set the property up like this:
@property
def scan(self):
if not self._scan:
self._scan = np.rec.fromrecords(self.scan_res,
names = [self.scan_in] + self.scan_out)
return self._scan
Now no extra setting will be needed and scan
(actually _scan
) will not be populated unless a user actually tries to access it. Even though scan
is set up like a method it is accessed like a normal variable field because of the property decorator.
I use this pattern all the time in PySCeSToolbox in order to prevent CPU intensive calculations from taking place during object instantiation (they are only done when a user actually wants to access the data).
EDIT: I realise that duplication will still take place, but at least it won't take place without active user intervention. I can't see any alternative where both array and recarray functionality is provided without duplication occurring somewhere.
Interesting idea. Running a new scan would simply have to set self._scan = None otherwise it wouldn't update.
On Wed, Nov 25, 2015 at 10:57 AM, Carl Christensen <notifications@github.com
wrote:
I have a suggestion:
A possible solution could be leave scan_res in order to preserve the API and then make scan a property of the model that is "lazy loaded". An implementation could look like this:
In the PysMod init add a self._scan = None
def init(*args): .... self._scan = None
and then set the property up like this:
@propertydef scan(self): if not self._scan: self._scan = np.rec.fromrecords(self.scan_res, names = [self.scan_in] + self.scan_out) return self._scan
Now no extra setting will be needed and scan (actually _scan) will not be populated unless a user actually tries to access it. Even though scan is set up like a method it is accessed like a normal variable field because of the property decorator.
I use this pattern all the time in PySCeSToolbox in order to prevent CPU intensive calculations from taking place during object instantiation (they are only done when a user actually wants to access the data).
— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159555596.
Brett G. Olivier
oops.. forgot about that, but yes, that's how you would do it.
OK to sum up the discussion I see two options for the way forward:
scan
recarray with lazy loading and stick with the current implementation of scan_res
. This would leave us with the issue of duplicated data.scan_res
as a recarray from scratch. This is my preferred solution, but could break existing scripts using scan_res
in terms of backward compatibility. I'm not too sure how to handle this or how serious it is. One option is to do the recarray implementation but give the array a new name (e.g. scan_data
), and then also use this in Scan1Plot()
. One could then implement a lazy loading of scan_res
that would create the (data duplicated) ndarray, but only when required by legacy scripts (which would be the minority cases). One could even issue deprecation warnings. This configuration is thus the "other way round" compared to the suggestion by Carl.Ideas? Thoughts? I actually have some time on hand to do some coding, so can get stuck in once we agree :-)
1) this would be on-demand so I could live with it i.e. not generating both arrays by default all the time 2) whatever we do we should try to keep backwards compatibility for at least a few releases or 3) implement scan_res_r as the default holder of scan results (recarray, used internally) and scan_res as a lazy load (duplicated on demand so ok) ndarray
On Wed, Nov 25, 2015 at 12:50 PM, Johann Rohwer notifications@github.com wrote:
OK to sum up the discussion I see two options for the way forward:
- Implement the scan recarray with lazy loading and stick with the current implementation of scan_res. This would leave us with the issue of duplicated data.
- Implement scan_res as a recarray from scratch. This is my preferred solution, but could break existing scripts using scan_res in terms of backward compatibility. I'm not too sure how to handle this or how serious it is. One option is to do the recarray implementation but give the array a new name (e.g. scan_data), and then also use this in Scan1Plot(). One could then implement a lazy loading of scan_res that would create the (data duplicated) ndarray, but only when required by legacy scripts (which would be the minority cases). One could even issue deprecation warnings. This configuration is thus the "other way round" compared to the suggestion by Carl.
Ideas? Thoughts? I actually have some time on hand to do some coding, so can get stuck in once we agree :-)
— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159584950.
Brett G. Olivier
The lazy loading of scan_res
and a new scan_res_r
(or scan_data
) recarray as the default scan results container sounds like the best idea to me.
Sorry to bring up another option, but it should be possible to create a new object that behaves like a recarray and like a ndarray (allowing a user to access data using traditional indexing and column names). This would obviously solve both problems, however I don't know exactly how you would implement it.
Carl I am not sure if this is possible. An ndarray and recarray have different shapes. A recarray sees every record as an item. To convert a recarray to an ndarray you have to get a view on it and then reshape according to the original dimensions of the ndarray:
In [1]: a = np.arange(10.0)
In [2]: a = a.reshape((5,2))
In [3]: a
Out[3]:
array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.],
[ 6., 7.],
[ 8., 9.]])
In [4]: b = np.rec.fromrecords(a, names=['first','second'])
In [5]: b
Out[5]:
rec.array([(0.0, 1.0), (2.0, 3.0), (4.0, 5.0), (6.0, 7.0), (8.0, 9.0)],
dtype=[('first', '<f8'), ('second', '<f8')])
In [6]: b.shape
Out[6]: (5,)
In [7]: a.shape
Out[7]: (5, 2)
In [8]: b.view()
Out[8]: rec.array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
In [15]: b.view('<f8').reshape(5,2)
Out[15]:
array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.],
[ 6., 7.],
[ 8., 9.]])
If you do come up with something, let me know. But I suspect that such an object would have additional computational overhead since it can behave in "both ways". Not sure this is something we desire...
What I have in mind will mostly behave like an ndarray but will also allow you to access results like you would in a recarray (rec.col_name or rec['col_name'])
Such an object would inherit from ndarray or uses an array to store data internally. The only difference would be that it keeps track of column names and the 'getitem' and 'setitem' methods (which allows you to access the data stored in the array with square brackets) would be overwritten. You would also create fields for each of the columns which would contain a view of that column. This is not a full implementation but here's an idea:
class TestArr(np.ndarray):
def __init__(self,*args,**kwargs):
super(TestArr, self).__init__(*args, **kwargs)
self.column_names = ('a','b')
def __getitem__(self,z):
if type(z) == str:
col = self.column_names.index(z)
return self[:,col]
else:
return np.ndarray.__getitem__(self,z)
t = TestArr(shape=(2,2))
t[0,0] = 1.0
t[0,1] = 2.0
t[1,0] = 1.0
t[1,1] = 2.0
t
output: TestArr([[ 1., 2.],
[ 1., 2.]])
t[:,0]
output : TestArr([ 1., 1.])
t['a']
output : TestArr([ 1., 1.])
Sorry but I don't like this idea at all. Subclassing stuff in this case is overkill, never mind performance, serialization and maintenance. I think in the end it's either lazy import of one or the other arrays or we just live with the duplication until it becomes a problem ... then you can always manually delete one of the objects.
After consideration most high density scans are going to probably be done with the n-dimensional scanner, so, the use case for Scan1 here is more like Johann's, let's just duplicate the results to "scan_res_r" for now.
On Wed, Nov 25, 2015 at 2:45 PM, Carl Christensen notifications@github.com wrote:
What I have in mind will mostly behave like an ndarray but will also allow you to access results like you would in a recarray (rec.col_name or rec['col_name'])
Such an object would inherit from ndarray or (2) uses an array to store data internally. The only difference would be keep track of column names and the 'getitem' and 'setitem' methods (which allows you to access the data stored in the array with square brackets) would be overwritten. You would also create fields for each of the columns which would contain a view of that column. This is not a full implementation but here's an idea:
class TestArr(np.ndarray): def init(self,_args,__kwargs): super(TestArr, self).init(_args, kwargs) self.column_names = ('a','b') def getitem(self,z): if type(z) == str: col = self.column_names.index(z) return self[:,col] else: return np.ndarray.getitem**(self,z)
t = TestArr(shape=(2,2))
t[0,0] = 1.0 t[0,1] = 2.0 t[1,0] = 1.0 t[1,1] = 2.0
t output: TestArr([[ 1., 2.], [ 1., 2.]])
t[:,0] output : TestArr([ 1., 1.]) t['a'] output : TestArr([ 1., 1.])
— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159611687.
Brett G. Olivier
OK I've added a commit that uses lazy loading for the record array.
mod.scan_res
as default (i.e. the current situation), and calculate the record array on demand using lazy loading. The reason is that the whole PyscesPlot2
framework (which is also used by Scan1Plot()
is written with ndarrays in mind. While the individual plotting functions of Scan1Plot()
might be simpler to implement using record arrays, this is not the case for the whole plotting framework, and we don't want to re-write that.m.scan
for the record array in favour of m.scan_res_r
or m.scan_data
for the following reasons:
mod.scan.J_R1/mod.scan.J_R2
with mod.scan_res_r.J_R1/mod.scan_res_r.J_R2
for example.I trust this is now ready for merging :-)
Looks good to me ... go for it!
On a related note. If there is any other stuff that needs to be fixed/added now would also be a good time as we should roll out a new release soon. The matplotlib/numpy compatibility fixes are needed for the latest Anaconda (and probably Ubuntu) releases :-)
cheers b,
On Thu, Nov 26, 2015 at 12:57 PM, Johann Rohwer notifications@github.com wrote:
OK I've added a commit that uses lazy loading for the record array.
- I've opted to keep the ndarray for mod.scan_res as default (i.e. the current situation), and calculate the record array on demand using lazy loading. The reason is that the whole PyscesPlot2 framework (which is also used by Scan1Plot() is written with ndarrays in mind. While the individual plotting functions of Scan1Plot() might be simpler to implement using record arrays, this is not the case for the whole plotting framework, and we don't want to re-write that.
- I've opted to keep the name m.scan for the record array in favour of m.scan_res_r or m.scan_data for the following reasons:
- I want to keep typing as short as possible (as already mentioned above). The main use cases I see for this are to interactively plot flux or concentration ratios, for example, and then the extra typing makes a huge difference. Compare mod.scan.J_R1/mod.scan.J_R2 with mod.scan_res_r.J_R1/mod.scan_res_r.J_R2 for example.
- There are no naming conflicts or clashes.
- Backward compatibility is maintained.
I trust this is now ready for merging :-)
— Reply to this email directly or view it on GitHub https://github.com/PySCeS/pysces/pull/16#issuecomment-159894717.
Brett G. Olivier
This is a small addition to the
Scan1()
method to make the results of the scan available as a record array. The fields of the record array are created automatically from a concatenated list of[mod.scan_in]+mod.scan_out
and populated with the numerical results frommod.scan_res
. Existing functionality ofmod.scan_res
is unchanged. This convenience function will make it very easy to plot selective scan results when exploring models interactively, e.g.It is also easy to do post-operations on scan results, e.g. calculate concentration ratios. Comments?