Closed duduklein closed 4 years ago
I just realized that this whole code is problematic:
if isinstance(data[0], hszinc.Quantity):
values = [each.value for each in data]
units = data[0].unit
else:
values = data
units = ''
If the 1st element in data, data[0]
, is a "nan" (float) we would not convert anything. Maybe, we could do:
values = [each.value for each in data if isinstance(each, hszinc.Quantity)]
The downside of this approach is that if test isinstance(each, hszinc.Quantity)
for every element. Can we know in advance (that at least one entry was transformed to hszinc.Quantity?
If we have units, are they always the same, if yes, we could do:
for each in data:
if isinstance(each, hszinc.Quantity):
units = each.unit
break
else:
units = ""
If not : units = [each.unit for each in data if isinstance(each, hszinc.Quantity)]
I think the NaN should be provided as a "Quantity" with NaN as value... if it's possible. The type should be provided correctly.
Github helpfully lost my earlier reply. I'll re-post here.
I just realized that this whole code is problematic:
Indeed, actually, we probably want to do something like this:
def _convert(value):
if isinstance(value, hszinc.Quantity):
return (value.value, value.unit)
else:
return (value, None)
(values, units) = zip(*map(_convert, data))
unit = units[0]
if not all(lambda u : u == unit, units[1:]):
unit = None
That might need some tweaking, but the gist is, "split up the values and the units into two different arrays, take the first unit, and if the remaining units are different, set unit to None.
Very good discussion.
Let me ask you 2 "simple" questions. 1) Why must the unit be unique. The the haystack server sends a unit per data point alongside with its timestamp and value, why do we need to strip out this information ? 2) The data we receive from the haystack server should be in the following format: "TIMESTAMP VALUE UNIT" Why do we parse it once to transform it into a hszinc.Quantity instance to deconstruct it again to have a standard list of values at the end ? Wouldn't it be better to just leave it as hszinc.Quantity or simply not use it at all ?
In terms of performance, the proposed code seems to me inefficient, since we iterate over the while lists 4 times. One for the "map", that constructs a list of tuples, than with a zip that deconstructs what we have just constructed and constructs 2 new lists and a fourth one that will iterate over the units array only to check its uniqueness (we should ignore the None values of the units array as well).
Something like the above seems more efficient and would achieve the same goal. I used the value "False" to indicate that no unit was set. Defining an UNSET value would be better.
values = []
unit = False
for each in data:
is_hszinc_quantity = isinstance(each, hszinc.Quantity)
values.append(each.value if is_hszinc_quantity else each)
if unit is False and is_hszinc_quantity:
unit = each.unit
elif unit and is_hszinc_quantity and each.unit != unit:
unit = None
I had the understanding that it was bad practice to constantly call .append
on a list for performance reasons, but a quick search seems to suggest that understanding was in error. :-)
We have to keep in mind that the result we need here is a pandas Serie. With a little add-on metadata containing units and point reference. Actually, I don’t know of a way to have a unit-aware série in Pandas (but I think this is something discussed). Which means that even if we extract unit for each data value ... we’ll end up using just one.
You are right though about the problem if there is a NaN in first row.
We could also think about making a Dataframe that could contain a second column with units. But it adds a layer of complexity I’m not super attracted to.
First thing to correct, IMO is hszinc must return a Quantity when result is +- inf or NaN. A None quantity is easily understandable as undefined or simply no-unit.
Then we can solidify this loop and deal with “what if I find a NaN in first row” “What unit must I choose if the série contains more than one ?” (A weird situation but maybe someone switch degF to degC at some point...)
The challenge with hszinc
returning a Quantity
is one of context. Take a look at the ZINC grammar for a number:
<number> := <decimal> | "INF" | "-INF" | "NaN"
<decimal> := ["-"] <digits> ["." <digits>] [<exp>] [<unit>]
<exp> := ("e"|"E") ["+"|"-"] <digits>
<unit> := <unitChar>*
<unitChar> := <alpha> | "%" | "_" | "/" | "$" | any char > 128 // see Units
NaN°C
is invalid in ZINC. So is INF°C
. (Although sometimes the weather can feel like it!)
hszinc
at present takes the view that a scalar should be represented with the most generic Python data type that as closely matches the given scalar as possible. So if a value is given without a unit, the correct type according to that policy is float
(actually, there are some cases where int
would be better… that's on my TO-DO list… but I digress).
Now, hszinc
when it considers a scalar, does not consider the values of other scalars that may exist in the grid. Column 3 may have a value in °C
in one row and °F
in another row of that same column -- nonsensical, but valid in ZINC.
Since INF
or NaN
does not carry unit information, and the context of other quantities is not considered, we come to the situation where NaN
shows up as a bare float
in amongst fully described Quantity
objects.
I see your point. I would have preferred a closer to the source fix... Anyway, this lead to a context decision to be made in pyhaystack then. We need to check type of value and make a decision about units.
In most of the cases, the units should be consistent in a grid. Technically. What if we would create a new serie (or just a list if more memory efficient) ... containing units while we build the value one. If all the units are the same... keep this value and do as we always did. If units are different (and non-null)... then attach this unit serie (or list) to the metadata. It will be possible to create a link between data and unit using this information.
Again, all that to stay in the limits of pandas... as Quantity in series.value ( at least when I first made the tests) were not as optimal than floats or ints.
This is interesting : https://pint.readthedocs.io/en/latest/pint-pandas.html
This is still at early stage and seem to be moving to : https://github.com/hgrecco/pint-pandas
Very interesting the pint-pandas thing in the pint readthedocs. Apparently it allows for one unit per series, at least it's not clear to me how to support multiple units. I'm not a panda expert, but it may be possible to add "manually" a entry to a series with a specific unit instead of initializing pd.Series directly with pre-defined values. If this is the case, it would solve the problem and even "normalize" all the values to the same unit.
Regarding the separated project, the last commit was on Jan 16, 2019, so I'm not sure it's an active project.
The zen of python says explicit is better than implicit :) I'm heading toward this :
units = ""
values = []
if self._series_format == self.FORMAT_DICT:
data = dict(data)
elif self._series_format == self.FORMAT_SERIES:
# Split into index and data.
try:
(index, data) = zip(*data)
if isinstance(data[0], hszinc.Quantity) or isinstance(data[-1], hszinc.Quantity):
for each in data:
try:
values.append(each.value)
if units == "":
units = each.unit
except AttributeError:
if isinstance(each,float):
values.append(each)
continue
else:
values = data
except ValueError:
values = []
index = []
units = ""
The first if will cover more case if bad luck gives us a nan in first row. The second part (looking for the last item) won't be evaluated if the first test is true The units will be set once. The try / except AttributeError should not interfere often as the exception would occur only in the case of a NaN. The test for float seems important to me as a "just in case"
Please test and comment, I pushed in develop
Thanks @ChristianTremblay.
Let me add some considerations and questions:
1) Keeping a pandas's series simple (without overhead for units) seems really reasonable for me. pint-pandas integrated with pandas or the separate project may offer an alternative to better handling this issue. The separate project apparently is not fully compatible with the latest version of pandas. See :https://github.com/hgrecco/pint-pandas/pull/19
2) @sjlongland , Although NaN°C
is invalid by ZINC syntax, NaN is valid. So why not return a PintQuantity where value is NaN and unit is None or "" ? I agree with ChristianTremblay that hszinc.Quantity should support it. Do you know how other packages that implement ZINC standard in other languages than python treat NaN? It would be nice to have a consistency and maybe even add some specification to the standard.
3) Checking for the 1st and last quantities in the array, IMO, is almost the same as looking for the 1st one only, i.e., as you said, relying on bad (or good) luck.
4) Code-wise. The proposed code will raise an uncaught IndexError if data is empty. Why do we have this "global" try/catch with AttributeError? Python suggests that we encapsulate the try/catch code where we expect the error and not with global statements. Having this in mind, I would remove the global try.catch and change the on in the middle to :
try:
values.append(each.value)
except AttributeError:
if isinstance(each,float):
values.append(each)
else:
if units == "":
units = each.unit
5) One simple way of transforming the code I proposed, so it's more explicit (I guess you did not like the False/None thing and I agree with you), would be to write this:
values = []
UNSET = None
MULTIPLE_UNITS = ""
for each in data:
try:
values.append(each.value)
except AttributeError:
if isinstance(each,float):
values.append(each)
else:
if unit is UNSET:
unit = each.unit
elif unit and is_hszinc_quantity and each.unit != unit:
unit = MULTIPLE_UNITS
This could be improved by using a class to define UNSET and even MULTIPLE_UNITS, but this is already very explicit and also defines unit only once.As most of the time each.value will be valid, according to python standards, you are right and we should use a try/catch instead of an "if" statement. 6) I have to say that the fact of loosing the unit info per data point still bothers me. I may use the list or dict series format until this issue is solved. BTW, is it a way to pass the wanted series format in the haystack4 session constructor like we do for the grid format?
The pint-pandas is still at early stage and it is not ready I think. But in the future, I think it would be a good choice.... in the long term.
If we refer to https://project-haystack.org/doc/Zinc, I see @sjlongland point. hszinc
goal is to fit 100% with the standard and even if I would like edge case being handled in hszinc, at the end, it is better to let it as-is. Until something different be adopted by project-haystack. I don't see changes for haystack4 but I guess it would do no harm if you post a question on project-haystack to see how other deals with this.
As explained, if first condition is true... the second one is not checked. So this line, in most of the cases won't change a thing... but if by bad luck, first item was NaN... we have another chance to detect quantity.
each.unit
have to be part of the try/catch as if data is not a quantity... the else will raise an exception.
5 & 6. I need time to get my head around this. And regarding units per series, I'm not even sure if it's a real problem (having more than one unit per serie).... In Niagara for example, there is only one facet per series. Each record is saved without the unit.
Look :
The last record is the value saved. the valueFacets correspond to the point facets. If you change to degC or degF, the series will change accordingly. You won't be able to have more than 1 unit per history serie.
This is true for Niagara, may not be the case for others, I agree.
Regarding the last question, I don't understand.
BTW, is it a way to pass the wanted series format in the haystack4 session constructor like we do for the grid format?
What do you want to do ?
Thanks for all detailed comments. I see your points.
Regarding 4 (removing the each.unit
from the try/catch), I did this, since this statement will only be executed in the "else" of the try/catch, which means that it was a hszinc.Quantity, since each.value
was successfully evaluated. Can we have something other than hszinc.Quantity that would have a value
attribute, but not a unit
one?
I thought Niagara stored history points with units. Thanks for pointing that out. I wonder what would happen to the values in the existing history table. Would they be converted according to the new unit or just left like that (the old ones in °C and the new ones in °F)?
When I wrote that
BTW, is it a way to pass the wanted series format in the haystack4 session constructor like we do for the grid format?
I wanted to know if there was a way to set the grid_format to dict or list when I initialize the Niagara4HaystackSession
object, or if I needed to pass this parameter at every "his_read_series" call. I know the default value is using "series" format, but maybe we could set this default value in the HaystackSession
constructor (init method).
In Niagara, the value will not be converted. The facets will change, that's it.
I targetted pandas Series because it make sense to use them for numerical analysis. I'll check for getting other format but IMO, it's counter productive. Loosing all the features of pandas...
I must admit, the idea of pairing values with their units in the way they have, makes me wonder: "why did they do that?" Units are a display thing, if someone needs to know the unit, consult the tag, marry the two in the UI, or in pyhaystack
's case, we could apply it to the Pandas' series
ourselves.
For what it's worth, WideSky does the same as Niagara, it just populates the records in a hisRead
with whatever the units
tag says it is (without validation no less… which gives hszinc
some consternation at times). We did have a crack at doing unit conversions in-server in our GraphQL interface, but then backed away from it because it opened a huge can of worms.
There's no requirement for a Haystack server to supply units at all in a hisRead
response, so all this unit conversion stuff is a little on the fragile side. It should provide a units
tag (String) which specifies the units being used. Maybe the answer is we have hisRead
look there to pick up the unit, and apply that to all rows in the result set?
What I'm not comfortable with is that in most of the cases, there is only one unit by column. Adding the unit to each of values is, from my point of view, a waste of resources and an unneeded supplemental layer of complexity.
I'm also pretty sure that others broke their teeth on the units problem. Which go in the same direction than your Widesky story.
I know that Pandas and numpy are widely used by the scientific community. I prefer to wait for progress from them before deciding of a breaking change in our implementation. Those guys work with weird units all the time and they use series and dataframe in their work in a way I will never be able to compete with. I think it is a safe decision to stick with "common" and widely used "things" instead of re-inventing the wheel...
Like I would never reinvent CSV. There are already too many formats out there.
I agree with both of you. It's very reasonable to have 1 unit per column and probably. As both of you pointed out, Niagara and WideSky don't store the units in the history and for some reason the haystack return includes the units, but it seems that the safest way of getting the unit is getting it from the point itself instead of getting it from the values, as suggested by @sjlongland
This would solve our main issues and we would only need to generate the "data" list.
a simple line of code as :
data = [each.value if isinstance(each, hszinc.Quantity) else each for each in data]
The test to float to me does not seem so important, because we would not know what to do in case it's not a float. Skipping the value does not seem a good idea to me. but in case you would like to do this, we could do
data = [getattr(each, "value", each) for each in data if isinstance(each, (hszinc.Quantity, float))]
What do you think?
I think this question is related to this one : https://github.com/ChristianTremblay/pyhaystack/issues/101 (at leatst for the NaN part...)
For other details about the different units or quantity, I have no time available to dig into that. So I will close for now.
@duduklein Sorry for making you wait this long... Hope the fix will at least help with the missing values.
When using the method his_read_series to read the history values of a point a "nan" (float) value raises the error "AttributeError: 'float' object has no attribute 'value'"
The error is caused at "pyhaystack/client/ops/his.py", line 127 (as time of writing)
values = [each.value for each in data]
Screenshot of niagara history containing the nan value : https://files.gitter.im/ChristianTremblay/pyhaystack/j2I0/image.png
@ChristianTremblay , let me know if you want me to make a PR and how you want to correct it. I would test for either the "value" attribute in each or check if it's a float. The advantage of the former is that we wouldn't raise an exception on other cases where "value" is not present and the disadvantage is that it may hide other issues. Personaly, I prefer the later, since it's more specific.