andreatramacere / magic-backend


Nan and empty fields #3

Open micheledoro opened 4 years ago

micheledoro commented 4 years ago

In our current format, we have decided that every column has an entry in every row. Where the data is not present we put 'nan', for example in the upper error of an upper limit. Is this OK? However, there are cases in which the central point is not given, e.g. when one has only the lower and upper edges of the frequency bin.

GRB190114c nan 1.2194645e+19 2.475779e+19 5.000393e-08 4.9619406e-09 4.9619406e-09 68.0 110.0 42.0 'GBM' "blue cross in fig2"
GRB190114C nan 2.475779e+19 7.150961e+19 4.117594e-08 8.413029e-09 8.413029e-09 68.0 100.0 42.0 'GBM' "blue cross in fig2"

Another case where only the extremes are available is a quoted butterfly, where the central point is not given.

What should we do in these cases? Does it make sense to fill in the data with some reasonable value, such as 'the geometrical mean point'?
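As a minimal sketch of the 'geometrical mean point' idea: the function name `geometric_mean_point` is hypothetical, and it is an assumption that the two values following `nan` in the rows above are the lower and upper bin edges (unit not stated in the thread). The geometric mean is the midpoint of the bin in log space.

```python
import math

def geometric_mean_point(nu_low, nu_high):
    """Return sqrt(nu_low * nu_high), i.e. the bin midpoint in log space.

    Hypothetical helper for illustration, not part of any existing code.
    """
    if nu_low <= 0 or nu_high <= 0:
        raise ValueError("bin edges must be positive")
    return math.sqrt(nu_low * nu_high)

# Assumed bin edges, taken from the first GBM row quoted above
nu_low, nu_high = 1.2194645e+19, 2.475779e+19
nu_center = geometric_mean_point(nu_low, nu_high)
```

Whether such a derived value should be written into the file, or computed only at read time, is a separate choice that this sketch does not settle.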

andreatramacere commented 4 years ago

1) Putting nan can work, but we need to avoid confusion with the other cases you mention above.
2) It makes sense to have 'the geometrical mean point', but we need to understand how to treat these data when they are used for model fitting.
3) On top of this, we should also have a column with a boolean flag for the upper limits.

In conclusion, if we filter nan points, I would suggest adding another column that describes the data format of each row, for example:
-) butterfly -> butterfly
-) error -> data with a best value and +/- error
-) edge -> data with only +/- error (no central value)
-) ul -> upper limits

In this scenario, butterfly and edge points can be treated separately, whereas a row with the error dataformat and nan in the flux column will be ignored. To make things clearer, we can use a masked table in astropy (https://docs.astropy.org/en/stable/table/masking.html) to mask the flux entry for butterflies and edges, using as fill_value an integer that corresponds to the method used to evaluate the flux (e.g. 0 -> 'mean', 1 -> 'geometrical mean', or -1 -> 'ignore', as in the case of a butterfly). We could end up with something like:

# flux     err+     err-    dataformat  UL
1E-12      1E-13    2E-13   error       False
-1         1E-13    2E-13   butterfly   False
1          1E-13    2E-13   edge        False
1E-14      -1       -1      ul          True
nan        nan      nan     error       False

The last row will be skipped, and the others will be treated according to the dataformat and UL values and the code given by the fill value. The UL column might be redundant, but it is probably better to keep it.
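A rough sketch of how a reader could dispatch on the dataformat column, using plain Python tuples rather than an astropy table. The function name `classify` and the action labels ("fit", "separate", "upper_limit", "skip") are hypothetical and only illustrate the rule described above.

```python
import math

# Rows copied from the example table: (flux, err_plus, err_minus, dataformat)
rows = [
    (1e-12, 1e-13, 2e-13, "error"),
    (-1.0, 1e-13, 2e-13, "butterfly"),
    (1.0, 1e-13, 2e-13, "edge"),
    (1e-14, -1.0, -1.0, "ul"),
    (math.nan, math.nan, math.nan, "error"),
]

def classify(row):
    """Return a hypothetical action label for one row."""
    flux, _err_plus, _err_minus, dataformat = row
    if dataformat == "error":
        # An 'error' row without a flux value carries no information
        return "skip" if math.isnan(flux) else "fit"
    if dataformat in ("butterfly", "edge"):
        # The flux entry is a method code here, not a measurement
        return "separate"
    if dataformat == "ul":
        return "upper_limit"
    raise ValueError(f"unknown dataformat: {dataformat!r}")

actions = [classify(row) for row in rows]
```

Note that in this sketch the upper limit is recognised from the dataformat alone, so no UL column is needed.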

micheledoro commented 4 years ago

Interesting solutions, at least that of specifying the data format. I believe this can reduce the errors when using our data.

However, I still prefer to use 'nan' instead of integers. With nan, even if you read the file carelessly, there is no risk of taking the integer for a real value, and the program may flag the problem for you.
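A small illustration of this point, with made-up flux values: an integer sentinel silently enters an average, while nan propagates to the result and is easy to detect. Strictly, nan does not raise an error in plain Python arithmetic; it makes the result nan.

```python
import math

# Made-up values for illustration; the middle entry is the missing one
fluxes_sentinel = [1e-12, -1.0, 2e-12]    # missing value coded as -1
fluxes_nan = [1e-12, math.nan, 2e-12]     # missing value coded as nan

mean_sentinel = sum(fluxes_sentinel) / len(fluxes_sentinel)
mean_nan = sum(fluxes_nan) / len(fluxes_nan)

# mean_sentinel is a finite but meaningless number (about -0.33);
# mean_nan is nan, so the missing entry cannot go unnoticed.
```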

I don't see the necessity of the last column (UL), why do you need it?

andreatramacere commented 4 years ago

1) If we decide that each dataformat has a single behaviour, then we can use nan everywhere. Even so, at some point we might need to distinguish nan meaning 'not a number' from nan meaning 'ignore it'.

2) Regarding the UL column, yes, we can remove it. It is a personal bias, since I tend to overcomplicate things... :)

micheledoro commented 4 years ago

So the 'data format' column is OK for me. For example, it helps a lot when you have both butterfly data and error data in the same file. I will implement that. Just to be fully clear: 'edge' is when you don't have the central point but it is not a butterfly, right?

andreatramacere commented 4 years ago

yes