lmar76 opened 7 years ago
It is only possible to "merge" the products using the same timestamp. As we described, we use nearest-neighbour "interpolation" to achieve this, which in a sense keeps the "original" values. We could think of using other interpolation methods (linear, etc.), but these can introduce issues, especially in edge cases, which are fairly common in the plasma data (a lot of holes in the data, etc.), where any kind of interpolation could create anomalies. Do you have another approach for "merging" the data which you consider more suitable?
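For illustration, a minimal sketch of such a nearest-neighbour merge (using pandas; this is not the actual VirES implementation, and the small data frames are taken from the excerpt below):

```python
import pandas as pd

# 1 Hz magnetic data (the master time-line)
mag = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00Z", "2016-03-02T00:00:01Z"]),
    "F": [29674.1316, 29650.4338],
})

# 2 Hz plasma data with offset timestamps
plasma = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00.162Z",
                                 "2016-03-02T00:00:00.662Z",
                                 "2016-03-02T00:00:01.162Z"]),
    "n": [334373.1, 335152.3, 336404.7],
})

# Nearest-neighbour "interpolation": for each magnetic timestamp take the
# closest plasma record; the original values are kept, only timestamps shift.
merged = pd.merge_asof(mag, plasma, on="Timestamp", direction="nearest")
print(merged)
```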
I have discussed this issue with some people, and they suggested that "merging" the data is fine for visualizing the parameters in the scatter plot area and/or on the globe map, but the downloaded data should not be merged (because this is a deviation from the original products). So they suggested keeping the parameters coming from distinct collections separate, either by separating them within the same file (though I don't know if this can be done) or by downloading multiple files (one per collection). What do you think?
The issue is not trivial. In principle we can write this in one file, but it is less than ideal. For example, as CSV we could have the header shown in your example above, "Timestamp, F, n". For timestamps which correspond to magnetic data, F would have a value and n would be null, and for plasma data the other way around:
Combined MAGA_LR_1B-EFIA_PL_1B
Timestamp F n
------------------------------------------
2016-03-02T00:00:00Z 29674.1316 null
2016-03-02T00:00:01Z 29650.4338 null
2016-03-02T00:00:02Z 29626.7699 null
2016-03-02T00:00:03Z 29603.1449 null
2016-03-02T00:00:04Z 29579.5541 null
2016-03-02T00:00:05Z 29555.9937 null
2016-03-02T00:00:06Z 29532.4728 null
2016-03-02T00:00:07Z 29508.9925 null
2016-03-02T00:00:08Z 29485.5375 null
2016-03-02T00:00:09Z 29462.1129 null
2016-03-02T00:00:10Z 29438.7375 null
2016-03-02T00:00:00.162000Z null 334373.1
2016-03-02T00:00:00.662000Z null 335152.3
2016-03-02T00:00:01.162000Z null 336404.7
2016-03-02T00:00:01.662000Z null 336180.3
2016-03-02T00:00:02.162000Z null 337903.9
2016-03-02T00:00:02.662000Z null 338509.5
2016-03-02T00:00:03.162000Z null 339290
2016-03-02T00:00:03.662000Z null 339947.7
2016-03-02T00:00:04.162000Z null 339875.4
2016-03-02T00:00:04.662000Z null 341405
2016-03-02T00:00:05.162000Z null 342001.1
2016-03-02T00:00:05.662000Z null 342062.8
2016-03-02T00:00:06.162000Z null 342247.5
2016-03-02T00:00:06.662000Z null 342319.4
2016-03-02T00:00:07.162000Z null 343942
2016-03-02T00:00:07.662000Z null 344033.7
2016-03-02T00:00:08.162000Z null 343942.3
2016-03-02T00:00:08.662000Z null 344034
2016-03-02T00:00:09.162000Z null 347226.6
2016-03-02T00:00:09.662000Z null 347134
2016-03-02T00:00:10.162000Z null 347391.6
2016-03-02T00:00:10.662000Z null 348068.5
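A minimal sketch of how such a single un-merged file could be produced (a plain outer join on the timestamp; pandas again, with the names from the excerpt above):

```python
import pandas as pd

mag = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00Z", "2016-03-02T00:00:01Z"]),
    "F": [29674.1316, 29650.4338],
})
plasma = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00.162Z",
                                 "2016-03-02T00:00:00.662Z"]),
    "n": [334373.1, 335152.3],
})

# An outer join keeps every original timestamp of both products; the
# parameters of the respective other product stay empty ("null" in the CSV).
combined = mag.merge(plasma, on="Timestamp", how="outer").sort_values("Timestamp")
combined.to_csv("combined.csv", index=False, na_rep="null")
```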
With additional parameters a lot of redundant data would be created. Creating multiple files can also be an option, but websites that "spawn" multiple downloads (without a direct user request) often have issues, as browsers tend to block subsequent downloads. Two downloads should be fine, though, or we could create a package (such as a zip) if there are more data types.
In general I consider the merged product a nice and simple solution. As it is (for example) possible to filter the plasma data based on magnetic data (or any other combination), we already expose how the measurements relate and are brought together; all this information is lost when the data are separated. Maybe, when downloading merged data, we could show a warning explaining that the product being downloaded is generated and deviates from the original products, with a link to additional explanations of how the merge is done, and point the user to the possibility of still downloading the original data by downloading single collections only (not combined collections).
Is it possible to use "null" values also in the CDF format? I have not checked the same combined product in CDF, but I think it has the same content.
There are some people who discourage the "interpolation" in the downloaded data. Anyway, I can propose the three options to them: the merged (interpolated) product, a single un-merged file with null values, or separate files (one per collection).
Did you have any feedback about this topic at the data quality workshop in Edinburgh?
From what I can see, "NaN" values are possible in the CDF format. The combined product is basically the equivalent of the CSV, yes. I think the three options sum up the available alternatives. Personally, I would really recommend the merged option, considering how much it facilitates working with the data; in my opinion this is something the scientists would have to do themselves anyway if they wanted to investigate any correlation. But maybe this is not how they would like to work with the data, and if that is the case we of course need to switch to the option that suits them best.
I presented the feature of being able to merge the data (of multiple data types), which also allows filtering one data type based on filters applied to values of another data type. For this functionality I think we got very good reactions, but I don't think anyone considered that this means a slight shift in the timestamps (and the need for interpolation) of the measurements. In the presentation I described that merging the plasma data was not trivial, as it is somewhat "unstable", and (if I remember correctly) I explained that we use the nearest neighbour to achieve this; no questions related to this were asked.
I have a few remarks from the developer who implemented the server code:
@santilland In the current implementation, the missing/non-interpolated values are set to NaN (a proper IEEE float value) and not to null (a JavaScript feature).
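For what it is worth, a minimal sketch showing that NaN can be written to a CDF file as an ordinary CDF_DOUBLE value (assuming spacepy/pycdf is available; the file and variable names are hypothetical, not the actual server code):

```python
import numpy as np
from spacepy import pycdf

# NaN is a regular IEEE 754 float, so it is stored like any other sample.
# Passing "" as the master path creates a new (empty) CDF file.
with pycdf.CDF("combined.cdf", "") as cdf:
    cdf["n"] = np.array([334373.1, np.nan, 335152.3])
```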
@lmar76 Please consider that we have to filter the data of all selected products based on various criteria, and we have to relate records from different products with different timestamps. You may set a filter on a variable of product A and apply it to product B; e.g., you can set a filter on the magnetic field residual while wanting to see the related variables from the plasma product, and the nearest neighbour gives us the closest related records from the latter product.
In principle we could keep the timestamps of the records from the subordinate products (the products interpolated to the master time-line), de-couple the products after filtering, and deliver the filtered products separately, but we cannot avoid the time-line interpolation during the filtering itself.
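A minimal sketch of this filter-then-decouple idea (pandas; `filter_by_master` and the example predicate are hypothetical, not the actual server code):

```python
import pandas as pd

def filter_by_master(master, slave, predicate):
    """Filter the `slave` product by a predicate on `master` variables,
    relating records of the two products by nearest-neighbour timestamps.
    Both frames must be sorted by "Timestamp"."""
    # Attach the nearest master record to every slave record.
    related = pd.merge_asof(slave, master, on="Timestamp",
                            direction="nearest", suffixes=("", "_master"))
    # Apply the filter, then de-couple: keep only the slave's own columns,
    # with its original (un-shifted) timestamps.
    return related.loc[predicate(related), list(slave.columns)]

# e.g. keep the plasma records whose nearest magnetic record has F < 29600:
# filtered_plasma = filter_by_master(mag, plasma, lambda df: df["F"] < 29600)
```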
Related to this is the handling of the large gaps in the irregularly sampled products, which we observed during the implementation. These gaps can be several hours long, and the use of any kind of interpolation across them is not appropriate (currently we do interpolate them). The fix for this (gap detection) lives in a side Git branch and has not yet been merged to staging.
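One way such gap detection could look (a sketch only, not the contents of that side branch): `pd.merge_asof` accepts a `tolerance`, so master timestamps whose nearest record is farther away than a maximum gap stay NaN instead of getting a value carried across the gap:

```python
import pandas as pd

MAX_GAP = pd.Timedelta("10s")  # hypothetical threshold; the real limit is product-specific

# Reusing the mag/plasma frames from the first sketch above: with `tolerance`,
# a master timestamp whose nearest plasma record is farther away than MAX_GAP
# gets NaN instead of a value interpolated across a several-hour gap.
merged = pd.merge_asof(mag, plasma, on="Timestamp",
                       direction="nearest", tolerance=MAX_GAP)
```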
Thank you for the information. For the time being, please leave it as it is, because we don't have enough feedback from the users.
@lmar76 Is this ticket still relevant?
The MAGA_LR_1B and EFIA_PL_1B files of 2016-03-02 (SW_OPER_MAGA_LR_1B_20160302T000000_20160302T235959_Filtered.csv and SW_OPER_EFIA_PL_1B_20160302T000000_20160302T235959_Filtered.csv) downloaded from VirES have been compared to the combined one (SW_OPER_MAGA_LR_1B_SW_OPER_EFIA_PL_1B_20160302T000000_20160302T235959_Filtered.csv) for the same day.
All the measurements contained in the combined file (MAGA_LR_1B-EFIA_PL_1B) have the same timestamps as the MAGA_LR_1B: one measurement per second, at every exact second (1 Hz, exact UTC). This is fine for the MAGA_LR_1B parameters (the same values appear at the same timestamps); however, the measurements of the EFIA_PL_1B have different timestamps (2 Hz). Comparing the values of the 'n' parameter present in the EFIA_PL_1B with the values of the same parameter in the combined MAGA_LR_1B-EFIA_PL_1B, it seems that the values at seconds x.162 have been shifted to second x and the values at seconds x.662 have been discarded. This is not correct, because it means that e.g. at time 2016-03-02T00:00:00Z the value of 'n' is 334373.1, which is not true (it is a deviation from the product).
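For reference, the observed behaviour is exactly what a nearest-neighbour resampling onto the 1 Hz time-line produces (a sketch with the values from the excerpts; not the server code):

```python
import pandas as pd

mag_times = pd.DataFrame({"Timestamp": pd.to_datetime(
    ["2016-03-02T00:00:00Z", "2016-03-02T00:00:01Z"])})
plasma = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00.162Z",
                                 "2016-03-02T00:00:00.662Z",
                                 "2016-03-02T00:00:01.162Z"]),
    "n": [334373.1, 335152.3, 336404.7],
})

# Each exact second x is 0.162 s away from sample x.162 but 0.338 s away from
# sample (x-1).662, so x.162 always "wins" and x.662 is never selected.
combined = pd.merge_asof(mag_times, plasma, on="Timestamp", direction="nearest")
print(combined)  # n at 2016-03-02T00:00:00Z is 334373.1, as observed
```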
(Excerpts of the MAGA_LR_1B, EFIA_PL_1B, and combined MAGA_LR_1B-EFIA_PL_1B files were attached here for comparison.)