lmar76 opened 7 years ago
It is only possible to "merge" the products using the same timestamp. As we described, we use nearest-neighbour "interpolation" to achieve this, which in a sense keeps the "original" values. We could think of using other interpolation methods (linear, etc.), but these can introduce issues, especially in edge cases, which are fairly common in the plasma data (a lot of holes in the data, etc.), where any kind of interpolation could create anomalies. Do you have another approach for "merging" the data which you consider more suitable?
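For illustration, a minimal sketch of such a nearest-neighbour merge (using pandas; this is not the actual VirES implementation, and the small data frames are taken from the excerpt below):

```python
import pandas as pd

# 1 Hz magnetic data (the master time-line)
mag = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00Z", "2016-03-02T00:00:01Z"]),
    "F": [29674.1316, 29650.4338],
})

# 2 Hz plasma data with offset timestamps
plasma = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00.162Z",
                                 "2016-03-02T00:00:00.662Z",
                                 "2016-03-02T00:00:01.162Z"]),
    "n": [334373.1, 335152.3, 336404.7],
})

# Nearest-neighbour "interpolation": for each magnetic timestamp take the
# closest plasma record; the original values are kept, only timestamps shift.
merged = pd.merge_asof(mag, plasma, on="Timestamp", direction="nearest")
print(merged)
```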
I have discussed this issue with some people, and they suggested that "merging" the data is fine for visualizing the parameters in the scatter plot area and/or on the globe map, but the downloaded data should not be merged (because this is a deviation from the original products). So they suggested keeping the parameters coming from distinct collections separate, either by separating them within the same file (though I don't know if this can be done) or by downloading multiple files (one per collection). What do you think?
The issue is not trivial. In principle we can write this in one file, but it is less than ideal. For example, as CSV we could have the header shown in your example above, "Timestamp, F, n". For timestamps which correspond to magnetic data, F would have a value and n would be null, and for plasma data the other way around:
Combined MAGA_LR_1B-EFIA_PL_1B
Timestamp F n
------------------------------------------
2016-03-02T00:00:00Z 29674.1316 null
2016-03-02T00:00:01Z 29650.4338 null
2016-03-02T00:00:02Z 29626.7699 null
2016-03-02T00:00:03Z 29603.1449 null
2016-03-02T00:00:04Z 29579.5541 null
2016-03-02T00:00:05Z 29555.9937 null
2016-03-02T00:00:06Z 29532.4728 null
2016-03-02T00:00:07Z 29508.9925 null
2016-03-02T00:00:08Z 29485.5375 null
2016-03-02T00:00:09Z 29462.1129 null
2016-03-02T00:00:10Z 29438.7375 null
2016-03-02T00:00:00.162000Z null 334373.1
2016-03-02T00:00:00.662000Z null 335152.3
2016-03-02T00:00:01.162000Z null 336404.7
2016-03-02T00:00:01.662000Z null 336180.3
2016-03-02T00:00:02.162000Z null 337903.9
2016-03-02T00:00:02.662000Z null 338509.5
2016-03-02T00:00:03.162000Z null 339290
2016-03-02T00:00:03.662000Z null 339947.7
2016-03-02T00:00:04.162000Z null 339875.4
2016-03-02T00:00:04.662000Z null 341405
2016-03-02T00:00:05.162000Z null 342001.1
2016-03-02T00:00:05.662000Z null 342062.8
2016-03-02T00:00:06.162000Z null 342247.5
2016-03-02T00:00:06.662000Z null 342319.4
2016-03-02T00:00:07.162000Z null 343942
2016-03-02T00:00:07.662000Z null 344033.7
2016-03-02T00:00:08.162000Z null 343942.3
2016-03-02T00:00:08.662000Z null 344034
2016-03-02T00:00:09.162000Z null 347226.6
2016-03-02T00:00:09.662000Z null 347134
2016-03-02T00:00:10.162000Z null 347391.6
2016-03-02T00:00:10.662000Z null 348068.5
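A minimal sketch of how such a single un-merged file could be produced (a plain outer join on the timestamp; pandas again, with the names from the excerpt above):

```python
import pandas as pd

mag = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00Z", "2016-03-02T00:00:01Z"]),
    "F": [29674.1316, 29650.4338],
})
plasma = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00.162Z",
                                 "2016-03-02T00:00:00.662Z"]),
    "n": [334373.1, 335152.3],
})

# An outer join keeps every original timestamp of both products; the
# parameters of the respective other product stay empty ("null" in the CSV).
combined = mag.merge(plasma, on="Timestamp", how="outer").sort_values("Timestamp")
combined.to_csv("combined.csv", index=False, na_rep="null")
```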
With additional parameters a lot of redundant data would be created. Creating multiple files can also be an option, but websites that "spawn" multiple downloads (without a direct user request) often have issues, as browsers tend to block subsequent downloads. Two downloads should be fine, though, or we could create a package (such as a zip) if there are more data types.
In general I consider the merged product a nice and simple solution. As it is (for example) possible to filter the plasma data based on magnetic data (or any other combination), we already expose how the measurements relate and are brought together; all this information is lost when the data are separated. Maybe, when downloading merged data, we could show a warning explaining that the product being downloaded is generated and deviates from the original products, with a link to additional explanations of how the merge is done, and point the user to the possibility of still downloading the original data by downloading single collections only (not combined collections).
Is it possible to use "null" values also in the CDF format? I have not checked the same combined product in CDF, but I think it has the same content.
There are some people who discourage the "interpolation" in the downloaded data. Anyway, I can propose the three options to them: the merged (interpolated) product, a single un-merged file with null values, or separate files (one per collection).
Did you have any feedback about this topic at the data quality workshop in Edinburgh?
From what I can see, "NaN" values are possible in the CDF format. The combined product is basically the equivalent of the CSV, yes. I think the three options sum up the available alternatives. Personally, I would really recommend the merged option, considering how much it facilitates working with the data; in my opinion this is something the scientists would have to do themselves anyway if they wanted to investigate any correlation. But maybe this is not how they would like to work with the data, and if that is the case we of course need to switch to the option that suits them best.
I presented the feature of being able to merge the data (of multiple data types), which also allows filtering one data type based on filters applied to values of another data type. For this functionality I think we got very good reactions, but I don't think anyone considered that this means a slight shift in the timestamps (and the need for interpolation) of the measurements. In the presentation I described that merging the plasma data was not trivial, as it is somewhat "unstable", and (if I remember correctly) I explained that we use the nearest neighbour to achieve this; no questions related to this were asked.
I have a few remarks from the developer who implemented the server code:
@santilland In the current implementation, the missing/non-interpolated values are set to NaN (a proper IEEE float value) and not to null (a JavaScript feature).
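For what it is worth, a minimal sketch showing that NaN can be written to a CDF file as an ordinary CDF_DOUBLE value (assuming spacepy/pycdf is available; the file and variable names are hypothetical, not the actual server code):

```python
import numpy as np
from spacepy import pycdf

# NaN is a regular IEEE 754 float, so it is stored like any other sample.
# Passing "" as the master path creates a new (empty) CDF file.
with pycdf.CDF("combined.cdf", "") as cdf:
    cdf["n"] = np.array([334373.1, np.nan, 335152.3])
```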
@lmar76 Please consider that we have to filter the data of all selected products based on various criteria, and we have to relate records from different products with different timestamps. You may set a filter on a variable of product A and apply it to product B; e.g., you can set a filter on the magnetic field residual while wanting to see the related variables from the plasma product, and the nearest neighbour gives us the closest related records from the latter product.
In principle we could keep the timestamps of the records from the subordinate products (the products interpolated to the master time-line), de-couple the products after filtering, and deliver the filtered products separately, but we cannot avoid the time-line interpolation during the filtering itself.
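A minimal sketch of this filter-then-decouple idea (pandas; `filter_by_master` and the example predicate are hypothetical, not the actual server code):

```python
import pandas as pd

def filter_by_master(master, slave, predicate):
    """Filter the `slave` product by a predicate on `master` variables,
    relating records of the two products by nearest-neighbour timestamps.
    Both frames must be sorted by "Timestamp"."""
    # Attach the nearest master record to every slave record.
    related = pd.merge_asof(slave, master, on="Timestamp",
                            direction="nearest", suffixes=("", "_master"))
    # Apply the filter, then de-couple: keep only the slave's own columns,
    # with its original (un-shifted) timestamps.
    return related.loc[predicate(related), list(slave.columns)]

# e.g. keep the plasma records whose nearest magnetic record has F < 29600:
# filtered_plasma = filter_by_master(mag, plasma, lambda df: df["F"] < 29600)
```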
Related to this is the handling of the large gaps in the irregularly sampled products, which we observed during the implementation. These gaps can be several hours long, and the use of any kind of interpolation across them is not appropriate (currently we do interpolate them). The fix for this (gap detection) lives in a side Git branch and has not yet been merged to staging.
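One way such gap detection could look (a sketch only, not the contents of that side branch): `pd.merge_asof` accepts a `tolerance`, so master timestamps whose nearest record is farther away than a maximum gap stay NaN instead of getting a value carried across the gap:

```python
import pandas as pd

MAX_GAP = pd.Timedelta("10s")  # hypothetical threshold; the real limit is product-specific

# Reusing the mag/plasma frames from the first sketch above: with `tolerance`,
# a master timestamp whose nearest plasma record is farther away than MAX_GAP
# gets NaN instead of a value interpolated across a several-hour gap.
merged = pd.merge_asof(mag, plasma, on="Timestamp",
                       direction="nearest", tolerance=MAX_GAP)
```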
Thank you for the information. For the time being, please leave it as it is, because we don't have enough feedback from the users.
@lmar76 Is this ticket still relevant?
The MAGA_LR_1B and EFIA_PL_1B files of 2016-03-02 (SW_OPER_MAGA_LR_1B_20160302T000000_20160302T235959_Filtered.csv and SW_OPER_EFIA_PL_1B_20160302T000000_20160302T235959_Filtered.csv) downloaded from VirES have been compared to the combined one (SW_OPER_MAGA_LR_1B_SW_OPER_EFIA_PL_1B_20160302T000000_20160302T235959_Filtered.csv) for the same day.
All the measurements contained in the combined file (MAGA_LR_1B-EFIA_PL_1B) have the same timestamps as the MAGA_LR_1B: one measurement per second, at every exact second (1 Hz, exact UTC). This is fine for the MAGA_LR_1B parameters (the same values appear at the same timestamps); however, the measurements of the EFIA_PL_1B have different timestamps (2 Hz). Comparing the values of the 'n' parameter present in the EFIA_PL_1B with the values of the same parameter in the combined MAGA_LR_1B-EFIA_PL_1B, it seems that the values at seconds x.162 have been shifted to second x and the values at seconds x.662 have been discarded. This is not correct, because it means that e.g. at time 2016-03-02T00:00:00Z the value of 'n' is 334373.1, which is not true (it is a deviation from the product).
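For reference, the observed behaviour is exactly what a nearest-neighbour resampling onto the 1 Hz time-line produces (a sketch with the values from the excerpts; not the server code):

```python
import pandas as pd

mag_times = pd.DataFrame({"Timestamp": pd.to_datetime(
    ["2016-03-02T00:00:00Z", "2016-03-02T00:00:01Z"])})
plasma = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2016-03-02T00:00:00.162Z",
                                 "2016-03-02T00:00:00.662Z",
                                 "2016-03-02T00:00:01.162Z"]),
    "n": [334373.1, 335152.3, 336404.7],
})

# Each exact second x is 0.162 s away from sample x.162 but 0.338 s away from
# sample (x-1).662, so x.162 always "wins" and x.662 is never selected.
combined = pd.merge_asof(mag_times, plasma, on="Timestamp", direction="nearest")
print(combined)  # n at 2016-03-02T00:00:00Z is 334373.1, as observed
```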
(Excerpts of the MAGA_LR_1B, EFIA_PL_1B, and combined MAGA_LR_1B-EFIA_PL_1B files were attached here for comparison.)