SEE-GEO / ccic

Chalmers Cloud Ice Climatology
MIT License

Explore training data #4

Closed simonpf closed 1 year ago

simonpf commented 1 year ago

An exploratory data analysis should be performed to ensure that the training data is consistent.

adriaat commented 1 year ago

An exploratory data analysis has been completed, and is compiled in three Jupyter Notebooks. The plots in these notebooks are illustrative and, to some extent, self-explanatory, but here I summarize the findings.

Availability of the data

The two data products we are using (2B-CLDCLASS and 2C-ICE) state that:

P1_R05 is the current version. R04 products will be available until all R05 products have been released.

The number of available R04 and R05 granules was analysed, and the number of missing R05 granules is relatively small.
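For reference, a minimal sketch of how such a granule inventory can be done by counting releases in the granule filenames (the filename pattern below is only illustrative; the actual CloudSat naming may differ):

```python
from collections import Counter

def summarize_versions(granule_names):
    """Count granules per release tag (R04/R05) found in the filenames."""
    counts = Counter()
    for name in granule_names:
        for release in ("R04", "R05"):
            if f"_{release}_" in name:
                counts[release] += 1
    return counts

# Hypothetical filenames, loosely modelled on CloudSat granule names.
names = [
    "2019002235042_67551_CS_2C-ICE_GRANULE_P1_R05_E02.hdf",
    "2019002221152_67550_CS_2C-ICE_GRANULE_P1_R05_E02.hdf",
    "2019001003823_67523_CS_2C-ICE_GRANULE_P_R04_E02.hdf",
]
print(summarize_versions(names))  # Counter({'R05': 2, 'R04': 1})
```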

Coverage of the collocations

The training data we had compiled as of 1 November 2022 was used for the analysis. Note that running the notebooks can require substantial computational time and can be resource intensive, requiring more than 60 GB of RAM in some parts.

There are many more GPMIR samples than GridSat samples (by samples I refer to the 256x256 images extracted from the collocations). This is sensible, as GPMIR has more observations (half-hourly resolution versus 3-hourly). The sample density of GPMIR and GridSat is approximately equal, and so is the number of daytime and nighttime collocations. What is surprising is the spatial coverage: there are few, if any, samples towards 180º E and towards 90º S, which may be an issue in our code. GPMIR shows an empty area around Australia, and GridSat presents collocations only along certain swaths. These last two observations may be caused by the data itself, and are also reflected in the grid coverage: only 6.2% of the GridSat grid is covered by collocations, whereas 23.6% of the GPMIR grid is.
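A grid-coverage fraction like the 6.2% / 23.6% above can be computed by binning the collocation centres onto the product grid and counting non-empty cells. A sketch under assumed inputs (the variable names and the coarse toy grid are illustrative, not the actual product grids):

```python
import numpy as np

def grid_coverage(lats, lons, lat_edges, lon_edges):
    """Fraction of grid cells containing at least one collocation centre."""
    counts, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges])
    return (counts > 0).mean()

# Toy example: collocations confined to half the longitudes,
# so at most half of the grid cells can be covered.
rng = np.random.default_rng(0)
lats = rng.uniform(-60, 60, 100)
lons = rng.uniform(-180, 0, 100)
lat_edges = np.linspace(-90, 90, 5)
lon_edges = np.linspace(-180, 180, 9)
print(f"{grid_coverage(lats, lons, lat_edges, lon_edges):.2f}")
```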

The temporal coverage of the collocations is approximately constant over time and similar for both products.

The analysis of the coverage along latitude reveals a pattern, with many more collocations towards the higher northern latitudes (refer to the plot), and a cyclical pattern along longitude.

Specific data distributions

There are small differences between the distributions of GPMIR brightness temperature (T_B) and GridSat T_B, but this is plausible given that GridSat only provides collocations along certain swaths.

There are no surprises in the distributions of IWP or T_B, although for IWP there may be a problem in the code, reflected exactly at 0º N, at 5º N, and towards 180º E. There is a clear IWP vs T_B relationship, but with large variability.[^1]

[^1]: It would be interesting to see whether the network condenses this relationship into a single curve and then uses that curve to issue a prediction, or does something smarter, which it should, since it predicts a number of quantiles.
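One simple way to condense the noisy IWP vs T_B scatter into a single curve is a binned conditional median. A sketch on synthetic data (the data-generating model below, colder cloud tops giving larger IWP, is only for illustration):

```python
import numpy as np

def binned_median(tb, iwp, edges):
    """Median IWP in each brightness-temperature bin."""
    idx = np.digitize(tb, edges) - 1
    return np.array([
        np.median(iwp[idx == i]) if np.any(idx == i) else np.nan
        for i in range(len(edges) - 1)
    ])

# Synthetic data: IWP decays roughly exponentially with T_B,
# with lognormal scatter around that relationship.
rng = np.random.default_rng(1)
tb = rng.uniform(200, 300, 5000)
iwp = np.exp((280 - tb) / 20) * rng.lognormal(0.0, 0.5, 5000)
edges = np.linspace(200, 300, 11)
curve = binned_median(tb, iwp, edges)  # decreasing curve of 10 medians
```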

It is also seen that non-zero IWC tends to be more frequent around the middle of the height interval considered; the mean mass height Z_M is also presented.

There is an error for the cloud mask (a potential solution is commented here), but a cloud mask can be inferred from the cloud class labels of each profile. More than 90% of the cloud class values are no cloud. The cloud class data does contain invalid values (more so at higher altitudes), but this fraction can be considered tiny, and about 40-50% of the profiles contain a cloud according to the cloud mask inferred from the class labels. However, some profiles classified as cloud-free have non-zero IWP, which can be regarded as an inconsistency. Nevertheless, this only occurs for IWP < 1 g/m², and likely stems from the subsampling performed on the cloud class label data.
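The inferred cloud mask and the consistency check described above can be sketched as follows (the class encoding, with 0 meaning no cloud, and the array layout are assumptions; the 1e-3 kg/m² threshold corresponds to the 1 g/m² mentioned above):

```python
import numpy as np

def cloud_mask_from_classes(cloud_class):
    """A profile is cloudy if any height bin has a class other than
    'no cloud' (assumed to be label 0 here)."""
    return (cloud_class > 0).any(axis=-1)

def inconsistent_profiles(cloud_class, iwp, threshold=1e-3):
    """Profiles flagged cloud-free by the class labels but with
    IWP (in kg/m^2) above the threshold."""
    mask = cloud_mask_from_classes(cloud_class)
    return (~mask) & (iwp > threshold)

# Three toy profiles with four height bins each.
cloud_class = np.array([[0, 0, 0, 0],   # cloud-free, zero IWP: consistent
                        [0, 1, 0, 0],   # cloudy: consistent
                        [0, 0, 0, 0]])  # cloud-free but IWP > 0: inconsistent
iwp = np.array([0.0, 0.5, 0.01])
print(inconsistent_profiles(cloud_class, iwp))  # [False False  True]
```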

Finally, a 2-D random projection shows noticeable structure between the cloud class labels and the T_Bs, as well as with the IWP (other visualization methods were discarded, mainly for being too computationally expensive and data-dependent, so using only a subset of the data could have consequences). We can also see some association strength between cloud class labels and IWCs.
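For reference, the core of a Gaussian random projection is just a matrix product with a random matrix, here written by hand to keep the sketch dependency-free (scikit-learn's `GaussianRandomProjection` does the equivalent; the data shape is illustrative):

```python
import numpy as np

def random_projection(X, n_components=2, seed=0):
    """Project rows of X to n_components dimensions using a Gaussian
    random matrix scaled by 1/sqrt(n_components)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(n_components),
                   size=(X.shape[1], n_components))
    return X @ R

# 100 toy samples with 50 features each, projected to 2-D for plotting.
X = np.random.default_rng(2).normal(size=(100, 50))
Y = random_projection(X)
print(Y.shape)  # (100, 2)
```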

Invalid data: the cloud class labels can contain invalid data, but as mentioned the fraction is relatively tiny. No invalid data was found for IWP or IWC (except for one IWC value in each product; perhaps something went wrong there), but between 96% and 99% of the T_B values are invalid data.
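Counting invalid values amounts to checking for NaNs plus any product-specific fill value; a sketch (the fill value -9999 below is an assumption, the actual fill values depend on the product):

```python
import numpy as np

def invalid_fraction(a, fill_value=None):
    """Fraction of entries that are NaN or equal to the fill value."""
    invalid = np.isnan(a)
    if fill_value is not None:
        invalid |= (a == fill_value)
    return invalid.mean()

tb = np.array([230.0, np.nan, -9999.0, 255.0])
print(invalid_fraction(tb, fill_value=-9999.0))  # 0.5
```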

Conclusion

The exploratory data analysis has confirmed that the collocations are consistent with prior knowledge (there is a relationship with IR, groupings of cloud classes can be indicative of the magnitude of IWP, etc.), but it has also revealed some shortcomings of the compiled training data. Some of these issues may be inherent in the data, but others may be connected to our code. Although we have extracted a substantial amount of training data and are, so far, happy with its quality, the code could be refined before any subsequent re-use for extracting further training or validation data.

adriaat commented 1 year ago

The data exploration of the training data from the second version of the dataset is now ready in https://github.com/adriaat/ccic/tree/data_exploration_v2/notebooks/exploratory_data_analysis.