NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

Feature request: Python tools for obs sequences #742

Open hkershaw-brown opened 1 month ago

hkershaw-brown commented 1 month ago

Use case

For CROCODILE, Python based tools for observation space diagnostics. Might be useful more generally for DART, so adding this issue to track.

Is your feature request related to a problem?

Originally for CROCODILE obs space diagnostic plotting in the Python ecosystem, but the ability to examine obs sequences in a dataframe in a Jupyter notebook (or Python tool of your choice) is quite helpful, e.g. finding duplicates in obs sequences, looking at output from obs converters, subsetting observations (in space, time, or by X), splitting and joining obs sequences. No need to run obs_diag to bin observations; you can read the obs_sequence into a dataframe directly.

Example finding duplicates: [screenshot, 2024-09-26]
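
A minimal sketch of the duplicate search with plain pandas, for readers without the screenshot. The column names (`type`, `longitude`, `latitude`, `time`, `observation`) and the toy values are assumptions for illustration; the actual columns of a pyDARTdiags dataframe may differ, so check `obs_seq.df.columns` first.

```python
import pandas as pd

# Toy stand-in for an obs sequence dataframe (hypothetical columns and values).
df = pd.DataFrame(
    {
        "type": ["TEMP", "TEMP", "SALT", "TEMP"],
        "longitude": [10.0, 10.0, 12.5, 11.0],
        "latitude": [-5.0, -5.0, -4.0, -5.0],
        "time": pd.to_datetime(
            ["2018-01-01 06:00", "2018-01-01 06:00",
             "2018-01-01 06:00", "2018-01-01 12:00"]
        ),
        "observation": [281.2, 281.2, 35.1, 280.9],
    }
)

# Columns that together identify "the same" observation (an assumption).
key = ["type", "longitude", "latitude", "time"]

# keep=False marks every member of a duplicate group, not just the repeats.
dup_mask = df.duplicated(subset=key, keep=False)
duplicates = df[dup_mask]

# Keep the first occurrence of each group and drop the rest.
deduplicated = df.drop_duplicates(subset=key, keep="first")

print(duplicates)
print(len(df), "->", len(deduplicated), "rows after dropping duplicates")
```

Which columns define a duplicate is a judgment call; adding `observation` to `key` would flag only rows whose values also match.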

Describe your preferred solution

https://github.com/NCAR/pyDARTdiags. See issues for various notes and docs for documentation. https://pypi.org/project/pydartdiags/ (but recommend you do a local editable pip install if you are developing this or playing with it). BUYER BEWARE: this is bleeding edge.

Describe any alternatives you have considered

Currently using pandas, which seems ok (tried naively loading 20GB obs sequences one after the other, actually worked on my Mac). Probably need to think about big-big data tools for going larger (and maybe faster). Also keeping notes on other observation tools (https://github.com/NCAR/pyDARTdiags/discussions/4).
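
As a rough sketch of the "load one after the other" approach in plain pandas: stack the per-file dataframes with `pd.concat` and check the in-memory footprint before going larger. The helper `combine_frames`, the `source_file` column, and the toy frames are hypothetical and not part of pyDARTdiags.

```python
import pandas as pd


def combine_frames(frames):
    """Stack per-file dataframes, tagging each row with the index of its source."""
    tagged = [frame.assign(source_file=i) for i, frame in enumerate(frames)]
    return pd.concat(tagged, ignore_index=True)


# Toy stand-ins for dataframes read from two obs sequence files.
frame_a = pd.DataFrame({"observation": [1.0, 2.0], "qc": [0, 0]})
frame_b = pd.DataFrame({"observation": [3.0], "qc": [2]})

combined = combine_frames([frame_a, frame_b])

# deep=True also counts the contents of object (string) columns.
size_bytes = int(combined.memory_usage(deep=True).sum())
print(f"{len(combined)} rows, about {size_bytes} bytes in memory")
```

Tagging rows with their source makes it possible to split the combined frame back into per-file pieces later with `groupby("source_file")`.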

kdraeder commented 12 hours ago

I worked through the Quickstart guide as a fairly naive Python user. Here are some of the hurdles I dealt with and suggestions.

I don't have Python on my laptop, so I had to decide whether to install it or find it somewhere else. Looking for how to install it on my laptop led to too many choices, about which I know almost nothing. I opted for derecho and/or casper.

Summary of the CLI efforts; partially successful.

[These instructions would have been more helpful to me.]

Convert your obs_seq file(s?) to ASCII, if needed.
   Will this always be necessary?
   Multiple files?
Run on supercomputers (or where python3 is installed).
> bash (as instructed in dartdiags/bin/activate)
Other python instructions I've seen say to load the conda module,
to manage python packages.
> module load conda
`module` is not a command in my bash environment.
But that environment has access to the conda command,
  and python -> python3.10.
Proceeding without conda.
  > python3 -m venv dartdiags
  > source dartdiags/bin/activate
  > pip install pydartdiags
  > python -  (Run interactively, apparently a "basic" session.)
>>> from pydartdiags.obs_sequence import obs_sequence as obsq
>>> from pydartdiags.plots import plots
Now it's ready to run diagnostics.
>>> obs_seq = obsq.obs_sequence('$your_path/obs_seq.final.ascii')
I wasn't aware of how to find a list of functions.  Once I learned that they
are in `plots`, I could see them with
>>> help(plots)
>>> obs_seq.df.head()
Shows what I expect.
>>> df_qc0 = obsq.select_by_dart_qc(obs_seq.df, 0)
>>> plots.plot_rank_histogram(df_qc0)
Gives a window that is blank except for '[][][] Viewing<>' at the bottom left.
It turns out that this is a vi window with no contents in the file.
I had to kill the vi, w3m, and sh processes in order to get to a place
where I could proceed.
Helen says that "The plots are using plotly, which outputs to a browser.
Use a Jupyter notebook, e.g. use JupyterHub on Derecho."
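
One way to anticipate the blank-window problem above is to check which browser Python would launch before plotting. This is a sketch only: it assumes the plotting library opens figures through Python's standard `webbrowser` module, and the helper names and the `TEXT_BROWSERS` set are made up for illustration.

```python
import os
import webbrowser

# Text-mode browsers that a terminal-only session may fall back to.
TEXT_BROWSERS = {"www-browser", "links", "elinks", "lynx", "w3m"}


def default_browser_name():
    """Return the name of the browser `webbrowser` would use, or None if none is found."""
    try:
        controller = webbrowser.get()
    except webbrowser.Error:
        return None
    # Some controller classes lack a `name` attribute; fall back to the class name.
    name = getattr(controller, "name", "") or type(controller).__name__
    # The name may be a full command line such as "/usr/bin/w3m %s".
    return os.path.basename(str(name).split()[0])


def is_text_browser(name):
    """True if `name` is one of the known terminal-only browsers."""
    return name is not None and name.lower() in TEXT_BROWSERS


browser = default_browser_name()
if browser is None:
    print("No browser found; figures that open in a browser will not display.")
elif is_text_browser(browser):
    print(f"Default browser is text-mode ({browser}); use a notebook instead.")
else:
    print(f"Default browser: {browser}")
```

If this reports a text-mode browser or none at all, a Jupyter notebook is the more practical place to view the plots.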

Adapting Quickstart to NCAR's Jupyterhub.

Make soft links, from any directories or files in /scratch which I want to use,
into ~$user.
https://jupyterhub.hpc.ucar.edu/stable/hub/home  is the place to start.
I chose Default.  It opened a new tab.
I chose 'casper login' to use that resource (for this low intensity activity).
It took me to a page showing me lots of apps in a "Notebooks" section,
the same apps in ">_ Console" section, and a few in "$_ Other"
such as "Terminal" and a bunch of different file formats; text, julia, R, ...
I chose the Python3 notebook (based on it being the first listed).

Jupyterhub complaint: I didn't know how to load a module, so I opened Help: Jupyterlab Reference. I wanted to search for "load module", but I couldn't enter text in the Search bubble, and the <ctrl>-k it suggests doesn't do anything. Get Started has no 'Search' or 'module' entries. The User Guide has Searching(!). It says to use <cmd>-f on Macs for the built-in shortcut. That makes the Firefox Edit button flash, but nothing appears which would let me enter text. The <ctrl>-f box is still visible in the lower left, but that's (?) the browser page search, which is supposedly different.

[]: import pydartdiags        This enabled the `from` commands to work.
[]: from pydartdiags.obs_sequence import obs_sequence as obsq
[]: from pydartdiags.plots import plots
File needs to be an ASCII obs_seq.final file.
[]: obs_seq = obsq.obs_sequence('/glade/derecho/scratch/raeder/OSSE_BNRH_3mem/run/OSSE_BNRH_3mem.dart.e.cam_obs_seq_final.2018-01-01-21600')
[]: obs_seq.df.head()     works
[]: df_qc0 = obsq.select_by_dart_qc(obs_seq.df, 0)
[]: plots.plot_rank_histogram(df_qc0)
No visible errors, but nothing shows in the space below the command.
Then I get a new command line.
Same for
[]: df_profile, figrmse, figbias = plots.plot_profile(df_qc0, plevels)

I saved the notebook into '~raeder/from_works_no_pix.ipynb'.

Another jupyterhub complaint:
the file browser on the left is stuck on the status from when I first started the jupyterhub tab. The refresh button doesn't help. I found the file by searching for its name.

I opened 'from_works_no_pix.ipynb' and the pictures appeared (!). The rank histogram seemed to show no data. It turned out that I needed to click on the variable I wanted to see. Then I stumbled on a way to compact the count axis to see the tops near 5000. Choosing both vars stacks one on top of the other, which was surprising.

A third jupyterhub complaint: "Magic" commands, such as "%load", are not found using help(%load) or any variant of that I could think of. I had to look for the commands online.