benbovy / PyGChem

A Python interface to the GEOS-Chem Model, a global 3-D chemical transport model (CTM) for atmospheric composition
GNU General Public License v3.0

Backend for loading xarray Datasets from BPCH files #6

Open darothen opened 7 years ago

darothen commented 7 years ago

WORK IN PROGRESS!!

these are dev notes while I work on this, but I welcome pulls/reviews/comments

I thought it would be nice to go ahead and open a Pull Request documenting some of this work. I mostly built off the scaffolding already available for reading data from BPCH files into blobs (BPCHDataProxy), and tied it together so that skimming/parsing a file is fast and data loading can be deferred until necessary.

This works fine with some basic datasets. Datasets with multiple lev coordinates aren't yet supported (that should be easy enough to add), and there's a litany of features still to implement, in particular reading the GEOS grids with more metadata attached to them.

The biggest issue is that the available infrastructure is extremely inefficient when it comes to loading files; basically, the entire file needs to be read into memory before any data can be accessed. One way I'm getting around this is to use the NDArrayMixin from xarray, which lets you slice into BPCHDataProxy objects... but that still eagerly loads the data. I added a load_memmap method to BPCHDataProxy which hooks into dask to get around this, and it's fantastic - I can read a ~2 GB BPCH file nearly instantly, and do operations on just the slices and subsets of data that I need.
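To illustrate the memory-mapping idea, here is a minimal sketch, not the actual load_memmap implementation. The helper name memmap_block, the fake file layout, and the assumption that a block is big-endian float32 in Fortran order are all mine, for illustration only; the real offsets and shapes would come from parsing the BPCH headers.

```python
import os
import tempfile

import numpy as np


def memmap_block(path, offset, shape, dtype=">f4"):
    """Map one data block of a binary file without reading it into memory.

    Hypothetical helper: `offset` is the byte position where the block's
    values start, and the block is assumed to be stored in Fortran order.
    """
    return np.memmap(path, dtype=dtype, mode="r", offset=offset,
                     shape=shape, order="F")


# Build a small fake file: a 16-byte dummy header followed by one 4x3 block.
data = np.arange(12, dtype=">f4").reshape((4, 3), order="F")
path = os.path.join(tempfile.mkdtemp(), "fake_block.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 16)            # stand-in for header bytes
    f.write(data.tobytes(order="F"))  # block values, column-major

# Creating the memmap only records where the data lives on disk.
block = memmap_block(path, offset=16, shape=(4, 3))

# Only the bytes backing this slice need to be paged in.
subset = np.asarray(block[1:3, 0])
print(subset)
```

A memmap like this can also be handed to dask.array.from_array, so that computations are built lazily and only pull in the chunks they need.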

A long-term solution would be to write an interface to the entire BPCH file instead of building chunks. The reason is that the output in a BPCH file is not contiguous; different timesteps are saved in different places. I need to build some more scaffolding to manage this; my naive approach works, but again eager loading kills performance, since you end up having to read the entire dataset to construct the array that you want. Storing each memmapped array separately in a given BPCHDataProxy instance might get around this, with the help of some fancy logic in __getitem__. Really, anything that bypasses reading all the data at once will work.
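A rough sketch of what that "separate blocks plus indexing logic" idea could look like. The class LazyTimeStack is hypothetical and is not part of PyGChem or BPCHDataProxy; plain NumPy arrays stand in for the per-timestep memmapped arrays. The first index selects timesteps, and only the selected blocks are ever touched.

```python
import numpy as np


class LazyTimeStack:
    """Hypothetical wrapper presenting per-timestep arrays as one
    (time, ...) array, without concatenating them up front."""

    def __init__(self, blocks):
        self._blocks = list(blocks)
        self.shape = (len(self._blocks),) + tuple(self._blocks[0].shape)
        self.dtype = self._blocks[0].dtype

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            key = (key,)
        time_key, rest = key[0], key[1:]

        if isinstance(time_key, (int, np.integer)):
            # A single timestep: index exactly one block.
            return np.asarray(self._blocks[time_key][rest])

        if isinstance(time_key, slice):
            # A range of timesteps: index only the blocks in the range.
            # (Empty slices are not handled in this sketch.)
            steps = range(*time_key.indices(len(self._blocks)))
            return np.stack(
                [np.asarray(self._blocks[i][rest]) for i in steps]
            )

        raise TypeError("time index must be an int or a slice")


# Four fake timesteps, each a 2x3 field filled with its timestep number.
blocks = [np.full((2, 3), t, dtype="f4") for t in range(4)]
stack = LazyTimeStack(blocks)

print(stack.shape)        # overall (time, ...) shape
print(stack[1:3, 0, :])   # touches only timesteps 1 and 2
print(stack[-1, 1, 2])    # touches only the last timestep
```

With memmapped blocks in place of the in-memory arrays, a selection like stack[1:3, 0, :] would read only the requested rows of two timesteps instead of the whole dataset.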