CanadianClimateDataPortal / Canadian-Climate-Data-Portal


Test THREDDS' OPeNDAP subset to find memory size limits #55

Closed tomLandry closed 5 years ago

tomLandry commented 5 years ago

It is known that THREDDS breaks on OPeNDAP queries that return large chunks of data. Alternatives exist, for instance PyDAP. Changing a key component at this point is seen as risky, as we have great "deployability" and relatively good know-how and support capabilities for THREDDS.

The idea here is to devise and run tests that progressively stress the Ouranos server (and eventually the one at CRIM) with bigger and bigger bounding boxes and/or longer temporal ranges, up to the breaking point. We would then apply a margin of safety and publish the result as limitations on the portal. "File size" limits related to subsets would thus be tied to memory limits; file size limits for large HTTPS downloads are seen as a lot less risky.

tomLandry commented 5 years ago

Once this is known, a "simple" GIS overlap between a grid whose cells (bounding boxes, really) are of the maximum size and the provinces should do the trick: get all boxes inside or touching the Quebec polygon, and so on. We can keep the resulting lists as static resources; see the sketch below.
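A minimal sketch of that overlap, for illustration only (assumptions: a hypothetical provinces.geojson with a "name" column, a hypothetical maximum box size of 4 degrees, and geopandas/shapely as the tooling; not the portal's actual code):

import geopandas as gpd
from shapely.geometry import box

provinces = gpd.read_file("provinces.geojson")   # hypothetical source file
quebec = provinces.loc[provinces["name"] == "Quebec", "geometry"].iloc[0]

max_deg = 4.0                                    # hypothetical maximum box size (degrees)
min_lon, min_lat, max_lon, max_lat = quebec.bounds

boxes = []
lon = min_lon
while lon < max_lon:
    lat = min_lat
    while lat < max_lat:
        cell = box(lon, lat, lon + max_deg, lat + max_deg)
        if cell.intersects(quebec):              # keep boxes inside or touching the polygon
            boxes.append(cell.bounds)
        lat += max_deg
    lon += max_deg

# 'boxes' could then be saved as a static resource for the portal.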

tlogan2000 commented 5 years ago

Initial tests accessing raw daily values of BCCAQv2 via OPeNDAP indicate a limit of approximately 45 x 45 grid cells for all time steps (approximately 111 million values):


import xarray as xr
import numpy as np

url1 = 'http://boreas.ouranos.ca:8083/thredds/dodsC/birdhouse/pcic/BCCAQv2/tasmax_day_BCCAQv2+ANUSPLIN300_MPI-ESM-LR_historical+rcp45_r1i1p1_19500101-21001231.nc'

ds = xr.open_dataset(url1)

# Try progressively smaller square subsets (60 x 60 down to 45 x 45 grid cells)
# until one loads successfully over OPeNDAP.
for i in np.arange(60, 40, -5):
    print('loading data ....')
    dsSub = ds.isel(lon=slice(500, 500 + i), lat=slice(100, 100 + i))
    try:
        dsSub.load()
        print(i, ' x ', i, ' subset successful for  ', len(dsSub.time), ' timesteps.  Total data size : ',
              dsSub[list(ds.data_vars)[0]].size)
        break
    except Exception:
        print(i, ' x ', i, ' subset failed for  ', len(dsSub.time), ' timesteps.  Total data size : ',
              dsSub[list(ds.data_vars)[0]].size)

Output:
loading data ....
60  x  60  subset failed for   55152  timesteps.  Total data size :  198547200
loading data ....
55  x  55  subset failed for   55152  timesteps.  Total data size :  166834800
loading data ....
50  x  50  subset failed for   55152  timesteps.  Total data size :  137880000
loading data ....
45  x  45  subset successful for   55152  timesteps.  Total data size :  111682800

tlogan2000 commented 5 years ago

I do not know the cache size of the THREDDS server, but we could look at changing it as well. See: https://www.opendap.org/index.php/support/faq/server/large-files-fail

tomLandry commented 5 years ago

Link for THREDDS performance: https://www.unidata.ucar.edu/software/thredds/current/tds/reference/Performance.html

"OPeNDAP Memory Use: The OPeNDAP-Java layer of the server currently has to read the entire data request into memory before sending it to the client (we hope to get a streaming I/O solution working eventually). Generally clients only request subsets of large files, but if you need to support large data requests, make sure that the -Xmx parameter above is set accordingly."

tomLandry commented 5 years ago

So, worst-case scenario, @tlogan2000: at a ~10 km grid, that would mean roughly 400 km squares?

tlogan2000 commented 5 years ago

So, worst-case scenario, @tlogan2000: at a ~10 km grid, that would mean roughly 400 km squares?

Exactly.

If people are accessing the indices (fewer time steps by at least a factor of ~30: daily to monthly), the spatial grid could obviously be much bigger; see the rough calculation below.
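A back-of-the-envelope sketch of those numbers (assumptions: the ~45 x 45 cell limit found above, ~10 km cells, and ~30x fewer time steps for monthly indices; the figures are illustrative):

import math

daily_side_cells = 45                      # per-side limit found above for daily data
cell_km = 10                               # approximate BCCAQv2/ANUSPLIN300 resolution
print(daily_side_cells * cell_km)          # ~450 km per side for daily data

timestep_ratio = 30                        # daily -> monthly
monthly_side_cells = daily_side_cells * math.sqrt(timestep_ratio)
print(round(monthly_side_cells))           # ~246 cells, i.e. roughly 2500 km per side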

dbyrns commented 5 years ago

@davidcaron will do some stress tests on top of that to get an idea of what happens when multiple requests are done simultaneously.

davidcaron commented 5 years ago

@tlogan2000 I got similar results to yours, except that I could load data up to a size of 60 x 60 before I got an error.

Here are 2 more things I tried:

Multiple 20 x 20 requests

I ran 40 threads each requesting a 20 x 20 slice:

import xarray as xr
from multiprocessing.pool import ThreadPool

url1 = 'http://boreas.ouranos.ca:8083/thredds/dodsC/birdhouse/pcic/BCCAQv2/tasmax_day_BCCAQv2+ANUSPLIN300_MPI-ESM-LR_historical+rcp45_r1i1p1_19500101-21001231.nc'


def test_load():
    def func():
        # Each thread opens the dataset and loads its own 20 x 20 slice over OPeNDAP.
        ds = xr.open_dataset(url1)

        size = 20
        dsSub = ds.isel(lon=slice(500, 500 + size), lat=slice(100, 100 + size))

        print("loading")
        dsSub.load()

    n_threads = 40
    pool = ThreadPool(processes=n_threads)
    for _ in range(n_threads):
        pool.apply_async(func)

    pool.close()
    pool.join()

    print("ended")

And that completed successfully. So the total size for these requests is 40 * 20 * 20 = 16,000 grid cells, roughly equivalent to a single 125 x 125 request. So it seems that the requests are not cumulative on the server side...

Use chunks

I added chunk sizes when opening the dataset:

ds = xr.open_dataset(url1, chunks={"lat": 10, "lon": 10})

I ran the same script as you did, and it successfully processed a 150 x 150 subset. So that seems to help. From what I understand, the chunk size is the maximum size requested at any one time from the client side.
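For reference, the full chunked variant looks roughly like this (same url1 and slice indices as in the tests above):

import xarray as xr

url1 = 'http://boreas.ouranos.ca:8083/thredds/dodsC/birdhouse/pcic/BCCAQv2/tasmax_day_BCCAQv2+ANUSPLIN300_MPI-ESM-LR_historical+rcp45_r1i1p1_19500101-21001231.nc'

# With chunks, dask issues many small OPeNDAP requests (10 x 10 cells each)
# instead of one large request that the server must hold in memory at once.
ds = xr.open_dataset(url1, chunks={"lat": 10, "lon": 10})

size = 150
dsSub = ds.isel(lon=slice(500, 500 + size), lat=slice(100, 100 + size))
dsSub.load()
print(dsSub[list(ds.data_vars)[0]].size)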

tlogan2000 commented 5 years ago

Ok, good to know.
Re: chunking, I had thought of that as well, but I was unsure whether portal users would be accessing the data using xarray/Python. This was my simple/default way to test the OPeNDAP limit of the THREDDS server, but I think many users of the CCCS portal will not necessarily be Python experts and will simply be looking for a 'download' option.

tlogan2000 commented 5 years ago

If I understand correctly, the multiprocessing tests would simulate ~40 users accessing OPeNDAP at once? Or, I suppose, a single user simultaneously accessing ~40 different .nc files?

davidcaron commented 5 years ago

I looked at the error logs on the THREDDS server, and there is no more information than "OutOfMemoryError".

Ok, good to know. Re: chunking, I had thought of that as well, but I was unsure whether portal users would be accessing the data using xarray/Python. This was my simple/default way to test the OPeNDAP limit of the THREDDS server, but I think many users of the CCCS portal will not necessarily be Python experts and will simply be looking for a 'download' option.

I don't know if there would be a way to set a maximum chunk size from the portal...

If I understand correctly, the multiprocessing tests would simulate ~40 users accessing OPeNDAP at once? Or, I suppose, a single user simultaneously accessing ~40 different .nc files?

Yes, more like a single user requesting the same file 40 times. There is still the possibility that the responses are generated only once by the server. The next thing would be to try requesting 40 different files.

tomLandry commented 5 years ago

I think we have sufficient information here. We should move this back to Confluence when we scramble on the data download features (post-March). I'll open a new issue solely for tweaking the THREDDS server. Maybe @ldperron can help with this in CRIM's environment, as an exploratory task?