NCAS-CMS / PyActiveStorage

Python implementation of Active Storage

[e2eTESTING] V tests: Kerchunk vs Pyfive engines #191

Open valeriupredoi opened 6 months ago

valeriupredoi commented 6 months ago

Local tests on V's computer

Test code:

import os
import numpy as np

from activestorage.active import Active

S3_ACTIVE_URL_Bryan = "https://192.171.169.248:8080"
S3_BUCKET = "bnl"

def gold_test():
    """Run somewhat as the 'gold' test."""
    storage_options = {
        # Credentials are read from the environment rather than hardcoded;
        # the variable names S3_ACCESS_KEY and S3_SECRET_KEY are hypothetical.
        'key': os.environ["S3_ACCESS_KEY"],
        'secret': os.environ["S3_SECRET_KEY"],
        'client_kwargs': {'endpoint_url': "https://uor-aces-o.s3-ext.jc.rl.ac.uk"},
    }
    active_storage_url = S3_ACTIVE_URL_Bryan  # reuse the module-level constant
    bigger_file = "ch330a.pc19790301-bnl.nc"

    test_file_uri = os.path.join(
        S3_BUCKET,
        bigger_file
    )
    print("S3 Test file path:", test_file_uri)
    active = Active(test_file_uri, 'UM_m01s16i202_vn1106', storage_type="s3",
                    storage_options=storage_options,
                    active_storage_url=active_storage_url)
    # old test with 3GB file
    # active2 = Active(test_file_uri, 'm01s06i247_4', storage_type="s3",
    #                 storage_options=storage_options,
    #                 active_storage_url=active_storage_url)

    active._version = 1
    active._method = "min"

    result = active[:]
    # result = active[0:3, 4:6, 7:9]  # standardized slice

    print("Result is", result)
    return result

Kerchunk is restricted to the dataset of interest:

Looking only at a single Dataset <HDF5 dataset "UM_m01s16i202_vn1106": shape (40, 1920, 2560), type "<f4">

Chunks

Both Kerchunk and Pyfive send a slightly variable number of chunks to Reductionist (give or take 5 to 10 between runs); the count is roughly 3360 chunks.
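
As a rough sanity check on chunk counts like the one above, the number of chunks in a dataset follows from its shape and chunk shape by ceiling division along each axis. The sketch below uses the dataset shape quoted earlier. The chunk shape is purely hypothetical (the thread does not state it, and this one does not reproduce the observed count of roughly 3360), and the helper names `count_chunks` and `chunks_touched` are made up for illustration.

```python
def count_chunks(shape, chunk_shape):
    """Total number of chunks needed to tile `shape` with `chunk_shape`."""
    total = 1
    for n, c in zip(shape, chunk_shape):
        total *= -(-n // c)  # integer ceiling division
    return total


def chunks_touched(selection, chunk_shape):
    """Number of chunks intersected by a selection of (start, stop) pairs."""
    total = 1
    for (start, stop), c in zip(selection, chunk_shape):
        first = start // c        # index of the first chunk hit on this axis
        last = (stop - 1) // c    # index of the last chunk hit on this axis
        total *= last - first + 1
    return total


shape = (40, 1920, 2560)      # dataset shape quoted in the thread
chunk_shape = (1, 160, 320)   # HYPOTHETICAL chunk shape, not from the thread

n_total = count_chunks(shape, chunk_shape)
n_slice = chunks_touched([(0, 3), (4, 6), (7, 9)], chunk_shape)
print("chunks in dataset:", n_total)
print("chunks touched by [0:3, 4:6, 7:9]:", n_slice)
```

With any chunking of this kind, a small slice such as [0:3, 4:6, 7:9] touches only a handful of chunks, whereas `active[:]` touches all of them.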

Kerchunk-based pipeline

Result is 4677.8594 (stable)

Kerchunk indexing and JSON file writing times:

Pyfive-based pipeline

Result is 4677.8594 (stable)

Sliced Kerchunk (slice [0:3, 4:6, 7:9])

Sliced Pyfive (slice [0:3, 4:6, 7:9])

valeriupredoi commented 6 months ago

On JASMIN/sci2


CPU:

     *-cpu
          product: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
          vendor: Intel Corp.
          vendor_id: GenuineIntel
          physical id: 1
          bus info: cpu@0
          version: 6.58.0
          width: 64 bits

Kerchunk-based pipeline

Result is 4677.8594 (stable)

Kerchunk indexing and JSON file writing times:

Pyfive-based pipeline

valeriupredoi commented 6 months ago

Question no 1

Answer

@bnlawrence suggests the difference comes from chunking, and he is correct: the field in the 2.8G file has 30 chunks, while the other field has 3400 chunks. That's the penalty factor right there!
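
To put a number on that penalty, here is a back-of-the-envelope sketch using only the two chunk counts quoted above. It assumes, as a simplification not confirmed in this thread, that the cost grows roughly linearly with the number of chunks sent to Reductionist.

```python
chunks_small_field = 30     # field in the 2.8G file (from the thread)
chunks_large_field = 3400   # the other field (from the thread)

# ASSUMPTION: per-chunk overhead dominates, so cost scales with chunk count.
ratio = chunks_large_field / chunks_small_field
print(f"about {ratio:.0f}x more chunks to process")
```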

valeriupredoi commented 6 months ago

Use -def file (64 HDF5 chunks)

Kerchunk-based pipeline

My computer (UoR network etc)

Kerchunk indexing and JSON file writing times:

Time before going into Reductionist:

Pyfive-based pipeline

Time before going into Reductionist:

valeriupredoi commented 6 months ago

So it's starting to look like this:

Kerchunk-based pipeline

Kerchunk indexer:

To (network) and at Reductionist

Total time
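
A breakdown like the one above (indexing, network plus Reductionist, total) can be collected with a small timing helper. This is a minimal sketch, not the instrumentation actually used in these tests: `timed`, `build_index` and `reduce_chunks` are hypothetical names, and the two functions are stand-ins that only sleep.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> accumulated wall-clock seconds


@contextmanager
def timed(label):
    """Accumulate the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed


def build_index():
    """HYPOTHETICAL stand-in for the indexing phase."""
    time.sleep(0.01)


def reduce_chunks():
    """HYPOTHETICAL stand-in for the network + Reductionist phase."""
    time.sleep(0.02)


with timed("total"):
    with timed("indexing"):
        build_index()
    with timed("network + Reductionist"):
        reduce_chunks()

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Wrapping each phase separately makes it easy to see whether the time goes into building the index or into the per-chunk requests.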