NOAA-EMC / bufr-query


The output NetCDF files are not identical between runs with MPI and without MPI #14

Closed: emilyhcliu closed this issue 3 weeks ago

emilyhcliu commented 1 month ago

@rmclaren I tested converting the satwind BUFR file using bufr2netcdf.x with and without MPI.
The output files from the MPI and non-MPI runs are not identical.

Output files from the run without MPI:

17067320  gdas.t00z.satwind_abi_goes-16.tm00.nc --- data count: 965550
30718782  gdas.t00z.satwind_abi_goes-17.tm00.nc --- data count: 1799711
   18687  gdas.t00z.satwind_abi_goes-18.tm00.nc --- data count: 0

Output files from the MPI run (srun -n 12):

57349351  gdas.t00z.satwind_abi_goes-16.tm00.nc --- data count: 3115825
76874918  gdas.t00z.satwind_abi_goes-17.tm00.nc --- data count: 4301847
   18687  gdas.t00z.satwind_abi_goes-18.tm00.nc --- data count: 0

Also, MPI runs with 2, 4, or 8 processors did not run to completion; they hung until the node allocation time ran out. The run with 12 processors completed, but its output files do not match the non-MPI output.

The input, configuration, mapping, and script files can be found on HERA:

/scratch1/NCEPDEV/da/Emily.Liu/EMC-bufr-query/run_satwind
./process_bufr2netcdf 
./process_bufr2netcdf_mpi  4
./process_bufr2netcdf_mpi 12

The CrIS and IASI output files are consistent between the MPI and non-MPI runs, but the satwind output files are not. A variable-level comparison, sketched below, can show exactly where the files diverge.
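
For reference, here is a minimal comparison sketch using the netCDF4 Python package; the directory names are assumptions based on the run layout above, and the recursion simply walks whatever groups the files contain:

import netCDF4 as nc
import numpy as np

def compare_groups(g1, g2, path=""):
    # Compare every variable present in both groups, element by element.
    for name, v1 in g1.variables.items():
        if name not in g2.variables:
            print(f"missing in second file: {path}/{name}")
            continue
        a, b = v1[:], g2.variables[name][:]
        if a.shape != b.shape or not np.ma.allequal(a, b):
            print(f"differs: {path}/{name}")
    # Recurse into subgroups (e.g. MetaData, ObsValue).
    for name, sub in g1.groups.items():
        if name in g2.groups:
            compare_groups(sub, g2.groups[name], f"{path}/{name}")

f1 = nc.Dataset("bufr_backend/gdas.t00z.satwind_abi_goes-16.tm00.nc")
f2 = nc.Dataset("bufr_backend_mpi12/gdas.t00z.satwind_abi_goes-16.tm00.nc")
compare_groups(f1, f2)
f1.close()
f2.close()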

The satwind mapping file is similar to the CrIS and IASI mappings, except that satwind defines multiple subsets, as follows:

bufr:
  subsets:
    - NC005030
    - NC005031
    - NC005032
    - NC005034
    - NC005039

  variables:
    # MetaData
    timestamp:
      datetime:
        year: "*/YEAR"
        month: "*/MNTH"
        day: "*/DAYS"
        hour: "*/HOUR"
        minute: "*/MINU"
        second: "*/SECO"

    latitude:
      query: "*/CLATH"

    longitude:
      query: "*/CLONH"

    satelliteId:
      query: "*/SAID"

    satelliteZenithAngle:
      query: "*/SAZA"

    # ...

Is it possible that MPI is not implemented correctly for data mappings with multiple subsets defined? The MPI data counts above are several times the non-MPI counts, which suggests observations are being read or written more than once. A serial baseline count could be obtained as sketched below.
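
A minimal sketch of that baseline check, assuming the bufr.Parser interface shown in the repository's Python examples; the input and mapping file names here are placeholders, not the actual run files:

import bufr

DATA_PATH = "gdas.t00z.satwnd.tm00.bufr_d"   # placeholder BUFR input name
YAML_PATH = "bufr_satwind_mapping.yaml"      # placeholder mapping file name

# Parse serially (no MPI) and count the resulting observations; the MPI
# per-rank counts should sum to exactly this number.
container = bufr.Parser(DATA_PATH, YAML_PATH).parse()
lat = container.get("variables/latitude")
print(f"serial observation count: {lat.shape[0]}")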

rmclaren commented 1 month ago

Thanks, I'll take a look

emilyhcliu commented 1 month ago

@rmclaren I tested satwind again with MPI using your latest update on the feature/in_parallel branch.

Test with MPI = 12

ls -l ../bufr_backend_mpi12/*
-rw-r--r-- 1 Emily.Liu da 17067320 Jul 24 16:22 gdas.t00z.satwind_abi_goes-16.tm00.nc
-rw-r--r-- 1 Emily.Liu da 30718782 Jul 24 16:22 gdas.t00z.satwind_abi_goes-17.tm00.nc
-rw-r--r-- 1 Emily.Liu da    18687 Jul 24 16:22 gdas.t00z.satwind_abi_goes-18.tm00.nc

Test without MPI

ls -l ../bufr_backend/*
-rw-r--r-- 1 Emily.Liu da 17067320 Jul 24 16:19 ../bufr_backend/gdas.t00z.satwind_abi_goes-16.tm00.nc
-rw-r--r-- 1 Emily.Liu da 30718782 Jul 24 16:19 ../bufr_backend/gdas.t00z.satwind_abi_goes-17.tm00.nc
-rw-r--r-- 1 Emily.Liu da    18687 Jul 24 16:19 ../bufr_backend/gdas.t00z.satwind_abi_goes-18.tm00.nc

The file sizes and observation counts of the output files now match between the MPI and non-MPI runs.

I will check whether the results are reproducible across different MPI task counts (a quick checksum check is sketched below), and I will also test the script backend with and without MPI. I will report back soon.
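
A simple way to verify this is to hash each output file and confirm the digests agree run to run. A minimal sketch; the run directory names follow the listings below and are otherwise assumptions:

import hashlib
from pathlib import Path

RUN_DIRS = ["bufr_backend", "bufr_backend_mpi4", "bufr_backend_mpi8",
            "bufr_backend_mpi12", "bufr_backend_mpi24"]

def sha256(path):
    # Stream the file in 1 MiB chunks to keep memory use flat.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

reference = {p.name: sha256(p) for p in Path(RUN_DIRS[0]).glob("*.nc")}
for run in RUN_DIRS[1:]:
    for name, digest in reference.items():
        ok = sha256(Path(run) / name) == digest
        print(f"{run}/{name}: {'match' if ok else 'DIFFERS'}")

Identical digests imply identical files; if sizes match but digests differ, a variable-level comparison (as sketched earlier) can localize the difference.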

emilyhcliu commented 1 month ago

@rmclaren The output files are reproducible across MPI task counts of 4, 8, 12, and 24:

[Emily.Liu@hfe06 bufr_backend]$ ls -l
total 46688
-rw-r--r-- 1 Emily.Liu da 17067320 Jul 24 16:19 gdas.t00z.satwind_abi_goes-16.tm00.nc
-rw-r--r-- 1 Emily.Liu da 30718782 Jul 24 16:19 gdas.t00z.satwind_abi_goes-17.tm00.nc
-rw-r--r-- 1 Emily.Liu da    18687 Jul 24 16:19 gdas.t00z.satwind_abi_goes-18.tm00.nc
[Emily.Liu@hfe06 bufr_backend]$ ls -l ../bufr_backend_mpi4
total 46688
-rw-r--r-- 1 Emily.Liu da 17067320 Jul 24 16:30 gdas.t00z.satwind_abi_goes-16.tm00.nc
-rw-r--r-- 1 Emily.Liu da 30718782 Jul 24 16:30 gdas.t00z.satwind_abi_goes-17.tm00.nc
-rw-r--r-- 1 Emily.Liu da    18687 Jul 24 16:30 gdas.t00z.satwind_abi_goes-18.tm00.nc
[Emily.Liu@hfe06 bufr_backend]$ ls -l ../bufr_backend_mpi8
total 46688
-rw-r--r-- 1 Emily.Liu da 17067320 Jul 24 16:30 gdas.t00z.satwind_abi_goes-16.tm00.nc
-rw-r--r-- 1 Emily.Liu da 30718782 Jul 24 16:30 gdas.t00z.satwind_abi_goes-17.tm00.nc
-rw-r--r-- 1 Emily.Liu da    18687 Jul 24 16:30 gdas.t00z.satwind_abi_goes-18.tm00.nc
[Emily.Liu@hfe06 bufr_backend]$ ls -l ../bufr_backend_mpi12
total 46688
-rw-r--r-- 1 Emily.Liu da 17067320 Jul 24 16:22 gdas.t00z.satwind_abi_goes-16.tm00.nc
-rw-r--r-- 1 Emily.Liu da 30718782 Jul 24 16:22 gdas.t00z.satwind_abi_goes-17.tm00.nc
-rw-r--r-- 1 Emily.Liu da    18687 Jul 24 16:22 gdas.t00z.satwind_abi_goes-18.tm00.nc
[Emily.Liu@hfe06 bufr_backend]$ ls -l ../bufr_backend_mpi24
total 46688
-rw-r--r-- 1 Emily.Liu da 17067320 Jul 24 16:31 gdas.t00z.satwind_abi_goes-16.tm00.nc
-rw-r--r-- 1 Emily.Liu da 30718782 Jul 24 16:31 gdas.t00z.satwind_abi_goes-17.tm00.nc
-rw-r--r-- 1 Emily.Liu da    18687 Jul 24 16:31 gdas.t00z.satwind_abi_goes-18.tm00.nc
[Emily.Liu@hfe06 bufr_backend]$ 

The run with 12 MPI tasks finished in 12 seconds; the run without MPI took 144 seconds, a 12x speedup.

emilyhcliu commented 1 month ago

The script backend for satwind also worked with MPI. With 24 processors it took 15 seconds, 3 seconds more than the bufr backend, because Python is involved and three new variables were added.

This is great! Good job!!

rmclaren commented 1 month ago

@emilyhcliu Thanks for your patience and for testing everything :)