NCAR / xtar

Reading netCDF tar archives with xarray/zarr
MIT License

NFS/Lustre/GPFS, first quick test with mounting worked #4

Open tinaok opened 4 years ago

tinaok commented 4 years ago

Mounting a tar archive with ratarmount on NFS, Lustre & GPFS worked on DATARMOR.
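
For context, a minimal sketch of this workflow; the archive name, mount point, and chunking below are placeholders, not the exact setup used on DATARMOR:

```python
# Mount the tar archive first (shell), e.g.:
#   ratarmount archive_of_netcdf_files.tar /path/to/mountpoint
# (archive name and mount point are placeholders)

import xarray as xr

# Open all netCDF files exposed through the mounted tarball as one dataset.
# The chunking is illustrative and should match the files' actual layout.
ds = xr.open_mfdataset(
    "/path/to/mountpoint/*.nc",
    combine="by_coords",
    parallel=True,
    chunks={"time": 100},
)
print(ds)
```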

andersy005 commented 4 years ago

@tinaok, this is looking great! I am curious: do you know why operations on an xarray dataset constructed from the mounted tar archive are faster than the same operations on the original netCDF files?

[Four screenshots comparing the two cases, 2020-05-11]

Cc @kmpaul

kmpaul commented 4 years ago

It is probably the file system cache. In the mounted-tar case, a single tarball is cached for many accesses. How many MPI ranks per node?

andersy005 commented 4 years ago

> It is probably the file system cache. In the mounted-tar case, a single tarball is cached for many accesses.

Cool! Thank you for the clarification!

> How many MPI ranks per node?

Do you mean Dask workers per node instead of MPI ranks?

[Screenshot: Dask cluster infobox, 2020-05-11]

kmpaul commented 4 years ago

@andersy005 @tinaok Yes. I meant Dask workers per node. It's unclear to me from the Cluster infobox how many workers are on a node.

tinaok commented 4 years ago

Hi, the compute nodes I used at IFREMER have 28 cores each. I placed 28 workers per node.


kmpaul commented 4 years ago

That explains the speedup, then.

With multiple netCDF files tarballed up into one file, that whole file should be cached when it is first read. Every subsequent attempt to read that file is served from the cache, since all of those attempts happen on the same node with the same memory.

With the netCDF files kept separate (i.e., not tarballed together), each file is cached when it is first read, but it is never read again, so it is evicted from the cache to make room for the next file that is read. Hence, the cache is never reused.

This is a really good result! If you want to see only the overhead from the mounted tarball, though, you would need to distribute 1 worker per node. Or somehow turn off the cache, which you may not have the ability to do.
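
One rough way to see this cache effect directly is to time a cold read of a file against an immediately repeated read; the second read should be served largely from the node's page cache. A minimal sketch, assuming a placeholder file path inside the mounted tarball:

```python
import time

import xarray as xr


def timed_read(path):
    """Open a netCDF file, force all variables into memory, return elapsed seconds."""
    start = time.perf_counter()
    with xr.open_dataset(path) as ds:
        ds.load()  # actually read the data, not just the metadata
    return time.perf_counter() - start


# Placeholder path to one file inside the mounted tarball.
path = "/path/to/mountpoint/some_file.nc"

cold = timed_read(path)  # first read goes to the filesystem
warm = timed_read(path)  # repeated read should come mostly from the page cache
print(f"cold read: {cold:.2f} s, warm read: {warm:.2f} s")
```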

kmpaul commented 4 years ago

Obviously, though, this assumes that a single tarball is small enough to fit in memory...?

tinaok commented 4 years ago

Yes, maybe with the implementation of https://github.com/NCAR/xtar/issues/5 the cache effect could be benchmarked as well? We would need the cache-memory specifications of the backend controllers for GPFS/Lustre.


kmpaul commented 4 years ago

@tinaok: Yeah, dealing with caching is a regular problem when trying to do I/O benchmarks. One easy way of eliminating the issue is to use 1 Dask worker per compute node. That is horribly inefficient (27 cores go unused!), but it prevents any filesystem caching from occurring. It also allows for potentially better scaling, but that's not really what we care about at this point.
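
For example, with dask-jobqueue on a PBS machine like DATARMOR, that 1-worker-per-node layout could be requested roughly as below; the queue name, memory, and resource_spec are placeholders, not the actual DATARMOR settings:

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# A sketch of "1 worker per node": reserve a whole 28-core node from PBS but
# start only a single, single-threaded Dask worker on it, so the tarball read
# by one worker is never re-served from the page cache to other workers on
# the same node. (The runs above used 28 workers per node instead.)
cluster = PBSCluster(
    cores=1,                                       # one thread -> one worker
    processes=1,                                   # a single Dask worker per job
    memory="100GB",                                # placeholder node memory
    resource_spec="select=1:ncpus=28:mem=100gb",   # still reserve the full node
    queue="mpi_1",                                 # placeholder queue name
    walltime="01:00:00",
)
cluster.scale(jobs=4)  # placeholder: number of nodes to request
client = Client(cluster)
```

Comparing such a run against the 28-workers-per-node run would help separate the mounted-tar overhead from the page-cache benefit.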