NCAR / ParallelIO

A high-level Parallel I/O Library for structured grid applications
Apache License 2.0

Async PIO performance #1768

Open EnricoDeg opened 3 years ago

EnricoDeg commented 3 years ago

Hi, I've implemented async I/O in a model that was already developed with PIO, but using the classical (intracomm) I/O mode, and I don't see any performance improvement. I checked and there is a synchronization point (a "PIOc_sync" call) at the very beginning of the output, and no other explicit synchronization calls after that. However, during the variable writes there are several calls to "PIOc_inq" to inquire the variable ID or the variable dimensions. My question is: does "PIOc_inq" have an implicit synchronization point, so that it should only be called before all the writing, or is it fine to call it between different calls to "PIOc_put_vard"? I hope the question is clear. Can you also give me some advice on getting the best performance in async mode? I didn't find details about that in the documentation and I would like to avoid reading through the library code.

jedwards4b commented 3 years ago

Although the pio_put_vard calls in async mode should be somewhat asynchronous, there are some synchronization points in each call. Note that these are at the beginning of each call and the actual I/O operation happens after them.
The real strength of PIO for parallel writes is in the pio_write_darray calls: when you use these, the data from multiple variables can be aggregated and performance further improved.
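
Roughly, the pattern looks like this (just a sketch against the C API; the file, variable IDs and decomposition are assumed to have been created already with the usual calls, and the names here are placeholders):

```c
#include <pio.h>

/* Sketch: buffer several decomposed variables, then flush once.
 * ncid, the varids and ioid are assumed to come from the usual
 * create-file / define-variable / init-decomposition setup. */
void write_step(int ncid, int varid_t, int varid_q, int ioid,
                PIO_Offset arraylen, double *field_t, double *field_q)
{
    /* Each call hands one variable's local slice to PIO; data from
     * multiple variables can be aggregated in the library's buffers. */
    PIOc_write_darray(ncid, varid_t, ioid, arraylen, field_t, NULL);
    PIOc_write_darray(ncid, varid_q, ioid, arraylen, field_q, NULL);

    /* The buffered data are moved to the IO tasks and written out
     * when the file is synced or closed. */
    PIOc_sync(ncid);
}
```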

edwardhartnett commented 3 years ago

Also, to answer your specific question, PIOc_inq does not do disk I/O. When a file is opened (or as it is created), PIO (and also netCDF) keeps a list of the file's metadata in memory. When PIOc_inq is called and it calls the underlying nc_inq(), there is no disk access and no synchronization.
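
For example, this kind of interleaving is fine, because the inquiry calls only read metadata that is already in memory (just a sketch; the setup of ncid, start/count and the data arrays is assumed, and PIOc_put_vara_double stands in for whichever put call you use):

```c
#include <pio.h>

/* Sketch: inquiry calls between writes only read the in-memory
 * metadata, so they add no disk access or synchronization of their
 * own. ncid, start, count and the data arrays come from elsewhere. */
void write_two_vars(int ncid, const PIO_Offset *start, const PIO_Offset *count,
                    const double *temp, const double *salt)
{
    int varid;

    PIOc_inq_varid(ncid, "temperature", &varid);  /* metadata lookup only */
    PIOc_put_vara_double(ncid, varid, start, count, temp);

    PIOc_inq_varid(ncid, "salinity", &varid);     /* again, no disk I/O */
    PIOc_put_vara_double(ncid, varid, start, count, salt);
}
```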

Also take a look in the tests for more async code to use as an example.

Performance is highly variable, so there can be a lot of tuning. Support for the MPE library is built into PIO, which can give you a lot of nice graphs if you want. But it's a fair amount of effort to understand and use MPE...

EnricoDeg commented 3 years ago

Thanks for the advice. The pio_write_darray function is working as expected. I still have a problem: closing the file is very slow compared to everything else (file creation, variable definition and data writing). Is there also an implicit synchronization point in the closing function, or is this related to something else? Any idea?

jedwards4b commented 3 years ago

Because the write_darray function accumulates data in memory, the file close and sync calls cause the accumulated variables to be written to disk, hence the slow time. There is a synchronization point at the end of the closefile function. I'm open to ideas about how to handle that better.

edwardhartnett commented 3 years ago

The way to check is to explicitly call PIOc_sync() before closing the file and see if that's where the delay occurs.
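
Something like this (a sketch; MPI_Wtime is used just for the timing):

```c
#include <stdio.h>
#include <mpi.h>
#include <pio.h>

/* Sketch: time the explicit sync and the close separately to see
 * which one the delay really belongs to. ncid is the open file. */
void timed_close(int ncid)
{
    double t0 = MPI_Wtime();
    PIOc_sync(ncid);        /* flush of the buffered data happens here */
    double t1 = MPI_Wtime();
    PIOc_closefile(ncid);   /* should now be comparatively cheap */
    double t2 = MPI_Wtime();

    printf("sync: %g s, close: %g s\n", t1 - t0, t2 - t1);
}
```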

EnricoDeg commented 3 years ago

Is there something that can be done to improve it? In my benchmark the file closure is 7 times more expensive than file creation, variable definition and data writing. The problem is that the overall I/O time is the same as in the standard case, so I don't see any benefit from async I/O with PIO.

jedwards4b commented 3 years ago

I'm sorry that you don't see any benefit from async IO - have you tried the synchronous implementation using PIO?

You might also consider implementing the file close call or the file sync call in a child thread so that your application can continue computing in the main thread while the file is written.
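
Something along these lines (just a sketch; it assumes MPI was initialized with MPI_THREAD_MULTIPLE so that MPI, and therefore PIO, may be called from a second thread):

```c
#include <pthread.h>
#include <pio.h>

/* Sketch: close the file from a child thread so the main thread can
 * keep computing while the buffered data are written. Assumes an
 * MPI_THREAD_MULTIPLE initialization. */
static void *close_file(void *arg)
{
    int ncid = *(int *)arg;
    PIOc_closefile(ncid);   /* blocks in this thread, not in main */
    return NULL;
}

void start_async_close(pthread_t *tid, int *ncid)
{
    /* Main thread continues computing; call pthread_join(*tid, NULL)
     * before reusing the file id or finalizing MPI. */
    pthread_create(tid, NULL, close_file, ncid);
}
```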

EnricoDeg commented 3 years ago

Well, since there is an implicit sync point in the PIOc_closefile function, I was not expecting to see a difference from the standard way. I don't know exactly what you mean by the synchronous implementation, but the model I'm working on was developed with PIO in intracomm mode and I wanted to extend it to async mode.

Your suggestion for solving this problem is interesting. I'm just wondering if I can use/modify an array that I used for writing before the sync is done. I don't know exactly how the data that must be written to the file are managed by the comp procs. I'm thinking that if there is a pointer to all the data that have to be written, then it might be problematic to change the values before those data are written, or at least sent to the IO procs.

jedwards4b commented 3 years ago

Once the call to write_darray is complete the array has been copied and is safe to modify.
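
So a pattern like this is safe (a sketch with placeholder names):

```c
#include <pio.h>

/* Sketch: the local array may be overwritten as soon as
 * PIOc_write_darray returns, because the library has already copied
 * the data into its own buffers. */
void step(int ncid, int varid, int ioid, PIO_Offset arraylen, double *field)
{
    PIOc_write_darray(ncid, varid, ioid, arraylen, field, NULL);

    /* Safe: the next timestep can reuse the same array immediately. */
    for (PIO_Offset i = 0; i < arraylen; i++)
        field[i] += 1.0;    /* placeholder for the model's computation */
}
```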

EnricoDeg commented 3 years ago

Okay thanks.

I still have a question about how PIOc_sync works. I thought that when this function is called, the data are sent from the comp tasks to the IO tasks. However, I've noticed that the data can also be flushed directly to disk. This might be the performance issue, because writing to disk might be much slower than the communication. Can you explain the reason for this choice, and whether there is a way to make sure that comp tasks never write to disk but always send data to the IO tasks?

jedwards4b commented 3 years ago

PIO_Sync and PIO_closefile both cause the data to be moved from compute tasks to IO tasks and then from the IO tasks to disk.

EnricoDeg commented 3 years ago

Okay, but once the data have been sent by the comp tasks, are they free to move on with the computation? If that's the case, it means the communication is the problem. If that's not the case, it means there is some (unnecessary) barrier after the data have been sent.

jedwards4b commented 3 years ago

The barrier is necessary in order to track the results and completion of the operations on the IO procs.
You may be able to spin up a child thread on the compute tasks to monitor the results while computation proceeds. Yes, the communication is the problem.

EnricoDeg commented 3 years ago

If there is a barrier on the comp tasks to check on the IO tasks, then the problem might not be the communication but the writing. If the comp procs have to wait for the IO procs to finish in order to check that everything went well, then maybe the communication is very fast but the writing is slow, and the comp procs have to wait even though they don't write. Does that make sense to you?

What happens if you skip that check and the final barrier? In principle, this would show whether the problem is the communication or the waiting for the final check.

EnricoDeg commented 3 years ago

I went a little deeper into the details of the PIO library to figure out the problem with my application. From my understanding, in the sync function flush_buffer() is called inside a loop. That function sends data from the comp procs to the IO procs, and then the IO procs write the data. This is fine when flush_buffer() is not called inside a loop, but when it is (because you write variables with different types or different dimensions), the I/O is not asynchronous any more. This explains my problem. In order to make the library fully asynchronous, the IO procs would need to fill their buffer with all the file's variables (as the comp procs do) and only then write them. Please let me know if my explanation makes sense to you, because I'm not an expert on this library.

edwardhartnett commented 3 years ago

Can you tell me the file and line number?

EnricoDeg commented 3 years ago

flush_buffer is called at pio_file.c line 420. At line 415 there is a loop over groups of variables; the variables are grouped by variable type and by the number and type of their dimensions. Is that correct? This is what I see with a debugger.

jedwards4b commented 3 years ago

You are looking at the function PIOc_sync which, by definition, ensures that the file is consistent with the current program state. It sounds like you want an additional function that ensures all data have been moved to the IO tasks but not necessarily written to disk?

EnricoDeg commented 3 years ago

Yes, I looked at this function because it is the one called when the file is closed, and it is what slows down my application. The solution, in my opinion, would be to move all the groups of variables from the comp to the IO procs in that loop, and then start another loop afterwards that does the writing. In this way the comp procs are involved only in the first loop and can move on with the computation while the IO procs are writing. That would make the I/O fully asynchronous.
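
Schematically, something like this (pseudocode only: send_group_to_io_tasks() and write_group() are hypothetical names for the two halves of what flush_buffer() does today, not existing PIO functions):

```c
/* Hypothetical split of flush_buffer() into its communication and
 * write halves; these prototypes do not exist in PIO. */
void send_group_to_io_tasks(int group);
void write_group(int group);

void sync_interleaved(int ngroups)      /* roughly what happens now */
{
    for (int g = 0; g < ngroups; g++) {
        send_group_to_io_tasks(g);      /* comp procs involved */
        write_group(g);                 /* IO procs write, comp procs wait */
    }
}

void sync_split(int ngroups)            /* what I am proposing */
{
    for (int g = 0; g < ngroups; g++)
        send_group_to_io_tasks(g);      /* comp procs involved only here */
    for (int g = 0; g < ngroups; g++)
        write_group(g);                 /* only the IO procs, comp procs free */
}
```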

jedwards4b commented 3 years ago

Only compute tasks are inside the block at line 408. I think that the real issue is at line 480 where the compute tasks need to wait for a return code sent from IO tasks. We are still looking for a good option for dealing with this. I suggested implementing a thread in your application where you could call the file close function and wait for the return code while the master thread carries on with computation.

EnricoDeg commented 3 years ago

No, sorry: the comp tasks are inside the block at line 408, but then the flush_buffer function is called, and that function calls PIOc_write_multi(), which also involves the IO tasks. The problem is in the flush_buffer loop. I've investigated that with some timers.

EnricoDeg commented 3 years ago

The thread does not solve the problem because the issue is in the writing process. In the sync function the library does communication - writing - communication - writing, etc., while it should do all the communication once and then all the writing once.

jedwards4b commented 3 years ago

Thanks, I see it now. I'll give it some thought.