flux-framework / dyad

DYAD: DYnamic and Asynchronous Data Streamliner
GNU Lesser General Public License v3.0
6 stars 5 forks source link

Bug Fix for shared storage #81

Closed hariharan-devarajan closed 5 months ago

hariharan-devarajan commented 5 months ago

In case of shared storage, we should not use KVS to synchronize Producer and Consumer. Instead, we first use FS locks to do the sync, if that fails (say consumer got the lock first), we fall back to the case of KVS for sync.

This allows us to optimize our KVS calls by calling it only in case, locks looks the order battle ;).

Note, if use streams, there is no portable way to do locks. Therefore, this PR will use KVS only for that case.

JaeseungYeom commented 5 months ago

I see the performance benefit for using file locking to synchronize between producer and consumer when they have access to the same storage. That is not a typical scenario intended for DYAD unless to deal with the hierarchical storage. We will have to bring back the hierarchical storage management related development once we are done with current publication efforts. However, the description of the bug is lacking here. If I understand the situation, this bug is introduced by the file locking optimization on client side. We no longer check KVS as the first method to determine the local availability of a file. Instead, we check if a file is locally available by opening it and putting a lock on it before checking KVS. If the producer has already opened a file but has not finished, the consumer will see a file and continue reading an incomplete file. Checking if a file size is greater than 0 would not be sufficient. So, I think for a shared storage, we probably should use KVS only even on consumer side.

JaeseungYeom commented 5 months ago

One solution could be to create a file (e.g., node_id_filename.lock) as a lock to indicate that writing is not complete before opening a file to write. Upon closing of the file, the producer removes the lock file. Consumer checks the existence of the local file instead of relying on the file size.

hariharan-devarajan commented 5 months ago

One solution could be to create a file (e.g., node_id_filename.lock) as a lock to indicate that writing is not complete before opening a file to write. Upon closing of the file, the producer removes the lock file. Consumer checks the existence of the local file instead of relying on the file size.

Another option is to do stream I/O with fd. This would open and write data into fd while the interface uses streams.

hariharan-devarajan commented 5 months ago

I see the performance benefit for using file locking to synchronize between producer and consumer when they have access to the same storage. That is not a typical scenario intended for DYAD unless to deal with the hierarchical storage. We will have to bring back the hierarchical storage management related development once we are done with current publication efforts. However, the description of the bug is lacking here. If I understand the situation, this bug is introduced by the file locking optimization on client side. We no longer check KVS as the first method to determine the local availability of a file. Instead, we check if a file is locally available by opening it and putting a lock on it before checking KVS. If the producer has already opened a file but has not finished, the consumer will see a file and continue reading an incomplete file. Checking if a file size is greater than 0 would not be sufficient. So, I think for a shared storage, we probably should use KVS only even on consumer side.

No, I added locks on producer side as well. So everything would be synchronized. For shared storage case with streams, as streams don't have locks we don't use locks to synchronize.

JaeseungYeom commented 5 months ago

I see the performance benefit for using file locking to synchronize between producer and consumer when they have access to the same storage. That is not a typical scenario intended for DYAD unless to deal with the hierarchical storage. We will have to bring back the hierarchical storage management related development once we are done with current publication efforts. However, the description of the bug is lacking here. If I understand the situation, this bug is introduced by the file locking optimization on client side. We no longer check KVS as the first method to determine the local availability of a file. Instead, we check if a file is locally available by opening it and putting a lock on it before checking KVS. If the producer has already opened a file but has not finished, the consumer will see a file and continue reading an incomplete file. Checking if a file size is greater than 0 would not be sufficient. So, I think for a shared storage, we probably should use KVS only even on consumer side.

No, I added locks on producer side as well. So everything would be synchronized. For shared storage case with streams, as streams don't have locks we don't use locks to synchronize.

I see that now. Then I would like to go over again with the full context in mind. However, can you rebase?

hariharan-devarajan commented 5 months ago

@JaeseungYeom I rebased it.

JaeseungYeom commented 5 months ago

I am currently doing something not so portable in another part of the code https://github.com/flux-framework/dyad/blob/6d3df1b5f680ca7295a1e46e7aa731a9982636e6/include/dyad/stream/dyad_stream_api.hpp#L65 based on: https://stackoverflow.com/questions/676787/how-to-do-fsync-on-an-ofstream

This is a confession of my sin. If there is a way to make it portable, we can leverage that for locking.

hariharan-devarajan commented 5 months ago

I am currently doing something not so portable in another part of the code

https://github.com/flux-framework/dyad/blob/6d3df1b5f680ca7295a1e46e7aa731a9982636e6/include/dyad/stream/dyad_stream_api.hpp#L65

based on: https://stackoverflow.com/questions/676787/how-to-do-fsync-on-an-ofstream This is a confession of my sin. If there is a way to make it portable, we can leverage that for locking.

I am not sure what you are doing here. But if u get a file descriptor, then u can add a similar logic to the open and close code I have for the producer case, and then it would work.