flux-framework / dyad

DYAD: DYnamic and Asynchronous Data Streamliner
GNU Lesser General Public License v3.0
6 stars 5 forks source link

reduce the number of realpath calls #89

Closed JaeseungYeom closed 5 months ago

JaeseungYeom commented 5 months ago
  1. Cache the result of realpath call in ctx
  2. use the cached realpath in calling cmp_canonical_path_prefix()
  3. reorder the comparisons between managed path prefix and path This PR dependes on PR #91
hariharan-devarajan commented 5 months ago

@JaeseungYeom Real Path is an expensive call. We should have an env variable that must be turned on if users have relative paths. Otherwise, we should assume they pass absolute paths. This will significantly help the dyad.

Also, for comparison, we should use prefix-tree style, where we build a prefix tree for prod-path and cons-path at init, and then comparisons would be much faster during runtime.

JaeseungYeom commented 5 months ago

@JaeseungYeom Real Path is an expensive call. We should have an env variable that must be turned on if users have relative paths. Otherwise, we should assume they pass absolute paths. This will significantly help the dyad.

Also, for comparison, we should use prefix-tree style, where we build a prefix tree for prod-path and cons-path at init, and then comparisons would be much faster during runtime.

We can discuss about prefix-tree. Here I am trying to see if this simple scheme would work. This is mostly for the case of checking a file under a managed path. If the file is not under it, then a better filtering mechanism will be necessary, which is what you recommend. Here, the first check in the cmp_canonical_path_prefix is under the assume that both the prefix and the path are absolute. If that does not work, then the user path is compared against cashed realpath of prefix. Up to this point, no realpath conversion is done except once at the initialization. If they still do not match, the user path needs conversion. However, before doing this step, we can check try other filtering with known prefixes that are not managed.

hariharan-devarajan commented 5 months ago

@JaeseungYeom Real Path is an expensive call. We should have an env variable that must be turned on if users have relative paths. Otherwise, we should assume they pass absolute paths. This will significantly help the dyad. Also, for comparison, we should use prefix-tree style, where we build a prefix tree for prod-path and cons-path at init, and then comparisons would be much faster during runtime.

We can discuss about prefix-tree. Here I am trying to see if this simple scheme would work. This is mostly for the case of checking a file under a managed path. If the file is not under it, then a better filtering mechanism will be necessary, which is what you recommend. Here, the first check in the cmp_canonical_path_prefix is under the assume that both the prefix and the path are absolute. If that does not work, then the user path is compared against cashed realpath of prefix. Up to this point, no realpath conversion is done except once at the initialization. If they still do not match, the user path needs conversion. However, before doing this step, we can check try other filtering with known prefixes that are not managed.

The check under managed directory is exactly why we wanna use a prefix tree comparison.

So the plan is to do the conversion of abs path of producer/consumer path init.

On actual open calls we will do comparison to see if it matches. If not we do real path. Correct ?

JaeseungYeom commented 5 months ago

BTW, realpath is used not only to handle relative path but also symlinks. First solution should be the one without breaking it. Then, you can can add a solution less general but enabled by an environmental variable.

hariharan-devarajan commented 5 months ago

Agreed.

JaeseungYeom commented 5 months ago

Another circumstance to remember is dynamically create paths including symlinks. I create a simplest code to run the function cmp_canonical_path_prefix() in test_cmp_canonical_path_prefix. I am going to check it in temporarily and remove it later when this gets merged.

hariharan-devarajan commented 5 months ago

@JaeseungYeom I don't disagree we have to handle cases but I would not incur it on every case. We should enable flag options to enable a branch of code which is does realpath

hariharan-devarajan commented 5 months ago

@JaeseungYeom for DYAD namespace, generally we don't use paths that exists. Due to cost of comparison.

Instead, they use DYAD: like mpiio: or unifyfs:

This would reduce the cost of comparison.

Also I was wondering if we should hash our managed directory then we we can get the len of chars and do hash comparison.

Thoughts?

JaeseungYeom commented 5 months ago

@JaeseungYeom for DYAD namespace, generally we don't use paths that exists. Due to cost of comparison.

Instead, they use DYAD: like mpiio: or unifyfs:

This would reduce the cost of comparison.

Also I was wondering if we should hash our managed directory then we we can get the len of chars and do hash comparison.

Thoughts?

Hashing does not allow you avoid reading character string of the path provided by user. Each character in a user path will need to be inspected to see where is the directory boundary while computing hashing. If there is a match, then we will have to do the full character comparison from the start. For non-matching case, we will have to compare the hash of managed path and the hash of each directory prefix of the user path. This sounds like a win but to see if the current character is directory deliminator '/', we still have to compare every character of user path to anoher character.

hariharan-devarajan commented 5 months ago

@JaeseungYeom for DYAD namespace, generally we don't use paths that exists. Due to cost of comparison. Instead, they use DYAD: like mpiio: or unifyfs: This would reduce the cost of comparison. Also I was wondering if we should hash our managed directory then we we can get the len of chars and do hash comparison. Thoughts?

Hashing does not allow you avoid reading character string of the path provided by user. Each character in a user path will need to be inspected to see where is the directory boundary while computing hashing. If there is a match, then we will have to do the full character comparison from the start. For non-matching case, we will have to compare the hash of managed path and the hash of each directory prefix of the user path. This sounds like a win but to see if the current character is directory deliminator '/', we still have to compare every character of user path to anoher character.

So I was thinking as follows

  1. Change prefix from actual directory to dyad: this will reduce the size of comparison.
  2. As prefix is of fixed size, we should not do strcmp instead, extract first 6 char like and do hash.
  3. Yes hash will go over string but it's order of n which is 6 here a constant also if u do strcmp as it does 6x6 comparison it's much more expensive.
  4. So with this algo the cost would O(n) but n would be a constant.
  5. With strcmp it would be O(n2) and without DYAD: u would have n dynamic which will increase.
JaeseungYeom commented 5 months ago

I don't follow the "1.". This kind of discussion is is separate from this PR. If there is a way to speed up, I am all years but we do that over webex.

JaeseungYeom commented 5 months ago

I made the logic change we discussed. Now, we filter out non-conforming cases via hash comparison first, then work on matching cases. In addition, I made improvement on other functions to squeeze out some more performance. With the assumption of nonexistent non-conforming case, we can bypass the initial filtering part. I rebased this on PR #91 due to logger related compiler errors. I also removed the dependency on flux core in flux_structures.h, which is included everywhere.