darshan-hpc / darshan

Darshan I/O characterization tool

WIP glob feature #936

Open yariseidenbenz opened 1 year ago

yariseidenbenz commented 1 year ago

Related to gh-749

This pull request introduces the glob_feature.py script, which enhances the functionality of the existing HTML PyDarshan report. The purpose of this script is to provide more detailed information about the file paths recorded in the Darshan log specified on the command line.

Currently, the PyDarshan report displays the list of accessed files without specific details. With the glob_feature.py script, we aim to address this limitation by generating a condensed and organized data frame. This data frame consists of two columns: filename_glob and glob_count.

The filename_glob column contains the file paths recorded in the Darshan log, providing a clear representation of the accessed files. The glob_count column indicates how many times each file path (or condensed pattern) occurs, offering insight into access frequency.

Furthermore, this script goes beyond common prefixes of file paths and also handles cases where paths diverge after the last common directory. These divergent regions are represented by the .* regex pattern.
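The condensed two-column frame described above could be assembled with pandas along these lines (a hypothetical sketch with made-up glob patterns and counts; the real values come from the Darshan log's name records, and this is not the script's actual code):

```python
import pandas as pd

# made-up glob -> count data for illustration only
glob_counts = {
    "/projects/radix-io/snyder/e3sm/can_I_out_h[.*].nc": 2,
    "/projects/radix-io/E3SM-IO-inputs/i_case_1344p.nc": 1,
}

# sort so the most frequently matched patterns appear first
df = pd.DataFrame(
    sorted(glob_counts.items(), key=lambda kv: kv[1], reverse=True),
    columns=["filename_glob", "glob_count"],
)
print(df)
```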

We hope to eventually incorporate this script into the PyDarshan summary report, in order to enhance the understanding of file access patterns and provide more comprehensive information to users. Your feedback and suggestions for further improvements are welcome.

shanedsnyder commented 1 year ago

Agreed on Tyler's feedback, specifically:

So, in the 2nd case there's probably some additional work needed to find out exactly where the "matching" globs differ, with us clearly marking where this difference occurs.

In my opinion, I don't think we need to be super elaborate in terms of how we indicate what the non-matching parts of a glob are. E.g., above I don't think we need to mark that the [.*] portion of the glob is a number (\d). My thought process there is just that it seems like it could produce some complicated regular expressions that might be confusing for users. I think most helpful is just knowing exactly where the globs differ, not necessarily how. You could also maybe offer like a verbose option so users could have all of the filenames associated with a glob printed out to see exactly what all is included, or something like that.

I'll try to share some other logs I have that have much more data (i.e., filenames to glob) that we could use to see how we handle more complicated use cases -- I have them, just need to make them available at darshan-logs repo.

tylerjereddy commented 1 year ago

I think the main additions I'd want to see near-term are:

tylerjereddy commented 1 year ago

If I hack around some of the bugs noted above, I can finally get your script to produce a table for e3sm_io_heatmap_only.darshan:

[screenshot: resulting glob table for e3sm_io_heatmap_only.darshan]

That looks a bit closer to what we want, though I'm not sure why you decided to remove the heatmap-style coloring from the table?

The process for moving forward might then be to ask the Argonne folks for feedback on that table given the raw/original filenames:

16592106915301738621                                      heatmap:POSIX
7713204788897956166                                darshan-apmpi-header
2250748219237254214                                               APMPI
3748348667804457392   /projects/radix-io/E3SM-IO-inputs/i_case_1344p.nc
3668870418325792824                                       heatmap:MPIIO
14734109647742566553                                            <STDIN>
15920181672442173319                                           <STDOUT>
7238257241479193519                                            <STDERR>
3989511027826779520                                       heatmap:STDIO
6966057185861764086      /projects/radix-io/snyder/e3sm/can_I_out_h0.nc
12386309978061520508     /projects/radix-io/snyder/e3sm/can_I_out_h1.nc

From Shane's feedback above, it sounds like we might want to switch \d to the more general [.*] or whatever to keep it simple for users. So you'd adjust your pytest test based on Shane's feedback, then make it pass again by adjusting your code accordingly. Then move on to the next log file, and so on, trying to reach consensus on the expected results as you go, and enforcing that the tests still pass for the old and new logs as you go, which should force the code to generalize more and more as the test cases get bigger/trickier.

tylerjereddy commented 1 year ago

I'll add a few review comments now that Yaris has pushed up a first attempt at working with pytest and the package more broadly.

There are a few things to address here:

1) There's no good reason to have two copies of the glob_feature.py module in the same branch/pull request--they aren't even the same version of the code, which is extra confusing. Which location is correct? Well, there are currently no Python library modules (only setup.py, which is a special install file) in the darshan project root, while every other Python library module resides at a deeper level of nesting:

pydarshan
├── benchmarks
│   └── benchmarks
│       ├── __init__.py
│       ├── dxt_heatmap.py
│       ├── log_utils.py
│       └── report.py
├── darshan
│   ├── __init__.py
│   ├── __main__.py
│   ├── backend
│   │   ├── __init__.py
│   │   ├── api_def_c.py
│   │   └── cffi_backend.py
│   ├── cli
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── info.py
│   │   ├── name_records.py
│   │   ├── summary.py
│   │   └── to_json.py
│   ├── datatypes
│   │   ├── __init__.py
│   │   └── heatmap.py
│   ├── discover_darshan.py
│   ├── examples
│   │   ├── __init__.py
│   │   ├── darshan-graph
│   │   ├── example_logs
│   │   │   └── __init__.py
│   │   └── tutorial
│   │       ├── hello.py
│   │       ├── plot.py
│   │       └── tojson.py
│   ├── experimental
│   │   ├── __init__.py
│   │   ├── aggregators
│   │   │   ├── __init__.py
│   │   │   ├── agg_ioops.py
│   │   │   ├── create_dxttimeline.py
│   │   │   ├── create_sankey.py
│   │   │   ├── create_time_summary.py
│   │   │   ├── create_timeline.py
│   │   │   ├── mod_agg_iohist.py
│   │   │   ├── name_records_summary.py
│   │   │   ├── print_module_records.py
│   │   │   ├── records_as_dict.py
│   │   │   └── summarize.py
│   │   ├── operations
│   │   │   ├── filter.py
│   │   │   ├── merge.py
│   │   │   └── reduce.py
│   │   ├── plots
│   │   │   ├── __init__.py
│   │   │   ├── data_access_by_filesystem.py
│   │   │   ├── heatmap_handling.py
│   │   │   ├── plot_access_histogram.py
│   │   │   ├── plot_common_access_table.py
│   │   │   ├── plot_dxt_heatmap.py
│   │   │   ├── plot_dxt_heatmap2.py
│   │   │   ├── plot_io_cost.py
│   │   │   ├── plot_opcounts.py
│   │   │   └── plot_posix_access_pattern.py
│   │   └── transforms
│   │       └── dxt2png.py
│   ├── lib
│   │   └── accum.py
│   ├── log_utils.py
│   ├── report.py
│   └── tests
│       ├── __init__.py
│       ├── conftest.py
│       ├── input
│       │   └── __init__.py
│       ├── test_cffi_misc.py
│       ├── test_data_access_by_filesystem.py
│       ├── test_error.py
│       ├── test_heatmap_handling.py
│       ├── test_lib_accum.py
│       ├── test_log_utils.py
│       ├── test_moddxt.py
│       ├── test_modmpiio.py
│       ├── test_modposix.py
│       ├── test_modstdio.py
│       ├── test_plot_common_access_table.py
│       ├── test_plot_dxt_heatmap.py
│       ├── test_plot_exp_common.py
│       ├── test_plot_io_cost.py
│       ├── test_report.py
│       ├── test_report_copy.py
│       └── test_summary.py
├── devel
├── docs
│   └── conf.py
└── setup.py

So, the module at path darshan-util/pydarshan/darshan/glob_feature/glob_feature.py seems like a more sensible choice. Let's try making that change on this branch and then running the test you added:

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    deleted:    glob_feature.py

python -m pytest --pyargs darshan -k "test_glob_tables"

Now we get the same failure that you can see in the CI after you pushed up. When you see failures in the CI like that, it usually means there's a problem with your pull request/code:

============================================================================================= ERRORS ==============================================================================================
___________________________________________________________________________ ERROR collecting tests/test_glob_feature.py ___________________________________________________________________________
/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/tests/test_glob_feature.py:9: in <module>
    print(sys.path)  # Print sys.path again
E   NameError: name 'sys' is not defined
----------------------------------------------------------------------------------------- Captured stdout -----------------------------------------------------------------------------------------
2.0.1
===================================================================================== short test summary info =====================================================================================
ERROR tests/test_glob_feature.py - NameError: name 'sys' is not defined
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================================ 532 deselected, 1 error in 0.67s =================================================================================

Ok, so let's fix that error and try again:

(py_311_darshan_dev) treddy@pn2305118 pydarshan % git diff
diff --git a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
index 8a779857..b26cf309 100644
--- a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
+++ b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
@@ -6,10 +6,8 @@ print(pd.__version__)
 from pandas.testing import assert_frame_equal
 import pytest
 import re 
-print(sys.path)  # Print sys.path again
 import glob_feature

-print("hello")
 @pytest.mark.parametrize("log_name, expected_df", [
      # grow this with more logs...
      ("e3sm_io_heatmap_only.darshan",

Now, we run the test again and get another failure--it looks like you're still having issues with Python packaging because the import doesn't work:

___________________________________________________________________________ ERROR collecting tests/test_glob_feature.py ___________________________________________________________________________
ImportError while importing test module '/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/tests/test_glob_feature.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/tests/test_glob_feature.py:9: in <module>
    import glob_feature
E   ModuleNotFoundError: No module named 'glob_feature'

Well, the import works for every single other module under test in the project, so why not just study one of them as an example?

tree darshan/experimental -P "*init*.py"

darshan/experimental
├── __init__.py
├── __pycache__
├── aggregators
│   ├── __init__.py
│   └── __pycache__
├── operations
│   └── __pycache__
├── plots
│   ├── __init__.py
│   └── __pycache__
└── transforms

It looks like those __init__.py files might be useful, like I mentioned a few times in our discussions on Slack. I don't see one in your new path:

tree darshan/glob_feature

darshan/glob_feature
└── glob_feature.py

Ok, let's add that in, so:

new file:   darshan/glob_feature/__init__.py

Try the testing again; still get ModuleNotFoundError: No module named 'glob_feature'. Well, that makes sense: glob_feature isn't some external package, it's in our darshan package now that you've made the packaging changes to move it in, so let's adjust the import based on what we do for other modules in our package:

git grep -E -i "import heatmap"

from darshan.experimental.plots import heatmap_handling

So, for us, maybe something like:

--- a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
+++ b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
@@ -6,7 +6,7 @@ print(pd.__version__)
 from pandas.testing import assert_frame_equal
 import pytest
 import re 
-import glob_feature
+from darshan.glob_feature import glob_feature

 @pytest.mark.parametrize("log_name, expected_df", [
      # grow this with more logs...

And now the test will actually run--of course it fails, and we've already discussed this failure before:

>       style = style.hide_index()
E       AttributeError: 'Styler' object has no attribute 'hide_index'

Part of the confusion may be that you had two copies of the module that weren't synced (that's always a terrible idea!), or maybe you copy-pasted something without genuinely understanding it. In any case, this should help you move forward quite a bit, since if your local env is set up properly, the collection of changes above should allow you to run the test with:

python -m pytest --pyargs darshan -k "test_glob_tables"

There are quite a few other issues in the code; one is so major that I'll add an inline comment right away, but the others I'm hoping you'll study carefully and clean up a bit.

tylerjereddy commented 1 year ago

I'll wait for another round of cleanups before commenting again--there are a lot of changes I suggested, and having two separate copies of the new code module in the same branch is basically insanity to review, so I won't try to guess what you're doing there. But hopefully I've given you a boost here at least...

tylerjereddy commented 1 year ago

Possible checklist for updating the PR might be something like:

yariseidenbenz commented 1 year ago

Changes: I got rid of the glob_feature.py script located in /vast/home/yaris/darshan/darshan-util/pydarshan/darshan.

So now the only glob_feature.py script is located in darshan/darshan-util/pydarshan/darshan/glob_feature.

I fixed the styling issues

I also added a few new lines of code that iterate over the list of paths within a group, find differences, and replace those differences with [.*]. These new paths are then represented within the data frame.

Here are some examples:

This is the test for the log file apmpi/darshan-apmpi-2nodes-64mpi.darshan:

[screenshot: glob table, 2023-06-19 11:25 AM]

This is the test for the log file hdf5_diagonal_write_only/hdf5_diagonal_write_1_byte_dxt.darshan:

[screenshot: glob table, 2023-06-19 11:28 AM]

This is the test for the log file partial_data_stdio/partial_data_stdio.darshan:

[screenshot: glob table, 2023-06-19 12:34 PM]

Questions: The paths are correct and condensed, but adding [.] for each differing character makes the patterns long. Should I restrict the number of uses of [.]?

The ratio method from the difflib library is used to group file paths together when they are at least 80% similar. This works, but it might not be the best approach: file paths can be similar enough to clear the 80% ratio and still have big differences afterwards. The matcher.ratio() method could be considered a weakness because it is not precise enough. Here is an example with the log file partial_data_stdio/partial_data_stdio.darshan, without condensing:

[screenshot: uncondensed table, 2023-06-19 12:58 PM]

With condensing:

[screenshot: condensed table, 2023-06-19 12:34 PM]

As you can see, the file paths are the same up until the last directory. The current glob_feature.py script groups everything together because of the similarity score and then adds [.] for each differing character. In this test case, however, it would probably be better to display two file paths: /home/carns/working/dbg/darshan-examples/test.out and /home/carns/working/dbg/darshan-examples/foo[.][.][.]

Is what I have currently acceptable, or would you want the output above? You gave me some resources on the ndiff method in the difflib library, which might help solve the issue, and there are also machine learning clustering algorithms we could use.
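For reference, the 80% ratio grouping described above can be sketched roughly like this (a simplified stand-in for the script's logic, using greedy assignment against each group's first member):

```python
import difflib

def group_by_ratio(paths, threshold=0.8):
    """Greedily group paths whose similarity ratio meets the threshold."""
    groups = []
    for path in paths:
        for group in groups:
            # compare against the group's first member only (simplification)
            if difflib.SequenceMatcher(None, group[0], path).ratio() >= threshold:
                group.append(path)
                break
        else:
            groups.append([path])
    return groups

paths = [
    "/home/carns/working/dbg/darshan-examples/test.out",
    "/home/carns/working/dbg/darshan-examples/foo1",
    "/home/carns/working/dbg/darshan-examples/foo2",
]
print(group_by_ratio(paths))
```

Note that for these three paths the ratio check lumps test.out in with the foo files, because the long shared directory prefix dominates the ratio: this is exactly the weakness described above.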

Should I restrict the count of HDF5 files? These are the hierarchical files that usually look something like:

[screenshot: example HDF5 dataset path listing]

They usually end with Dataset, as in the case above. In cases where there are many HDF5 dataset files, should I restrict their count, since from my understanding they are technically one file? For example, given /path/to/file/Dataset:000 and /path/to/file/Dataset:001, should the glob_count for both of these be one because they are the same file, or should they have a combined glob_count of two?

To do:

tylerjereddy commented 1 year ago

Looks like a step in the right direction now.

Should I restrict the number of uses of [.]?

Yes, you should only have one [.*] per region that differs, irrespective of the size of the difference, but all flanking matches should stay explicit and not be replaced; I'm pretty sure that is what Shane meant. That's also the whole point of our meetings: to make sure you are 100% certain what the team wants. If it isn't 100% clear what you should do for a given input file (and some of the files above are repeated from the meeting), you should try to clear that up in the meetings, but doing it here is the second best way at least!

partial_data_stdio.darshan should probably have: /home/carns/working/dbg/darshan-examples/foo[.*] and /home/carns/working/dbg/darshan-examples/test.out. However, if all the foo files actually start with 6, then foo6[.*] should probably be used to show the exact point where they diverge.

As for algorithms, I only sent you ndiff as an example; the point is to carefully compare the various combinations of strings with a higher degree of granularity than just the similarity ratio. I don't see the pytest testing expanded for the new files you've discussed above, so maybe start expanding the parametrized test for those cases.
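One stdlib-only way to find exactly where names diverge (an illustrative sketch, not the PR's code) is `SequenceMatcher.get_opcodes`, folding a pairwise glob across the group so each differing region collapses to a single wildcard:

```python
import difflib
from functools import reduce

def glob_pair(a, b, wildcard="[.*]"):
    """Build a glob that keeps the exactly-matching runs of two strings."""
    out = []
    for tag, i1, i2, _, _ in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            out.append(a[i1:i2])
        elif not out or out[-1] != wildcard:
            out.append(wildcard)  # one wildcard per differing region
    return "".join(out)

# shortened illustrative paths, not the real log contents
paths = ["/darshan-examples/foo61", "/darshan-examples/foo63", "/darshan-examples/foo65"]
print(reduce(glob_pair, paths))  # /darshan-examples/foo6[.*]
```

Folding pairwise like this keeps the shared `foo6` prefix explicit and marks only the true point of divergence, matching the foo6[.*] behavior discussed above.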

Also, this is still broken on your branch, so you should try to fix it and add a regression test using pytest, similar to what we discussed the other day for the main HTML reports, which are also tested via their command line incantation:

python glob_feature.py -p ~/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_only.darshan

Traceback (most recent call last):
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 95, in <module>
    main(log_path=args.log_path, output_path=args.output_path)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 86, in main
    with open(output_path, "w") as html_file:
         ^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType

The incantation with -o does work at least, so that's good, though you should add a test for that incantation as well, to make sure it doesn't break as you make changes.

Finally, the CI is failing for this PR, so you should try to add in the appropriate dependency for now, so that we can see your tests passing or failing appropriately.

tylerjereddy commented 1 year ago

In case it wasn't clear from the previous meeting, here is some guidance on the approach to testing/vision for moving forward:

1) It makes sense to continue expanding test_glob_tables with more cases from darshan-logs, which you already appear to be working on, making sure there's a reasonable consensus on the expected values from the team.

2) It also makes sense to come up with your own test cases, for example drawing out your own file tree and then testing for the appropriate clustering/globbing behavior on those data structures. This is likely to be a separate test, since the input is the data structure that gets extracted from the log rather than the log proper, as discussed at the last meeting. The advantage is that you don't have to generate logs this way, so darshan-logs doesn't fill up too much just for this one thing. Creativity in designing these test cases may mean that you have filepath trees that differ in 1, 2, 3, 4 different locations (start, middle, end of file paths), or have tons of files that aren't related, etc.

3) Once you've got the above tests, with agreement from the team on expected values, encoded into pytest test cases, you can start to worry about the algorithm that will accomplish the clustering/globbing task. If you genuinely don't think difflib and the Python standard library will suffice, it will be important to clearly communicate why that is, and exactly what capability you'll need. As it stands, it looks to me like a combination of difflib and careful manual algorithm design may do the trick, but if there is a clear and concise explanation of why the plethora of capabilities in difflib and the Python standard library don't suffice, and you can show the tests passing for the large number of diverse cases described above, I suspect we might be swayed to deviate from the standard library at that point.
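A self-contained sketch of point 2, with made-up path lists and a toy `condense` helper standing in for the real glob_feature logic (names and expected values here are illustrative, not team-approved):

```python
import pytest

def condense(paths, wildcard="[.*]"):
    """Toy reference: column-wise compare equal-length paths, collapsing
    each run of differing columns into one wildcard."""
    if len(set(len(p) for p in paths)) != 1:
        raise ValueError("toy helper assumes equal-length paths")
    out, in_diff = [], False
    for chars in zip(*paths):
        if len(set(chars)) == 1:
            out.append(chars[0])  # all paths agree on this column
            in_diff = False
        elif not in_diff:
            out.append(wildcard)  # first differing column of a run
            in_diff = True
    return "".join(out)

@pytest.mark.parametrize("paths, expected", [
    # difference in the middle of the path
    (["/data/run1/a.nc", "/data/run2/a.nc"], "/data/run[.*]/a.nc"),
    # two separate differing regions
    (["/a/x/out.1", "/a/y/out.2"], "/a/[.*]/out.[.*]"),
])
def test_condense(paths, expected):
    assert condense(paths) == expected
```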

At the end of the project, even if the final product from the summer is basically just a bunch of agreed-upon but very well crafted test cases, that's already considerable value delivered, since the task of identifying exactly what folks want is a big part of the challenge of a software project, and encoding that in tests will make it easier to move forward.

yariseidenbenz commented 11 months ago

Description of Changes: I have made significant improvements to the file path grouping process by adopting agglomerative hierarchical clustering from the scikit-learn library. This method is more efficient than the previous difflib sequence matcher approach. Moreover, the number of clusters is no longer fixed; it is now determined dynamically using the silhouette method. This dynamic clustering capability allows the script to handle various test cases of different sizes more effectively.

Additionally, I have expanded the test_glob_feature.py to include more log files from the darshan log repository. This expansion ensures a more comprehensive testing process, covering a broader range of scenarios and data.

I've added the necessary dependencies to the main_ci.yml.

Next Steps: Create a verbose option to display all the file paths if the user wishes to.

Moreover, I am currently exploring ways to display more commonalities within the representing file path. While the current implementation effectively finds differences and replaces them with [.*], there is room for improvement in handling commonalities at the end of file paths. For example, if multiple file paths have common elements at their endings but different positions, these commonalities are currently not accounted for in the representing file path. To address this, I am considering indexing from the back to the front and concatenating the common elements to the merged_path, which would create a more informative and accurate representing file path.

Here is an example of the issue. Say this is a group:

path/to/file000.nc
path/to/file9999000.nc
path/to/file8888888000.nc

The representing file path would be path/to/file[.*].nc with glob count: 3.

Even though the file paths all contain 000 in the last component, to me it would make sense for the ideal representing path to be: path/to/file[.*]000.nc with glob count: 3.

Any insight on how to approach this problem?

Here are the lines of code that create the representing file path for a group: they replace the differing characters within the group with [.*] and check whether the paths in a group share a common file extension:

    new_paths = []
    for _, group in grouped_paths.items():
        if len(group) > 1:
            merged_path = ""
            max_length = max(len(path) for path in group)
            # True once matching characters have been emitted since the last
            # wildcard, so each run of differing columns collapses to one "[.*]"
            differing_chars_encountered = False

            for i in range(max_length):
                chars = set(path[i] if len(path) > i else "" for path in group)
                if len(chars) == 1:
                    # all paths agree on this column; keep the character
                    merged_path += chars.pop()
                    differing_chars_encountered = True
                else:
                    if differing_chars_encountered:
                        merged_path += "[.*]"
                        differing_chars_encountered = False

            # Check whether all paths share the same file extension
            extensions = [os.path.splitext(path)[1] for path in group]
            common_extension = None
            if len(set(extensions)) == 1:
                common_extension = extensions[0]

            # Append the common extension if it exists and isn't already present
            if common_extension and common_extension not in merged_path:
                merged_path += common_extension

            new_paths.append((merged_path, len(group)))
        else:
            new_paths.append((group[0], 1))
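For the 000.nc suffix question above, one illustrative stdlib approach (not part of the PR) is to strip the longest common prefix and the longest common suffix first, and wildcard only the middle:

```python
import os.path

def merge_with_suffix(group, wildcard="[.*]"):
    """Keep the longest shared prefix AND suffix, wildcarding the middle."""
    prefix = os.path.commonprefix(group)
    # common suffix = common prefix of the reversed strings
    suffix = os.path.commonprefix([p[::-1] for p in group])[::-1]
    # don't let prefix and suffix overlap on the shortest member
    max_suffix = min(len(p) for p in group) - len(prefix)
    suffix = suffix[-max_suffix:] if max_suffix > 0 else ""
    return prefix + wildcard + suffix

group = ["path/to/file000.nc", "path/to/file9999000.nc", "path/to/file8888888000.nc"]
print(merge_with_suffix(group))  # path/to/file[.*]000.nc
```

This reproduces the desired path/to/file[.*]000.nc output for the example group above.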

yariseidenbenz commented 11 months ago

Here are the file paths for the acme_fcase_trace log file:

[screenshot: acme_fcase_trace glob table, 2023-07-21 3:11 PM]

Here are the 3 file paths that are being grouped together in the last row under the representing file path name /projects/ccsm/inputdata/[.]n.[.].[.]1[.]0[.*].nc:

/projects/ccsm/inputdata/share/domains/domain.lnd.ne120np4_gx1v6.110502.nc
/projects/ccsm/inputdata/ocn/docn7/domain.ocn.1x1.111007.nc
/projects/ccsm/inputdata/share/domains/domain.ocn.ne120np4_gx1v6.121113.nc

It would probably be more beneficial to display them individually rather than grouped in this case.

shanedsnyder commented 11 months ago

Here are the 3 file paths that are being grouped together in the last row under the representing file path name /projects/ccsm/inputdata/[.]n.[.].[.]1[.]0[.*].nc:

/projects/ccsm/inputdata/share/domains/domain.lnd.ne120np4_gx1v6.110502.nc
/projects/ccsm/inputdata/ocn/docn7/domain.ocn.1x1.111007.nc
/projects/ccsm/inputdata/share/domains/domain.ocn.ne120np4_gx1v6.121113.nc

It would probably be more beneficial to display them individually rather than grouped in this case.

I think most of this table looks good, but I have a different thought on what might work best for the issue above. I'd actually recommend either only:

/projects/ccsm/inputdata/[.*].nc

Or the list of:

/projects/ccsm/inputdata/atm/[.*].nc
/projects/ccsm/inputdata/ocn/[.*].nc
/projects/ccsm/inputdata/share/[.*].nc

Does that simplify anything or make things more complicated? I think either of those options is more helpful than the current output.

Note that what we decide there could also affect what we decide to do with these 2 further up:

/projects/ccsm/inputdata/atm/cam/physprops/[.*].nc
/projects/ccsm/inputdata/atm/cam/[.*].nc
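The directory-level collapse suggested above could be sketched like this (illustrative helper, not project code):

```python
import os

def dir_glob(group, wildcard="[.*]"):
    """Collapse a group to its deepest shared directory plus a shared extension."""
    common_dir = os.path.commonpath(group)
    exts = {os.path.splitext(p)[1] for p in group}
    ext = exts.pop() if len(exts) == 1 else ""
    return os.path.join(common_dir, wildcard) + ext

group = [
    "/projects/ccsm/inputdata/share/domains/domain.lnd.ne120np4_gx1v6.110502.nc",
    "/projects/ccsm/inputdata/ocn/docn7/domain.ocn.1x1.111007.nc",
    "/projects/ccsm/inputdata/share/domains/domain.ocn.ne120np4_gx1v6.121113.nc",
]
print(dir_glob(group))  # /projects/ccsm/inputdata/[.*].nc
```

For the three paths above this yields the single /projects/ccsm/inputdata/[.*].nc option; producing the per-subdirectory list instead would mean splitting the group one directory level below the common path first.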

There's also a couple of examples that are maybe more complex than they need to be visually (I know that's not always easy to square with what the algorithm is doing):

test_F_case_cetus_dxt.[.*]00[.*]
run/timing/[.*]i[.*]

In both cases, there are a couple of groups of "match anything" blocks ([.*]) separated by only one or two characters. I think they'd look better if they were just collapsed down into a single [.*] (throwing out the additional context of the '00' or the 'i'). Not sure if that's easy to accomplish, but maybe you could avoid having such a short run of characters between two [.*] blocks?
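Collapsing short literal runs between two wildcard blocks could be handled with a small regex pass, something like this sketch (the 2-character threshold is an assumption, not a team decision):

```python
import re

def collapse_short_runs(glob, max_run=2, wildcard="[.*]"):
    """Merge wildcard blocks separated by <= max_run literal characters."""
    w = re.escape(wildcard)
    pattern = re.compile(w + r".{1,%d}" % max_run + w)
    prev = None
    while prev != glob:  # repeat until no more merges apply
        prev = glob
        glob = pattern.sub(wildcard, glob)
    return glob

print(collapse_short_runs("test_F_case_cetus_dxt.[.*]00[.*]"))  # test_F_case_cetus_dxt.[.*]
print(collapse_short_runs("run/timing/[.*]i[.*]"))              # run/timing/[.*]
```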

yariseidenbenz commented 11 months ago

[screenshot: acme_fcase_trace table, 2023-07-31 2:15 PM]

@shanedsnyder Here is the same test case (acme_fcase_trace), but after grouping the common prefixes together it now also groups by file extension.

Would this be desired output?

Here is another example with the hdf5_diagonal_write_1_byte test case:

[screenshot: hdf5_diagonal_write_1_byte table, 2023-07-31 2:40 PM]

shanedsnyder commented 11 months ago

Yeah, I find those to be a lot more helpful personally, but maybe other folks have a different thought.

My only additional comment is on the first example you shared: I notice that all of the files that don't match a glob (i.e., count of 1, no wildcard) do share a common prefix that's pretty deep into the path name -- I wonder if in that case we should instead glob them all into one /whatever/run/[.*] glob? I think it'd look cleaner visually at least to have those further grouped, if it's not too complicated or seemingly arbitrary.

Thinking out loud a little more, regardless of what we decide above, it might be nice if there was like a "verbose" option to the tool that could spit out the entire list of all matching files in each glob (maybe indented right beneath the glob patterns in the tables you already have)? That way even if the tool doesn't group things in a way that's super helpful for users, they can still check and see what all files got matched to each glob, if that makes sense?