PDS4 Indexing Tools Improvements and Additions

esimpsons3ti commented 2 months ago

The indexing tool (now pds4_create_xml_index.py) now has six new argparse commands:

--output-headers-file
--limit-xpaths-file (previously --elements-file)
--simplify-xpaths (previously --disambiguate-xpaths)
--fixed-width
--generate-label
--label-user-input

Major changes have been implemented into the code to make the result more human-readable and to fix several major bugs. Some of these issues include missing indexed data, inaccurate XPath headers, and Windows incompatibility.
Added unit tests and associated supplemental files, which as of 7/5 all pass:
- tests/ contains the unit tests for the indexing tool
- test_files/expected contains golden copies for unit test comparison
- test_files/samples contains supplemental files for the unit tests to use
- test_files/labels contains fake label files for the unit tests to use
Docstrings and internal documentation has been fully incorporated into the indexing tool, and has also been reformatted to follow Google style.

There is also the implementation of the label generation feature, which includes:

label generation functions
template label file
default input file (called default_values.yaml)

Fixes #3
Fixes #5
Fixes #6
Fixes #8

rfrenchseti commented 2 months ago

I think maybe --output-file should be a required argument. Right now it defaults to putting the resultant CSV file in the root directory of the bundle/collection, and calling it index_file.csv. This is almost certainly the wrong thing to do. First of all, people probably don't want to modify their archives unless they know the result is good. Second they are unlikely to want to just call it index_file.csv. And if you don't know what the default is, the file is actually really hard to find. So maybe we should just force people to specify the filename/path.

rfrenchseti commented 2 months ago

Here's another thought. Right now --output-file is generic - it will produce a CSV file or a TXT file full of elements. Maybe it would make more sense to have different command line options for different outputs, like --output-csv-file and --output-xpaths-file and then get rid of --dump-available-xpaths as an option. Then you could actually output both a CSV file and an XPath file at the same time, which could be handy. If you don't specify any output file, print a message and exit.

rfrenchseti commented 2 months ago

The test coverage is only 85%, and if you look at the coverage HTML report you'll see a lot of code that really needs to be tested. It also shows some code that might be obsolete and never used anymore, in which case you should delete it. I think this program should be at 100% test coverage. It won't take that much effort to get there.

codecov[bot] commented 2 months ago

Welcome to Codecov :tada:

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered :open_umbrella:

matthewtiscareno commented 2 months ago

It seems that, if there is an attribute that appears in some of the indicated product labels but not others, then the column will be created and the value of "NaN" will be entered in that column for those product labels that do not include the attribute. Is this intended?

rfrenchseti commented 2 months ago

It seems that, if there is an attribute that appears in some of the indicated product labels but not others, then the column will be created and the value of "NaN" will be entered in that column for those product labels that do not include the attribute. Is this intended?

As a side note, it's a very bad idea to create index files from labels that have dramatically different layouts. The resulting index will just have a lot of empty columns and won't make much sense. This is why, for example, we create three different index files for uranus_occs_earthbased - even though rings, global, and atmosphere have a lot in common, they are different enough that having a single index file is just confusing. I'm mentioning this here mainly so that @esimpsons3ti will remember to put this as a big important note in the documentation she will write.

That said, there are invariably small changes between labels in the same data set. For example, there might be a different number of keywords, or SPICE kernels, or auxiliary products. So we need to be able to handle those cases.

The question is: What do we put in the CSV file when the data is "missing"? The current code puts in NaN for both string and numeric fields, which I agree is not great. But what should we put in? I would argue that maybe the fields should just be empty. Alternatively we could make it used-configurable.

esimpsons3ti commented 2 months ago

A few notes about this:

This behavior was due to an element within the batch of tested label files being nilled when they weren't nillable (in this case, Complex_FOV). Previously, I didn't have code that would go in and look for the data type of the erroneously-nilled element. It would just print a warning and move on.
Now that I do have this code, I have integrated it into store_element_text. Now, if a non-nillable element is nilled, it will still print the warning, but there will no longer be NaNs placed where the default value is.

rfrenchseti commented 2 months ago

I don't think this is the only time this can happen. I produced NaNs myself using the Uranus occ bundles.

matthewtiscareno commented 2 months ago

The Cassini VIMS bundle contains a number of "composite structures" that are broken up into individual data products. The data products that are part of a composite structure have an entire additional class full of attributes that describes how they relate to the rest of the structure. For example, in this directory, the data products ending with _xxx have that additional class, while those that don't don't.

It is quite reasonable to want an index file that includes all of these products together, which means that the case arises of columns that are present in some labels but absent from others.

Do we need to inquire into what is the data type of each column in order to define the special constants for when they are missing?

esimpsons3ti commented 2 months ago

flake8 has been failing the code on line length, even when my flake8.txt file is set to the 90 character limit. But I'll wait on your feedback for now.

matthewtiscareno commented 2 months ago

I ran the tool a couple days ago. It initially didn't work, but @esimpsons3ti sat down with me and diagnosed the problem. It was this line in the process_schema_location function: tree = etree.parse(file_path) For some reason, the way I configured the VSCode JSON was inputting file_path as a PosixPath rather than as a string, which broke the code. We fixed it by simply changing that line so that it read tree = etree.parse(str(file_path)) I'm not sure whether this fix, or a different fix for that situation, has been added to the branch.

Once this was done, the tool ran successfully for me and produced an index file. However, I have other tests I'd like to run, and I just now came back to it. However, I now find that, at this point, I cannot run the tool at all because yesterday's changes seem to have introduced a requirement of pdstemplate. I have the rms-pdstemplate repo on my machine, but I do not know how to get this tool to find it. This might be an occasion for me to improve my usage of VSCode and maybe even virtual machines (which I do not currently use but probably should).

More generally, is there documentation that we would expect a novice user to read that would enable them to run the program? I see the docstring, but it doesn't address questions like what software must you have installed or what commands you must execute in order to be able to run the tool successfully.

esimpsons3ti commented 2 months ago

@matthewtiscareno I understand that the docstrings within the program are not sufficient for a user to go off of. The current plan is to save creating the README file for last. However, I do intend on making a draft of it today.

esimpsons3ti commented 2 months ago

@rfrenchseti Correct, there are currently no tests for label generation. That will be today's project, alongside general unit test coverage.

rfrenchseti commented 2 months ago

I'm looking at the --help with the eyes of a brand new user.

I'm finding these confusing:
```
--output-csv-file OUTPUT_CSV_FILE
                    Specify the location and filename of the index file.
--output-txt-file OUTPUT_TXT_FILE
                    Specify the location and filename of the XPath headers file.
```
Why is the option to specify the index file called --output-csv-file instead of --output-index-file? Why is the option to specify the headers file called --output-txt-file instead of --output-headers-file? Also the capitalized string should probably be something like FILENAME to make it clear what you specify as an argument. If you just duplicate the name of the option it's confusing.
--elements-file doesn't really specify elements, right? An element is a pair of tags plus the thing between the tags, like <a>value</a>. What you're actually specifying is full XPaths to scrape. So maybe it should be called something like --limit-xpaths-file?
Similar to above, --sort-by SORT_BY should say something useful instead of SORT_BY, like HEADER_NAME.

In

--add-extra-file-info {LID,filename,filepath,bundle_lid,bundle} [{LID,filename,filepath,bundle_lid,bundle} ...]
                    Add additional columns to the final index file. Choose from the following: "LID", "filename", "filepath", "bundle_lid", and "bundle". If specifying multiple column
                    names, supply them as separate arguments separated by spaces

you don't need the "Choose from" sentence, because the options are already listed above. Also there should be a period at the end.

For

--config-file CONFIG_FILE
                    Read a user-specified configuration file. File must be a .ini file.

It's not true the file has to be a .ini file. I think you want to say "The file must be in .ini file format."

For
```
--dump-available-xpaths
                    Give a .txt file of all XPaths within the given label files. This file can be used as a base file for --elements-file.
```
"Give" is ambiguous. Maybe say "Write". Does this go along with --output-txt-file? If so, it would be a lot easier to have a single command line option --dump-available-xpaths FILENAME. If it's present, do the dump to the given file. If it's not present, don't dump anything. Also, can you clarify whether these XPaths are before or after simplification? I think they need to be before, because otherwise you can't use them in the "elements" file if you have --simplify-xpaths turned on.

For

--generate-label {Product_Ancillary,Product_Metadata_Supplemental}
                    Generate a PDS4 label for the generated index file. Can generate either a Product_Ancillary or Product_Metadata_Supplemental label.

what is the filename of the resulting label?

For

--label-user-input LABEL_USER_INPUT
                    Provide an optional .yaml file containing additional information for the generated label.

again it doesn't have to be a file with the extension .yaml, it just has to be in YAML format.

I would add an argparse epilog that refers people to the online documentation for more details.
Why is there both a "config file" and a "label user input" file? Wouldn't it be easier to put the label info into the config file? Or do we expect the user to want to keep those separate?
The order of the options as displayed is important to the user's understanding. In fact, I might group them together using argparse's "argument_group" concept. The groupings would be, in this order, with the options in this order (open to discussion):
- Index file generation:
- --output-csv-file
- --add-extra-file-info
- --sort-by
- --fixed-width
- --clean-header-field-names
- --simplify-xpaths
- Limiting XPaths:
- --elements-file
- --dump-available-xpaths
- --output-txt-file
- Label file generation:
- --generate-label
- --label-user-input
- Miscellaneous:
- --config-file
- --verbose

rfrenchseti commented 2 months ago

Now that I look at the code, it seems like --dump-available-xpaths is not actually used for anything anymore and that you already did the change to make --output-txt-file do all the work. So you just have to remove the dead argument.

rfrenchseti commented 2 months ago

I think in the -add-extra-file-info option, we shouldn't capitalize LID since we aren't capitalizing bundle_lid.

rfrenchseti commented 2 months ago

If you specify both --output-txt-file and --output-csv-file, you only get the txt file. It silently fails to write the csv file, which is really confusing, especially if you already had a csv file there that you were trying to update (I spent the last 10 minutes trying to figure out why my csv file had the wrong contents). I see no reason that we can't allow writing both the csv and the header file in the same run of the program.

rfrenchseti commented 2 months ago

--simplify-xpaths is changing the order of the header fields. It should keep them the same. See the attached two files as examples. rings.csv rings_simp.csv

rfrenchseti commented 2 months ago

Separating the arguments to --add-extra-file-info and --sort-by as separate arguments means that argparse will fail to parse the command line in some cases. For example:

❯ python pds4_create_xml_index.py --add-extra-file-info filepath filename LID /mnt/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u0_kao_91cm "**/data/rings/*0m.xml" --verbose --generate-label Product_Ancillary
usage: pds4_create_xml_index.py [-h] [--output-csv-file OUTPUT_CSV_FILE] [--output-txt-file OUTPUT_TXT_FILE] [--elements-file ELEMENTS_FILE] [--simplify-xpaths] [--verbose]
                                [--sort-by SORT_BY [SORT_BY ...]] [--clean-header-field-names]
                                [--add-extra-file-info {LID,filename,filepath,bundle_lid,bundle} [{LID,filename,filepath,bundle_lid,bundle} ...]] [--config-file CONFIG_FILE]
                                [--dump-available-xpaths] [--fixed-width] [--generate-label {Product_Ancillary,Product_Metadata_Supplemental}] [--label-user-input LABEL_USER_INPUT]
                                directorypath patterns [patterns ...]
pds4_create_xml_index.py: error: argument --add-extra-file-info: invalid choice: '/mnt/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u0_kao_91cm' (choose from 'LID', 'filename', 'filepath', 'bundle_lid', 'bundle')

This happens whenever one of the arguments that takes multiple values separated by spaces is followed by the required directory name and patterns. This will be confusing to the user, because we explicitly say the directory name and patterns should come last. In fact, I think they should be able to be provided in any order. The only fix for this is to change the behavior of --add-extra-file-info and --sort-by. I think they should each take a comma-separated list of items instead of a + set of arguments. Then just do a split(',') on the values to get the argument values. You'll have to write your own validation function now, since argparse won't be able to do it. But this is easy to do. Here's an example:

def _validate_planet(planet):
    """Routine for argparse to validate a planet."""
    if not planet in ("jupiter", "saturn", "uranus", "neptune"):
        raise argparse.ArgumentTypeError(
             f"{planet} is not a valid planet - must be one of "
              "jupiter, saturn, uranus, neptune")
    return planet

group.add_argument(
    "--planet", default="saturn",
    type=_validate_planet,
    help=f"""Which planet to process: jupiter, saturn, uranus, neptune
             (saturn is the default)""")

rfrenchseti commented 2 months ago

If you don't specify any output path, you get the error:

Output path not chosen.

First of all, you can print an error for this long before you scrape all the labels. Right now it does all the label scraping (which can take a long time) and then complains. But the use of passive voice and "chosen" is confusing. I would suggest something like "You must specify at least one of: --output-csv-file or --output-txt-file" etc.

rfrenchseti commented 2 months ago

Generating a label file is crashing for me:

❯ python pds4_create_xml_index.py /mnt/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u0_kao_91cm "**/data/rings/*0m.xml" --generate-label Product_Ancillary --output-csv-file rings2.csv
Traceback (most recent call last):
  File "/seti/all_repos/rms-pds4indextools/pds4indextools/pds4_create_xml_index.py", line 1283, in <module>
    main()
  File "/seti/all_repos/rms-pds4indextools/pds4indextools/pds4_create_xml_index.py", line 1217, in main
    true_type = find_base_attribute(xsd_tree, name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/seti/all_repos/rms-pds4indextools/pds4indextools/pds4_create_xml_index.py", line 783, in find_base_attribute
    target_attribute_value = target_attribute_values[0]
                             ~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

rfrenchseti commented 2 months ago

When you dump the elements file with --output-txt-file while using --simplify-xpaths, you get the simplified XPaths twice, once in their simplified form and once in their original form. See the attached file. Example (excerpt from full file):

pds:logical_identifier
pds:title
pds:information_model_version
pds:Product_Observational/pds:Identification_Area<1>/pds:logical_identifier<1>
pds:Product_Observational/pds:Identification_Area<1>/pds:title<1>
pds:Product_Observational/pds:Identification_Area<1>/pds:information_model_version<1>

generated with:

python pds4_create_xml_index.py /mnt/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u0_kao_91cm "**/data/rings/*0m.xml" --simplify-xpaths --output-txt-file rings2_simp.txt

rings2_simp.txt

matthewtiscareno commented 2 months ago

If you don't specify any output path, you get the error:
Output path not chosen.
First of all, you can print an error for this long before you scrape all the labels. Right now it does all the label scraping (which can take a long time) and then complains. But the use of passive voice and "chosen" is confusing. I would suggest something like "You must specify at least one of: --output-csv-file or --output-txt-file" etc.

If the user doesn't specify an output path, wouldn't it be better to write to a default output path, rather than for the tool to crash?

rfrenchseti commented 2 months ago

If you don't specify any output path, you get the error:
Output path not chosen.
First of all, you can print an error for this long before you scrape all the labels. Right now it does all the label scraping (which can take a long time) and then complains. But the use of passive voice and "chosen" is confusing. I would suggest something like "You must specify at least one of: --output-csv-file or --output-txt-file" etc.
If the user doesn't specify an output path, wouldn't it be better to write to a default output path, rather than for the tool to crash?

Previously if you didn't specify an output file, it created a file with a name like index.csv in the root directory of the bundles/collections. I argued that this was a bad default, because it was writing into the (possibly read-only, and at least archival) directory structure and thus modifying something the user wasn't expecting to be modified. I think it would be perfectly fine to write into a default filename as long as 1) the filename was in the user's current directory, not the bundle root, 2) the program printed a message saying what filename it was writing and, and maybe 3) didn't overwrite a file that was already there.

matthewtiscareno commented 2 months ago

I think it would be perfectly fine to write into a default filename as long as 1) the filename was in the user's current directory, not the bundle root, 2) the program printed a message saying what filename it was writing and, and maybe 3) didn't overwrite a file that was already there.

I totally agree. I might accomplish your third point by first checking whether index.csv already exists in the user's current directory. If it does, then change the output name to index1.csv. If that also already exists, then go on to index2.csv, and so forth.

esimpsons3ti commented 2 months ago

This will be confusing to the user, because we explicitly say the directory name and patterns should come last.

Where is that said? They're listed first in the examples we gave.

esimpsons3ti commented 2 months ago

Generating a label file is crashing for me:

❯ python pds4_create_xml_index.py /mnt/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u0_kao_91cm "**/data/rings/*0m.xml" --generate-label Product_Ancillary --output-csv-file rings2.csv
Traceback (most recent call last):
  File "/seti/all_repos/rms-pds4indextools/pds4indextools/pds4_create_xml_index.py", line 1283, in <module>
    main()
  File "/seti/all_repos/rms-pds4indextools/pds4indextools/pds4_create_xml_index.py", line 1217, in main
    true_type = find_base_attribute(xsd_tree, name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/seti/all_repos/rms-pds4indextools/pds4indextools/pds4_create_xml_index.py", line 783, in find_base_attribute
    target_attribute_value = target_attribute_values[0]
                             ~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

This has been fixed with changes to --simplify-xpath's behavior.

rfrenchseti commented 2 months ago

This will be confusing to the user, because we explicitly say the directory name and patterns should come last.

Where is that said? They're listed first in the examples we gave.

It's in the --help output. I don't know why it's last there, since you put it first in your list of arguments, but I guess it just lists unnamed arguments at the end by default. Still, I think having multiple values be separated by spaces is confusing. That's not something command line programs normally do for multiple options - they have them separated by commas. If we just changed that convention, it would solve this problem entirely.

usage: pds4_create_xml_index.py [-h] [--output-csv-file OUTPUT_CSV_FILE] [--output-txt-file OUTPUT_TXT_FILE] [--elements-file ELEMENTS_FILE] [--simplify-xpaths] [--verbose]
                                [--sort-by SORT_BY [SORT_BY ...]] [--clean-header-field-names]
                                [--add-extra-file-info {LID,filename,filepath,bundle_lid,bundle} [{LID,filename,filepath,bundle_lid,bundle} ...]] [--config-file CONFIG_FILE]
                                [--dump-available-xpaths] [--fixed-width] [--generate-label {Product_Ancillary,Product_Metadata_Supplemental}] [--label-user-input LABEL_USER_INPUT]
                                directorypath patterns [patterns ...]

esimpsons3ti commented 2 months ago

Separating the arguments to --add-extra-file-info and --sort-by as separate arguments means that argparse will fail to parse the command line in some cases. For example:
❯ python pds4_create_xml_index.py --add-extra-file-info filepath filename LID /mnt/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u0_kao_91cm "**/data/rings/*0m.xml" --verbose --generate-label Product_Ancillary
usage: pds4_create_xml_index.py [-h] [--output-csv-file OUTPUT_CSV_FILE] [--output-txt-file OUTPUT_TXT_FILE] [--elements-file ELEMENTS_FILE] [--simplify-xpaths] [--verbose]
                                [--sort-by SORT_BY [SORT_BY ...]] [--clean-header-field-names]
                                [--add-extra-file-info {LID,filename,filepath,bundle_lid,bundle} [{LID,filename,filepath,bundle_lid,bundle} ...]] [--config-file CONFIG_FILE]
                                [--dump-available-xpaths] [--fixed-width] [--generate-label {Product_Ancillary,Product_Metadata_Supplemental}] [--label-user-input LABEL_USER_INPUT]
                                directorypath patterns [patterns ...]
pds4_create_xml_index.py: error: argument --add-extra-file-info: invalid choice: '/mnt/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u0_kao_91cm' (choose from 'LID', 'filename', 'filepath', 'bundle_lid', 'bundle')
This happens whenever one of the arguments that takes multiple values separated by spaces is followed by the required directory name and patterns. This will be confusing to the user, because we explicitly say the directory name and patterns should come last. In fact, I think they should be able to be provided in any order. The only fix for this is to change the behavior of --add-extra-file-info and --sort-by. I think they should each take a comma-separated list of items instead of a + set of arguments. Then just do a split(',') on the values to get the argument values. You'll have to write your own validation function now, since argparse won't be able to do it. But this is easy to do. Here's an example:
def _validate_planet(planet):
    """Routine for argparse to validate a planet."""
    if not planet in ("jupiter", "saturn", "uranus", "neptune"):
        raise argparse.ArgumentTypeError(
             f"{planet} is not a valid planet - must be one of "
              "jupiter, saturn, uranus, neptune")
    return planet

group.add_argument(
    "--planet", default="saturn",
    type=_validate_planet,
    help=f"""Which planet to process: jupiter, saturn, uranus, neptune
             (saturn is the default)""")

Currently, --sort-by can sort by anything, not just the values in --add-extra-file-info. Do we want to limit --sort-by to only sort by the allowed values for --add-extra-file-info, or continue to allow users to sort by anything?

rfrenchseti commented 2 months ago

Currently, --sort-by can sort by anything, not just the values in --add-extra-file-info. Do we want to limit --sort-by to only sort by the allowed values for --add-extra-file-info, or continue to allow users to sort by anything?

I'm not sure where this question is coming from. Why would we change the behavior of --sort-by? The only thing I'm saying is that instead of: --sort-by field1 field2 field3 we require --sort-by field1,field2,field3. Just take a single string argument instead of "+", and split on commas to get the list of fields. For the additional file info fields, you'll have to validate the entries in that new list.

esimpsons3ti commented 2 months ago

Note: Unit test for label generation have not been added yet. I'm just making a new commit to contain the changes from the above comments.

rfrenchseti commented 2 months ago

I am approving this pull, with the understanding that there are additional issues that need to be addressed, including:

18
17
16

SETI / rms-pds4indextools

PDS4 Indexing Tools Improvements and Additions #15

Welcome to Codecov :tada:

18

17

16