Hello @robintw, I've tried to respond to the outstanding questions, using bold italics. I've also added some comments in the above PR.
I also note that the parser isn't currently running (warning about circular import), though I've no doubt you'll get to the bottom of it :-D
All items either resolved or overtaken-by-events (agreed in Slack)
As requested by @IanMayo, here is an initial code review of the code in `parser/`. I've included a lot of references to specific lines in specific files - hopefully Github will make them into nice links, but if not then it should be relatively easy to find them using VS Code (Ctrl-G triggers the Go To Line functionality). Please feel free to ask for clarification on any of these, or to express disagreement (a lot of them are open to interpretation and I'm quite happy to discuss them).
## General notes
- The `htmlToDITA` function is particularly well structured and documented - some would say it should be replaced with a series of functions that do each transformation and are called in turn on the soup, but I don't think that would actually make it any clearer or easier to maintain.
- `parser/parser.py:23`, `parser/parser.py:29` and many more: you might want to investigate pathlib, which lets you do nice things like `base_path / dir1 / dir2` and will do the path joining for you. I think that could significantly clean up various parts of the codebase. It can also replace various calls like `os.path.dirname` etc.
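  For illustration, a minimal sketch of the kind of cleanup pathlib enables (the paths and variable names here are hypothetical, not taken from the codebase):

  ```python
  from pathlib import Path

  base_path = Path("output")

  # The / operator does the joining that os.path.join does now
  dita_file = base_path / "dita" / "reference" / "page.dita"

  # Common os.path calls have direct equivalents
  parent_dir = dita_file.parent   # os.path.dirname
  file_name = dita_file.name      # os.path.basename
  ```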
- The `replace_characters` function is used to get rid of `%20` (e.g. `parser/class_files.py:114`). I assume you're doing this to remove URL-encoded spaces from the filenames. That makes sense, but might there be other URL-encoded characters that need updating? You could use a function like `urllib.parse.unquote` to remove all of them and then just remove the spaces? Might be more robust - but not essential. ***Need input from original author***
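  A sketch of the `unquote`-based approach, assuming the filenames really are URL-encoded (the function name is illustrative, and the replacement for the decoded spaces should match whatever `replace_characters` currently substitutes):

  ```python
  from urllib.parse import unquote

  def clean_filename(name: str) -> str:
      # Decode every percent-escape (%20, %28, %C3%A9, ...), not just %20,
      # then strip the now-literal spaces
      return unquote(name).replace(" ", "")
  ```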
- A useful new function could be something like `write_prettified_xml`, which takes a `dita_soup` variable, does the equivalent of `parser/reference_files.py:76` to `parser/reference_files.py:79` and returns the filename - this could be used in various places and would reduce a bit of duplication.
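  A minimal sketch of what that helper might look like - the body is a guess at what lines 76-79 of `reference_files.py` do (prettify the soup and write it out):

  ```python
  def write_prettified_xml(dita_soup, output_filename):
      """Prettify a DITA soup, write it to output_filename and return the name."""
      with open(output_filename, "w", encoding="utf-8") as f:
          f.write(dita_soup.prettify())
      return output_filename
  ```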
- `parser/parser_utils.py:83`: Not sure why you've got a definition of `__all__` here or in various other utility files - that's normally just used in an `__init__.py`, and importing should work automatically without `__all__` in a normal Python file.

## Specific notes
- `lxml` needs to be added to `requirements.txt`.
- The `.vscode` directory took over my editor settings when I opened the project, specifically by making the entire editor orange, which made me jump a bit! This directory is usually for user-specific custom editor settings, which don't need to be shared.
- `parser/parser.py:112`: What is `ns` in the name of this function? Being in the name of the function makes me think it is important, but it scares me that I can't work out what it is.
- `parser/parser.py:388`: We should check the result of `subprocess.run` here, as we currently don't know whether it has succeeded or not - and in the next few lines we say that the processing is complete, even though if the `subprocess.run` call has failed it won't actually have completed. Checking the return code of the process is probably enough, but we might need to check the actual output.
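  A sketch of the return-code check (`cmd` stands in for whatever command is actually run at that line):

  ```python
  import subprocess

  result = subprocess.run(cmd, capture_output=True, text=True)
  if result.returncode != 0:
      raise RuntimeError(f"Processing failed (exit {result.returncode}):\n{result.stderr}")
  # Alternatively, pass check=True and subprocess.run will raise
  # CalledProcessError itself on a non-zero exit
  ```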
- `parser/parser.py:376`: This seems to be a misleading name and comment - as far as I can tell, this is the entry point for parsing the whole set of documents, not just the region map.
- `parser/parser.py:34`: This `dita_soup` variable is created here, used briefly to create an image tag, and then assigned to again at `parser/parser.py:50` - and the first assignment uses `lxml-xml` while the second uses just `xml`. I suspect the second definition can be moved up to replace the first one, but I'm not sure.
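  For what it's worth, with lxml installed the two names select the same underlying parser, so unifying them should be safe:

  ```python
  from bs4 import BeautifulSoup

  # "xml" and "lxml-xml" are aliases for lxml's XML parser in Beautiful Soup,
  # so these two soups are parsed identically
  soup_a = BeautifulSoup("<root/>", "lxml-xml")
  soup_b = BeautifulSoup("<root/>", "xml")
  ```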
- `parser/parser_utils.py:77`: This actually removes multiple leading slashes, not just one.
- `parser/parser.py:117`: Comment outdated - this just sets up the HTML parsing. ***Need input from original author*** ***Comment included***
- `parser/parser.py:126`: This exception isn't being caught anywhere - is this deliberate, to crash the processing if it is raised? If not, it'll need to be caught at a higher level.
- `parser/parser_utils.py:18`: This function can be replaced with `os.makedirs` with the argument `exist_ok=True`.
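  i.e. something like this (`target_dir` is a placeholder for the directory being created):

  ```python
  import os

  # Creates any missing intermediate directories, and doesn't error
  # if the directory already exists
  os.makedirs(target_dir, exist_ok=True)
  ```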
- `parser/parser.py:187`: This is a bit unpleasant - I think it could be simplified with proper use of `os.path.join` etc., or (ideally) pathlib. The same applies to the various bits of path-manipulation code in the following lines.
- `parser/parser_utils.py:36`: This could be simplified by replacing the first `if` with `if not filenames: filenames = os.listdir(source_dir)` and then having just one lot of copying commands (reduces code duplication).
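  A sketch of the suggested shape, assuming a signature along the lines of `copy_files(source_dir, dest_dir, filenames)` (the real signature may differ):

  ```python
  import os
  import shutil

  def copy_files(source_dir, dest_dir, filenames):
      # Treat an empty list as "copy everything in source_dir"
      if not filenames:
          filenames = os.listdir(source_dir)
      # One copy loop instead of two near-identical branches
      for name in filenames:
          shutil.copy(os.path.join(source_dir, name), os.path.join(dest_dir, name))
  ```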
- `parser/parser_utils.py:73`: This function probably isn't needed - I don't think it is any simpler than the standard Python `s.replace('x', 'y')`. ***Need input from original author*** ***Fixed***
- `parser/parser.py:221`: The first argument to this function is unused - should it be used?
- `parser/parser.py:270`: `tr_count` in this for loop is unused - if you're not using it then you can remove the `enumerate` call. The same applies a couple of lines later.
- `parser/parser.py:343`: It looks like this call to `get_files_in_path` can be removed - you can just pass an empty list of files to `copy_files`, and then it'll copy all the files in the source folder. The same applies at `parser/parser.py:210`. Those seem to be the only two places that `get_files_in_path` is used, so it can probably be removed after that. ***Agreed***
- `parser/parser_utils.py:26`: It's probably not worth having a function defined that just calls another function with the same arguments.
- `parser/class_files.py:81`: The `file_name` argument is unused - should it be used? ***Need input from original author***
- `parser/class_files.py:100`: Why would entries in the `children` attribute of a tag not be BS4 tags? I can totally accept that they might not be (might they be raw text?), but it'd be good to have a comment here explaining what you're filtering out and why. ***Need input from original author***
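  On that question: `children` does yield non-tag nodes - the text and whitespace between elements come through as `NavigableString`s - so a type check plus a comment would make the intent clear (`tag` here stands for the element in question):

  ```python
  from bs4 import Tag

  # .children yields NavigableStrings (e.g. whitespace between elements)
  # as well as Tags, so keep only the real element nodes
  element_children = [child for child in tag.children if isinstance(child, Tag)]
  ```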
- `parser/class_files.py:120`: It looks like this TODO might have been done already? ***Need input from original author***
- `parser/class_files.py:48`: This error is being reported by a print statement, whereas various others are reported by raising exceptions. Is there a reason for this difference?
- `parser/reference_files.py:9`: This function needs a clearer docstring - it says it just converts an HTML file to DITA, but that's a description of the whole project, and more specifically of the `htmlToDITA` function. The docstring should explain what a non-class file is and how this function parses one.
- `parser/class_files.py:286`: The code here and at `parser/class_files.py:353` is very similar - can it be refactored into one function that takes an argument for which sort of tag to find (remarks or propulsion) and an argument for which function to call to process the relevant type of page? Actually, they might even be the same function called with different arguments. ***Leaving for now, as we're not using class_files.py at the moment***
- `parser/html_to_dita.py:7`: What is the purpose of this function? Is it just for development testing? It looks like it should be a unit test (and would probably be found automatically by the pytest runner), but it doesn't have any asserts or similar in it. ***Need input from original author*** ***Done***
- `parser/html_to_dita.py:35`: This is a fairly complex function, so the docstring should reflect that. It would be useful to know why divs have to be replaced sometimes and not other times, and also useful to know the type of the return value (it looks like it is a soup object, not a string of XML as implied by the docstring). ***Need input from original author - I don't know why divs have to be replaced sometimes and not other times. Have tried to extend the docstring to explain the div strategy.***
- `parser/html_to_dita.py:46`: It looks like the TODO has been done. It would be useful to replace this comment with one explaining why a copy needs to be taken. ***Need input from original author - I don't know why the copy needs to be taken. Have tried to explain in the relevant TODO.***
- `parser/html_to_dita.py:51`: This if statement seems unnecessary - you can just do the assignment, and if the value is the same as what it's being replaced with then it'll stay the same. The only reason not to do this would be a performance impact, which I don't think there would be here.
- `parser/html_to_dita.py:49`: This actually replaces it with whatever is in `div_replacement`, so the comment is misleading.
- `parser/html_to_dita.py:58`: This could do with a numbered header, as it seems to be a different transformation and not part of number 1. This comment doesn't seem to match the code below it. ***Done***
- `parser/html_to_dita.py:78`: What is A9 in this context? ***Need input from original author*** ***I have a lookup document of target files, indexed using this code. It is useful so I know which target document I found the pattern in.***
- `parser/html_to_dita.py:87`: Using an off-the-shelf URL tidier would be fine.
- `parser/html_to_dita.py:105`: Does this need better URL-encoding handling (see the general notes), or are we sure we'll only find spaces and not any other invalid HTML characters?
- `parser/html_to_dita.py:121`: This comment and the one below it are out of step with the code - they should say h2 and h3 respectively. In fact, this code could be replaced with a loop over "h1", "h2" and "h3".
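  Something like this (`html_soup` and the loop body are placeholders for the existing per-heading code):

  ```python
  # One loop instead of three near-identical blocks
  for level in ["h1", "h2", "h3"]:
      for heading in html_soup.find_all(level):
          ...  # apply the existing per-heading transformation here
  ```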
- `parser/html_to_dita.py:144`: The repeated code here could be put in a for loop too.
- `parser/html_to_dita.py:162`: Please use `os.path.join` (or pathlib) here, rather than just the string `join` function.
- `parser/html_to_dita.py:183`: Is this meant to be a call to the `clear()` function? ***Need input from original author*** ***Yes, it looks like a bug***
- `parser/html_to_dita.py:262`: This just checks whether the text is one character long, rather than checking whether it's actually a newline. I doubt there are many single-character bits of text, but it is probably worth checking for `\n` here instead/as well.
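  For example (`text` is a placeholder for the value being tested):

  ```python
  # Test the content explicitly rather than relying on the length
  if text == "\n":
      ...  # handle the newline-only case
  # or, to also catch "\r\n" and other whitespace-only strings:
  if text.strip() == "":
      ...
  ```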