Hello @robintw, I've tried to respond to the outstanding questions, using bold italics. I've also added some comments in the above PR.
I also note that the parser isn't currently running (warning about circular import), though I've no doubt you'll get to the bottom of it :-D
All items either resolved or overtaken-by-events (agreed in Slack)
As requested by @IanMayo, here is an initial code review of the code in `parser/`. I've included a lot of references to specific lines in specific files - hopefully Github will make them into nice links, but if not then it should be relatively easy to find them using VS Code (Ctrl-G triggers the Go To Line functionality). Please feel free to ask for clarification on any of these, or to express disagreement (a lot of them are open to interpretation and I'm quite happy to discuss them).
## General notes
- The `htmlToDITA` function is particularly well structured and documented - some would say it should be replaced with a series of functions that do each transformation and are called in turn on the soup, but I don't think that would actually make it any clearer or easier to maintain.
- `parser/parser.py:23`, `parser/parser.py:29` and many more: you might want to investigate pathlib, which lets you do nice things like `base_path / dir1 / dir2` and will do the path joining for you. I think that could significantly clean up various parts of the codebase. It can also replace various calls like `os.path.dirname` etc.
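  For illustration, a minimal sketch of the kind of cleanup pathlib enables (the paths and variable names here are hypothetical, not taken from the codebase):

  ```python
  from pathlib import Path

  base_path = Path("output")

  # The / operator does the joining that os.path.join does now
  dita_file = base_path / "dita" / "reference" / "page.dita"

  # Common os.path calls have direct equivalents
  parent_dir = dita_file.parent   # os.path.dirname
  file_name = dita_file.name      # os.path.basename
  ```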
- The `replace_characters` function is used to get rid of `%20` (e.g. `parser/class_files.py:114`). I assume you're doing this to remove URL-encoded spaces from the filenames. That makes sense, but might there be other URL-encoded characters that need updating? You could use a function like `urllib.parse.unquote` to remove all of them and then just remove the spaces? Might be more robust - but not essential. ***Need input from original author***
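  A sketch of the `unquote`-based approach, assuming the filenames really are URL-encoded (the function name is illustrative, and the replacement for the decoded spaces should match whatever `replace_characters` currently substitutes):

  ```python
  from urllib.parse import unquote

  def clean_filename(name: str) -> str:
      # Decode every percent-escape (%20, %28, %C3%A9, ...), not just %20,
      # then strip the now-literal spaces
      return unquote(name).replace(" ", "")
  ```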
- A useful new function could be something like `write_prettified_xml`, which takes a `dita_soup` variable, does the equivalent of `parser/reference_files.py:76` to `parser/reference_files.py:79` and returns the filename - this could be used in various places and would reduce a bit of duplication.
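  A minimal sketch of what that helper might look like - the body is a guess at what lines 76-79 of `reference_files.py` do (prettify the soup and write it out):

  ```python
  def write_prettified_xml(dita_soup, output_filename):
      """Prettify a DITA soup, write it to output_filename and return the name."""
      with open(output_filename, "w", encoding="utf-8") as f:
          f.write(dita_soup.prettify())
      return output_filename
  ```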
- `parser/parser_utils.py:83`: Not sure why you've got a definition of `__all__` here or in various other utility files - that's normally just used in an `__init__.py`, and importing should work automatically without `__all__` in a normal Python file.

## Specific notes
- `lxml` needs to be added to `requirements.txt`.
- The `.vscode` directory took over my editor settings when I opened the project, specifically by making the entire editor orange, which made me jump a bit! This directory is usually for user-specific custom editor settings, which don't need to be shared.
- `parser/parser.py:112`: What is `ns` in the name of this function? Being in the name of the function makes me think it is important, but it scares me that I can't work out what it is.
- `parser/parser.py:388`: We should check the result of `subprocess.run` here, as we currently don't know whether it has succeeded or not - and in the next few lines we say that the processing is complete, even though if the `subprocess.run` call has failed it won't actually have completed. Checking the return code of the process is probably enough, but we might need to check the actual output.
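  A sketch of the return-code check (`cmd` stands in for whatever command is actually run at that line):

  ```python
  import subprocess

  result = subprocess.run(cmd, capture_output=True, text=True)
  if result.returncode != 0:
      raise RuntimeError(f"Processing failed (exit {result.returncode}):\n{result.stderr}")
  # Alternatively, pass check=True and subprocess.run will raise
  # CalledProcessError itself on a non-zero exit
  ```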
- `parser/parser.py:376`: This seems to be a misleading name and comment - as far as I can tell, this is the entry point for parsing the whole set of documents, not just the region map.
- `parser/parser.py:34`: This `dita_soup` variable is created here, used briefly to create an image tag, and then assigned to again at `parser/parser.py:50` - and the first assignment uses `lxml-xml` while the second uses just `xml`. I suspect the second definition can be moved up to replace the first one, but I'm not sure.
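  For what it's worth, with lxml installed the two names select the same underlying parser, so unifying them should be safe:

  ```python
  from bs4 import BeautifulSoup

  # "xml" and "lxml-xml" are aliases for lxml's XML parser in Beautiful Soup,
  # so these two soups are parsed identically
  soup_a = BeautifulSoup("<root/>", "lxml-xml")
  soup_b = BeautifulSoup("<root/>", "xml")
  ```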
- `parser/parser_utils.py:77`: This actually removes multiple leading slashes, not just one.
- `parser/parser.py:117`: Comment outdated - this just sets up the HTML parsing. ***Need input from original author*** ***Comment included***
- `parser/parser.py:126`: This exception isn't being caught anywhere - is this deliberate, to crash the processing if it is raised? If not, it'll need to be caught at a higher level.
- `parser/parser_utils.py:18`: This function can be replaced with `os.makedirs` with the argument `exist_ok=True`.
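  i.e. something like this (`target_dir` is a placeholder for the directory being created):

  ```python
  import os

  # Creates any missing intermediate directories, and doesn't error
  # if the directory already exists
  os.makedirs(target_dir, exist_ok=True)
  ```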
- `parser/parser.py:187`: This is a bit unpleasant - I think it could be simplified with proper use of `os.path.join` etc., or (ideally) pathlib. The same applies to the various bits of path-manipulation code in the following lines.
- `parser/parser_utils.py:36`: This could be simplified by replacing the first `if` with `if not filenames: filenames = os.listdir(source_dir)` and then having just one lot of copying commands (reduces code duplication).
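  A sketch of the suggested shape, assuming a signature along the lines of `copy_files(source_dir, dest_dir, filenames)` (the real signature may differ):

  ```python
  import os
  import shutil

  def copy_files(source_dir, dest_dir, filenames):
      # Treat an empty list as "copy everything in source_dir"
      if not filenames:
          filenames = os.listdir(source_dir)
      # One copy loop instead of two near-identical branches
      for name in filenames:
          shutil.copy(os.path.join(source_dir, name), os.path.join(dest_dir, name))
  ```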
- `parser/parser_utils.py:73`: This function probably isn't needed - I don't think it is any simpler than the standard Python `s.replace('x', 'y')`. ***Need input from original author*** ***Fixed***
- `parser/parser.py:221`: The first argument to this function is unused - should it be used?
- `parser/parser.py:270`: `tr_count` in this for loop is unused - if you're not using it then you can remove the `enumerate` call. The same applies a couple of lines later.
- `parser/parser.py:343`: It looks like this call to `get_files_in_path` can be removed - you can just pass an empty list of files to `copy_files`, and then it'll copy all the files in the source folder. The same applies at `parser/parser.py:210`. Those seem to be the only two places that `get_files_in_path` is used, so it can probably be removed after that. ***Agreed***
- `parser/parser_utils.py:26`: It's probably not worth having a function defined that just calls another function with the same arguments.
- `parser/class_files.py:81`: The `file_name` argument is unused - should it be used? ***Need input from original author***
- `parser/class_files.py:100`: Why would entries in the `children` attribute of a tag not be BS4 tags? I can totally accept that they might not be (might they be raw text?), but it'd be good to have a comment here explaining what you're filtering out and why. ***Need input from original author***
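  On that question: `children` does yield non-tag nodes - the text and whitespace between elements come through as `NavigableString`s - so a type check plus a comment would make the intent clear (`tag` here stands for the element in question):

  ```python
  from bs4 import Tag

  # .children yields NavigableStrings (e.g. whitespace between elements)
  # as well as Tags, so keep only the real element nodes
  element_children = [child for child in tag.children if isinstance(child, Tag)]
  ```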
- `parser/class_files.py:120`: It looks like this TODO might have been done already? ***Need input from original author***
- `parser/class_files.py:48`: This error is being reported by a print statement, whereas various others are reported by raising exceptions. Is there a reason for this difference?
- `parser/reference_files.py:9`: This function needs a clearer docstring - it says it just converts an HTML file to DITA, but that's a description of the whole project, and more specifically of the `htmlToDITA` function. The docstring should explain what a non-class file is and how this function parses one.
- `parser/class_files.py:286`: The code here and at `parser/class_files.py:353` is very similar - can it be refactored into one function that takes an argument for which sort of tag to find (remarks or propulsion) and an argument for which function to call to process the relevant type of page? Actually, they might even be the same function called with different arguments. ***Leaving for now, as we're not using class_files.py at the moment***
- `parser/html_to_dita.py:7`: What is the purpose of this function? Is it just for development testing? It looks like it should be a unit test (and would probably be found automatically by the pytest runner), but it doesn't have any asserts or similar in it. ***Need input from original author*** ***Done***
- `parser/html_to_dita.py:35`: This is a fairly complex function, so the docstring should reflect that. It would be useful to know why divs have to be replaced sometimes and not other times, and also useful to know the type of the return value (it looks like it is a soup object, not a string of XML as implied by the docstring). ***Need input from original author - I don't know why divs have to be replaced sometimes and not other times. Have tried to extend the docstring to explain the div strategy.***
- `parser/html_to_dita.py:46`: It looks like the TODO has been done. It would be useful to replace this comment with one explaining why a copy needs to be taken. ***Need input from original author - I don't know why the copy needs to be taken. Have tried to explain in the relevant TODO.***
- `parser/html_to_dita.py:51`: This if statement seems unnecessary - you can just do the assignment, and if the value is the same as what it's being replaced with then it'll stay the same. The only reason not to do this would be a performance impact, which I don't think there would be here.
- `parser/html_to_dita.py:49`: This actually replaces it with whatever is in `div_replacement`, so the comment is misleading.
- `parser/html_to_dita.py:58`: This could do with a numbered header, as it seems to be a different transformation and not part of number 1. This comment doesn't seem to match the code below it. ***Done***
- `parser/html_to_dita.py:78`: What is A9 in this context? ***Need input from original author*** ***I have a lookup document of target files, indexed using this code. It is useful so I know which target document I found the pattern in.***
- `parser/html_to_dita.py:87`: Using an off-the-shelf URL tidier would be fine.
- `parser/html_to_dita.py:105`: Does this need better URL-encoding handling (see the general notes), or are we sure we'll only find spaces and not any other invalid HTML characters?
- `parser/html_to_dita.py:121`: This comment and the one below it are out of step with the code - they should say h2 and h3 respectively. In fact, this code could be replaced with a loop over "h1", "h2" and "h3".
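  Something like this (`html_soup` and the loop body are placeholders for the existing per-heading code):

  ```python
  # One loop instead of three near-identical blocks
  for level in ["h1", "h2", "h3"]:
      for heading in html_soup.find_all(level):
          ...  # apply the existing per-heading transformation here
  ```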
- `parser/html_to_dita.py:144`: The repeated code here could be put in a for loop too.
- `parser/html_to_dita.py:162`: Please use `os.path.join` (or pathlib) here, rather than just the string `join` function.
- `parser/html_to_dita.py:183`: Is this meant to be a call to the `clear()` function? ***Need input from original author*** ***Yes, it looks like a bug***
- `parser/html_to_dita.py:262`: This just checks whether the text is one character long, rather than checking whether it's actually a newline. I doubt there are many single-character bits of text, but it is probably worth checking for `\n` here instead/as well.
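  For example (`text` is a placeholder for the value being tested):

  ```python
  # Test the content explicitly rather than relying on the length
  if text == "\n":
      ...  # handle the newline-only case
  # or, to also catch "\r\n" and other whitespace-only strings:
  if text.strip() == "":
      ...
  ```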