Refactoring command into main block

Bankde commented 5 years ago

I decide to put the command line argument part into the main block. This will allow other module to "import linkfinder" for further use.

I remove these lines because I notice that jsbeautifier is required for this module anyway.

# Newlines in regex? Important for CLI output without jsbeautifier
if args.output != 'cli':
    addition = ("[^\n]*","[^\n]*")
else:
    addition = ("","")

Also I don't really understand this part but I don't think output argument should alter input content so I remove this if args.output != 'cli':. If it is necessary by any mean, I will add it back.

def parser_file(content, regex, more_regex=None):
...
    # Beautify
    if args.output != 'cli': # <-- I don't know what this line does
        if len(content) > 1000000:
            content = content.replace(";",";\r\n").replace(",",",\r\n")
        else:
            content = jsbeautifier.beautify(content)

Also I try to remove global argument dependency. Accepting as function parameter will make the project easier to maintain and allow other code to import. I haven't done fixing this yet. I rush to fix parser_file function first so I could add unit-test into this project.

Any suggestion/change/opinion is appreciated. Also since I did change a lot of lines, (even I have tested on my part, it's might not be covered), further test is also appreciated.

GerbenJavado commented 5 years ago

This is very useful. I'll look at this in more detail tomorrow, but for now a big THANK YOU 😄

GerbenJavado commented 5 years ago

The reason why

if args.output != 'cli':
    addition = ("[^\n]*","[^\n]*")

Is there is because if we use CLI we don't care about the context of where the URL is in. Meaning we just want the URL without surrounding code. Because of this we dont need to beautify the code or explode based on ,;.

# Beautify (only if it is not CLI mode)
    if args.output != 'cli':

This also makes LinkFinder a lot quicker and is why we can't remove these if statements. Yes if we remove them it would still work. But the program would be a lot slower in CLI mode.

Bankde commented 5 years ago

I think I understand that now. Will see how can I refactor it tmr.

Thank for a review.

Bankde commented 5 years ago

@GerbenJavado Could you take a look again ?

I have wrapped the "surrounding code" regex into function mode "with context" and "without context".
I still leave output to cli default to "without context" and html default to "with context" as it is. We could add the option to switch mode later.

I haven't refactor'ed the whole module, only focus on regex parser right now so anyone else could easily run the test and improve the regex and able to call the function externally. Tell me if there is any function/code that need to be focused. I will look more into it.

GerbenJavado commented 5 years ago

For some reason:

python linkfinder.py -i https://www.nu.nl/ -o cli -d

Traceback (most recent call last):
  File "linkfinder.py", line 277, in <module>
    endpoint = cgi.escape(endpoint[1]).encode('ascii', 'ignore').decode('utf8')
IndexError: list index out of range

is broken. I'll look into this soon.

Bankde commented 5 years ago

Oops sry, I just noticed my silly mistake. Confirm error. I will look into it.

It is because of regex. I noticed that you decide not to include context in cli-mode so I remove unnecessary context detection from regex (resulting in cli-mode endpoint[1] won't exist). That's why I change endpoint[1] to endpoint[0] in cli_output.

Easiest way is to include the context back into cli-mode. Hacky version is to add mode condition like you have stated. Best version I could think is to re-ordering the regex result so endpoint[0] of both html and cli version are plain answer and let endpoint[1] be the answer with context. I will try this solution tonight.

Sorry for inconvenient, I have not include "-d" mode in my test. Is there anything else I should know or any command you usually include in test ?

Thank you so much for reviewing.

GerbenJavado commented 5 years ago

No @Bankde looks quite solid already. Also modifying someone else's code is never easy. Im already very happy with these changes.

Bankde commented 5 years ago

Should work now. The computing time is pretty much same as previous regex.

By the way, I have removed this line 189 group = list(filter(None, item)) I guess that it's to copy the list. However, I see that the list is not shared with any other function so copying might not be needed. I might have missed something tho.

GerbenJavado commented 5 years ago

As far as I know I implemented list(filter(None, item)) to prevent exact duplicates. Would removing this remove this feature?

Edit: exact duplicates as in context + url are the same. This is quite rare so removing it might not be the worst thing

Bankde commented 5 years ago

I think the line that remove duplicate was this one items = list(set(items)) which I have replaced with this one items = {item['link']:item for item in items}.values()

I also add test to check that no duplicate value is returned. Note that I could only test cli. Testing html result will be bigger work here.

GerbenJavado commented 5 years ago

Thanks for your awesome work!

GerbenJavado / LinkFinder

Refactoring command into main block #43