Refactoring (step up module architecture, establish central location for common functionality, cleanup linux analyzer)

luis261 commented 6 months ago

Module execution improvements (for now only applied to the Linux analyzer; other modules will be modified in dedicated PRs):

the simplest modules, which don't need any upfront input, can simply be run via import
more involved modules should provide the following definitions:
- run: a function that takes all required arguments as parameters; these would otherwise be provided via sys.argv when running the module directly in a shell (instead of inside Python via import)
- main: a function that wraps run, executing it with input from sys.argv
- an if statement that runs the main function if __name__ == "__main__"
in qu1cksc0pe.py, we can then import modules as required without the need for subshells/os commands (if they need input, we pass it via <modulename>.run)
this architecture lets us avoid workarounds for passing inputs to analyzers, e.g. via the .path_handler tempfile (instead, we just pass any needed information to a given module via a parameter of its run function)
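
A minimal sketch of the run/main layout described above. The function bodies, the parameter names (target_file, emit_report) and the returned dictionary are illustrative assumptions, not code taken from the repository:

```python
import sys


def run(target_file, emit_report=False):
    """Do the actual work; every input arrives as a parameter, not via sys.argv."""
    # Placeholder result; a real analyzer would inspect target_file here.
    return {"target": target_file, "report": emit_report}


def main(argv=None):
    """Thin wrapper that maps command-line input onto run's parameters."""
    argv = sys.argv if argv is None else argv
    if len(argv) < 2:
        print(f"usage: {argv[0]} <target_file> [True|False]")
        return 1
    emit_report = len(argv) > 2 and argv[2] == "True"
    run(argv[1], emit_report)
    return 0


if __name__ == "__main__":
    main()
```

With this shape, a caller such as qu1cksc0pe.py could import the module and call its run function with the needed values, while running the file directly in a shell still works through main.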

[!NOTE] changes below were already communicated/agreed upon previously:

Conversion of Modules directory into a Python package

I think this is worth it for sharing functionality alone (see below), but there is more to it than just that:

Having had enough experience myself with Python's quirky import system, I think I have some understanding for why it might be preferable to avoid dealing with it in some circumstances. Can you tell me the specific reasons though why you execute modules via os.system instead of staying inside a single Python interpreter instance if there are any?

If you're ok with it, I'll slowly move us towards staying inside a single Python instance by leveraging import statements instead of invoking modules directly via the operating system, since I think the advantages outweigh/justify using the established import system:

performance (launching a subshell via os.system every time we execute a module incurs more overhead than simply importing into the current Python interpreter)
security (less risk of accidentally exposing/running different files on the system that aren't part of the program)

ideally we would rename the directory to be lowercase as part of this change to conform to Python module naming conventions .. I held off on it for now, since it's not strictly necessary and I'm unsure of the implications in terms of the other locations we'd have to adjust

Extraction of utility function previously strictly local to main module into separate module

might not seem sensible for now given that it only houses a single, tiny function
I'm sure it'll pay off in the long haul/soon though
- can continue extracting duplicate functionality there, to be reused across different modules

luis261 commented 5 months ago

@CYB3RMX

Sorry about the earlier information overload, I investigated this further in the meantime, here are the facts, as compact as I can manage:

when filtering out any occurrences related to "file", only 14 occurrences of raw sys.argv indexing remain
I analyzed the context for each of these code snippets: it's not problematic in these cases because all of these analyzers only get called from qu1cksc0pe.py in a manner where all expected values are present (so sys.argv is long enough => no IndexError)

Overall, we should be fine now. Again though, these IndexErrors weren't introduced by my refactoring, as evidenced by: https://github.com/CYB3RMX/Qu1cksc0pe/blob/master/Modules/winAnalyzer.py#L643 (main branch) (so the errors should also be reproducible on main at the current version: e11dfe4)

Qu1cksc0pe>git grep "sys.argv\[" | python txtfilter.py -x Modules/utils.py file get_argv
Modules/VTwrapper.py: apikey = str(sys.argv[1])
Modules/andro_familydetect.py:targetApk = sys.argv[1]
Modules/apkAnalyzer.py:targetAPK = sys.argv[1]
Modules/apkAnalyzer.py: if sys.argv[3] == "JAR":
Modules/apkAnalyzer.py: if sys.argv[3] == "DEX":
Modules/apkAnalyzer.py: if sys.argv[2] == "True":
Modules/email_analyzer.py:target_eml = sys.argv[1]
Modules/hashScanner.py: if str(sys.argv[1]) == '--db_update':
Modules/hashScanner.py: if str(sys.argv[2]) == '--normal':
Modules/packerAnalyzer.py: if str(sys.argv[1]) == '--single':
Modules/packerAnalyzer.py: elif str(sys.argv[1]) == '--multiscan':
Modules/pcap_analyzer.py:target_pcap = sys.argv[1]
Modules/powershell_analyzer.py:target_pwsh = sys.argv[1]
Modules/windows_dynamic_analyzer.py:target_pid = int(sys.argv[1])
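
The raw indexing in the listing above is only safe because the caller guarantees enough arguments. A minimal sketch of a defensive alternative follows, using a hypothetical helper named argv_or_default. It is an illustration only and is not the get_argv function from Modules/utils.py, whose actual signature may differ:

```python
import sys


def argv_or_default(index, default=None, argv=None):
    """Return argv[index], or default when the list is too short (hypothetical helper)."""
    argv = sys.argv if argv is None else argv
    try:
        return argv[index]
    except IndexError:
        # Too few command-line arguments: fall back instead of crashing.
        return default


# Example with an explicit argument list instead of the real sys.argv.
args = ["analyzer.py", "sample.apk"]
target = argv_or_default(1, argv=args)
mode = argv_or_default(3, default="JAR", argv=args)
```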

CYB3RMX commented 5 months ago

No problem, man. I just meant to say let's come up with a solution for this too.

luis261 commented 5 months ago

Yes, thank you my friend (:

My most recent changes should fix the issues you found 🤞

luis261 commented 4 months ago

@CYB3RMX I know you're a busy guy but do you think there's any chance for us to target a merge this week?

CYB3RMX commented 4 months ago

Sorry, my friend. I've been really, really busy lately. But don't worry, I haven't forgotten :)

luis261 commented 4 months ago

I'm sorry to hear that, hope it gets better for you!

Thanks for letting me know, feel free to take all the time you need, as long as you don't forget, I am ok with waiting (:

CYB3RMX commented 4 months ago

Hello @luis261

I reviewed the changes you made. Everything seems fine. However, when I used the "--report" argument on the LinuxAnalysis side, I noticed some issues in the reports like this:

Could you please test the changes thoroughly and get back to me? I don't have enough time for testing these days.

luis261 commented 4 months ago

Hi @CYB3RMX

alright, sorry about that. I'll have to set up a suitable Linux machine, which might take a while. But I'll get back to you once that's done and I've gotten around to testing it (:

luis261 commented 3 months ago

Ok, I'm done with the setup and testing run! The previous commit indeed fixes the markup issue.

However, regarding the Unicode Null character \u0000 in the interpreter value: it is also present on your master branch (reproduced by another testing run of mine, see image below), so entirely unrelated to my refactoring. My best guess is that it stems either from an error/issue internal to lief.parse or more likely a usage error in the way we call chr on the result of self.binary.get_section(sec_name).content
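
To illustrate the second guess without depending on lief: the interpreter path in an ELF file is stored as a NUL-terminated string, so converting every byte with chr keeps the terminator. The byte values below are an assumed stand-in for the section content, not output captured from lief, and the cleanup shown is only one possible approach:

```python
# Assumed stand-in for the integer byte values of the section content.
content = list(b"/lib64/ld-linux-x86-64.so.2\x00")

# Naive conversion: every byte becomes a character, including the terminator.
naive = "".join(chr(byte) for byte in content)

# Possible cleanup: cut at the first NUL byte before decoding.
cleaned = bytes(content).split(b"\x00", 1)[0].decode("ascii")

print(repr(naive))
print(repr(cleaned))
```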

I suggest laying that minor issue aside for now and moving on with this change since these matters are unrelated. If you want, I'll open a small followup PR for that after this one gets merged.

luis261 commented 3 months ago

Alright, I've actually managed to expand on the original fix and do without a deepcopy. Since strings in Python are immutable, a shallow copy is sufficient here, meaning we can avoid the extra import of copy.deepcopy (or even copy.copy, for that matter) and instead just use dict.copy in the commit I pushed just now: ff7c29041d3db60ac96074b86ccc12ad8cf371f8. This also leads to a small performance gain over both the previous version and the pre-refactoring code, since the values other than the modified string, as well as the dictionary's keys, remain shared without unnecessary copies (such copies existed in all non-faulty versions of the code)
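
A small sketch of why dict.copy is enough in this situation. The dictionary contents and the modification are made up for illustration and do not mirror the actual report structure:

```python
# Illustrative data; the real report dictionary has different contents.
report = {
    "interpreter": "/lib64/ld-linux-x86-64.so.2",
    "sections": [".text", ".data"],
}

# Shallow copy: a new dictionary whose values are the same objects.
display = report.copy()

# Rebinding a key in the copy creates a new string; the original is untouched.
display["interpreter"] = display["interpreter"].upper()

print(report["interpreter"])
print(display["sections"] is report["sections"])
```

The caveat of a shallow copy is that mutable values, such as the list here, stay shared, so mutating one in place would be visible through both dictionaries.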

I just retested on this latest commit and it still works.

CYB3RMX commented 3 months ago

Hello again @luis261 !!

I checked your commits and everything seems okay now :)

Now I finally accept your pull request man :D

luis261 commented 3 months ago

@CYB3RMX that's so exciting, it has been a pleasure working with you so far!

I'm really grateful to be a part of this, I appreciate your commitment and support with striving for increased code quality, looking forward to further collaboration (:

CYB3RMX commented 3 months ago

You're welcome man :) Thank you for your work!!
