Open jwilk opened 7 years ago
Thanks! @jwilk I like how you use your tools. e.g. https://github.com/jwilk/pydiatra/ Very smart and duly noticed! would you care to submit a patch? Note that the code you find this bug in is actually not used and never was, though it could and should!
I don't quite understand the purpose of this function, so I'll somebody else patch it.
@jwilk sure. I will handle this and thanks again for the report. I am impressed by your tool! FWIW, the purpose is/was to detect if a string extracted from a binary may be a relative file path of sorts
In the end using a naive bayesian classifier may be more efficient. TBD.
@pombredanne Will the problem be solved if I change the re.compile to this?:
relative = re.compile(r'^(?:[^/]|\.+\/|\/)(([\w_\-]+\/)+|)[\w_\-]+\.[\w]+$', re.IGNORECASE).match
the parenthesis problem is fixed and I made some changes to make it more robust (I think xD have tested at regexr.com and it works fine ;)
but I'm not sure if it requires suffix such as.html
and .png
in every relative path we want to find here
Thanks!
@tina1998612 may be. The only way to know is to write unit tests. Note though that this code is NOT used for now so working on fixing this is very low priority.
Why not use Python's built-in function?
https://docs.python.org/2/library/os.path.html#os.path.isabs
@armijnhemel this definitely is a possibility, but you might need to call both posixpath
and ntpath
rather than the higher level os.path
wrapper as the paths in some arbitrary binary may be either posix or Windows irrespective of the os you are running. e.g. os.path
would resolve to ntpath
on Windows and therefore a function using os.path
would likely fail to handle posix paths conventions when analysing an Elf on Windows
Right, I hadn't thought of that. I guess you should also throw splitdrive() and splitunc() in the mix to be complete.
@armijnhemel yep, though that should not apply for detecting relative paths, but applies otherwise. Ah! things would be less fun without the joy of supporting non-posix platforms, would they?
So what is the use case of this? In build artefacts I encounter all sorts of paths.
@armijnhemel the use case there (which has not yet been implemented) is that given a bunch of strings from a random binary, we should be able to classify each string so extract some meaning. Is this a path? a relative path? some symbol? some literal? such that with that you could get better clues towards some origin detection and/or matching. At second thoughts, this not-implemented-yet attempt may not be needed or may not be the right approach. I had some experiments on using a naive bayes classifier, but I did not pursue as getting a proper large enough training data set was too complicated.
textcode.strings.is_relative_path()
is completely broken:Tested against git HEAD (bca0af4a6ff10f4ef0bef6459f2f06a17d21f9f6).
Found using pydiatra.