armijnhemel / binaryanalysis-ng

Binary Analysis Next Generation (BANG)
GNU Affero General Public License v3.0

Suggest adding methods to scan folders and to detect duplicated files #357

Closed chimelab closed 9 months ago

chimelab commented 9 months ago

Q1. I've been studying it for a few days now. This is an excellent project, better than BAT in many respects. I really hope it can support scanning directories, as BAT does. I tried modifying it but failed, so I hope the development team can provide such a method. This would eliminate the need to decompress again when rescanning large projects, or allow me to use an external decompressor.

Q2. It's important to skip duplicate files to improve performance. I made the changes below. First, add a line right after "ignored" is initialized:

    scan_environment.processed = set()

Then mark and bypass duplicate files as below:

                labels = ['ignored']
                scanjob.meta_directory.info.setdefault('labels', labels)
                continue

            if hashes['sha256'] in scan_environment.processed:
                labels = ['duplicated']
                scanjob.meta_directory.info.setdefault('labels', labels)
                continue

            scan_environment.processed.add(hashes['sha256'])

        # start the pipeline for the job
        pipeline(scanjob.scan_environment, scanjob.meta_directory)

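
The duplicate check described above can be sketched in isolation. This is a minimal, standalone illustration and not BANG code: the helper names `sha256_of` and `label_files` are hypothetical, and a plain in-memory set is assumed to be enough, which would not hold if files are hashed in separate worker processes.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as infile:
        while chunk := infile.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def label_files(paths):
    """Label a path 'new' the first time its contents are seen, else 'duplicated'."""
    processed = set()  # digests seen so far (a set, so membership tests and add() work)
    labels = {}
    for path in paths:
        checksum = sha256_of(path)
        if checksum in processed:
            # same contents were already handled: skip the expensive pipeline
            labels[str(path)] = 'duplicated'
            continue
        processed.add(checksum)
        labels[str(path)] = 'new'
    return labels
```

Only the first file with a given digest would go through the full pipeline; later copies are merely labelled, so any results for them would have to be looked up via the first occurrence.
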
armijnhemel commented 9 months ago

Re-adding the directory scanning option should be relatively straightforward. I will work on this in the coming few weeks.

Regarding detecting and skipping duplicate files: this is a bit trickier. The problem that I have with this is that it requires quite a bit of bookkeeping to make sure that in the end the data is correct. This would mean I need to rework some code. It is on my TODO list, but it is not my highest priority.

armijnhemel commented 9 months ago

> Q1. I've been studying it for a few days now. This is an excellent project, better than BAT in many respects. I really hope it can support scanning directories, as BAT does. I tried modifying it but failed, so I hope the development team can provide such a method. This would eliminate the need to decompress again when rescanning large projects, or allow me to use an external decompressor.

An external decompressor should not be needed if everything is in one large archive. Can you describe your use case a little bit more (as I am curious about it)?

armijnhemel commented 9 months ago

> Re-adding the directory scanning option should be relatively straightforward. I will work on this in the coming few weeks.

2877c4fb8a3f222c7b5771934f4ebb8b9d7f4f08

chimelab commented 9 months ago

> > Q1. I've been studying it for a few days now. This is an excellent project, better than BAT in many respects. I really hope it can support scanning directories, as BAT does. I tried modifying it but failed, so I hope the development team can provide such a method. This would eliminate the need to decompress again when rescanning large projects, or allow me to use an external decompressor.

> An external decompressor should not be needed if everything is in one large archive. Can you describe your use case a little bit more (as I am curious about it)?

Sometimes packages are encrypted or packed with proprietary algorithms, and often the corresponding extraction tools are also provided. Integrating these tools into BANG parsers is hard, almost impossible. But if directories were an option, or if there were a hook before each nested unpacking step, it would be easy to handle with a script. I am a user of the BAT tool. I separated the BAT unpack functions from bat-scan and always provide a directory to bat-scan. These changes make it possible to use external tools and make debugging easier.
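
The workflow described above (run a vendor-supplied extractor first, then hand the resulting directory to the scanner) could look roughly like the sketch below. Everything here is an assumption for illustration: `unpack_with_external_tool` is a hypothetical helper, not a BANG or BAT function, and the command template with `{archive}` and `{outdir}` placeholders is an invented convention.

```python
import subprocess
from pathlib import Path


def unpack_with_external_tool(archive, command_template, output_dir):
    """Run an external extractor and return the files it produced.

    command_template is a list of arguments in which '{archive}' and
    '{outdir}' are replaced by the input file and the output directory.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    command = [part.format(archive=str(archive), outdir=str(output_dir))
               for part in command_template]
    # check=True raises CalledProcessError if the extractor fails
    subprocess.run(command, check=True)
    # the output directory can now be given to a directory scan
    return sorted(p for p in output_dir.rglob('*') if p.is_file())
```

Keeping extraction outside the scanner means the extractor can be swapped or debugged on its own, at the cost of the scanner no longer knowing how the directory contents were produced.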

chimelab commented 9 months ago

> Re-adding the directory scanning option should be relatively straightforward. I will work on this in the coming few weeks.

> Regarding detecting and skipping duplicate files: this is a bit trickier. The problem that I have with this is that it requires quite a bit of bookkeeping to make sure that in the end the data is correct. This would mean I need to rework some code. It is on my TODO list, but it is not my highest priority.

Looking forward to it in the future. Thanks Dr.

armijnhemel commented 9 months ago

> > An external decompressor should not be needed if everything is in one large archive. Can you describe your use case a little bit more (as I am curious about it)?

> Sometimes packages are encrypted or packed with proprietary algorithms, and often the corresponding extraction tools are also provided. Integrating these tools into BANG parsers is hard, almost impossible. But if directories were an option, or if there were a hook before each nested unpacking step, it would be easy to handle with a script. I am a user of the BAT tool. I separated the BAT unpack functions from bat-scan and always provide a directory to bat-scan. These changes make it possible to use external tools and make debugging easier.

OK. This makes sense. If there are any extraction tools or options that you think are missing and for which there are specifications, then please let me know (I already know about RAR, that's on my TODO list).

I have added a directory scanning option, although it is not ideal and there might be some performance issues (files in the top-level directory will not be processed in parallel). Although I could fix that, it would require reworking the code quite a bit and might lead to other issues or annoyances (like not knowing exactly when an archive has been scanned completely). I think that this is an acceptable compromise.
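
To illustrate the trade-off, here is a minimal sketch of a sequential directory scan. It is not the code from the commit: `scan_directory` and the `scan_file` callback are hypothetical names, and walking the tree recursively is an assumption.

```python
from pathlib import Path


def scan_directory(directory, scan_file):
    """Scan every regular file under directory, one after another.

    scan_file is a callable that takes a Path and returns a result.
    Each call finishes before the next starts, so it is always clear
    when a given file has been scanned completely.
    """
    results = {}
    for path in sorted(Path(directory).rglob('*')):
        # skip directories, symbolic links and other special files
        if path.is_symlink() or not path.is_file():
            continue
        results[path.relative_to(directory).as_posix()] = scan_file(path)
    return results
```

Processing the files in a plain loop gives up parallelism across them, but completion of each file is trivially known, which is the compromise described above.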