fkie-cad / fact_extractor

Standalone Utility for FACT-like extraction
GNU General Public License v3.0

Support for post-unpack plugins #80

Open Caesurus opened 3 years ago

Caesurus commented 3 years ago

What I'm finding in a bunch of cases is a need for unpack plugins to have access to multiple files at the same time in order to complete a full unpack. Here is an example: I ported the following Go code to Python: packsparseimg.go

The idea with this is that there is a rawprogram*.xml file that holds information about how to unpack the .img files that go with it. For instance, there can be several userdata_x.img files, and the xml file contains the offsets where the individual files have to be written in order to re-assemble the outer image.
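To make the re-assembly step concrete, here is a minimal sketch of the general idea, assuming a simplified manifest with hypothetical filename/offset attributes (the real rawprogram*.xml schema is sector-based and more involved):

    from pathlib import Path
    from xml.etree import ElementTree


    def reassemble_image(manifest: Path, output: Path) -> None:
        """Write each split chunk into the output image at the offset given by the manifest."""
        tree = ElementTree.parse(manifest)
        with open(output, 'wb') as assembled:
            for entry in tree.iter('program'):
                chunk = manifest.parent / entry.get('filename')
                if not chunk.is_file():
                    continue  # some chunks may be absent (sparse dumps)
                assembled.seek(int(entry.get('offset')))  # hypothetical byte offset attribute
                assembled.write(chunk.read_bytes())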

This isn't a problem if the outer container format is known. For instance, if it's a tar file I could add functionality to the patool plugin to check for rawprogram*.xml files and process them. But if the outer container is a 7z file, then I have to duplicate that functionality in the 7z plugin. I could put this functionality in a helper class that's available from several plugins, but that doesn't feel very extensible.

What I'd like to see is a different type of plugin that registers a file pattern to look for in the extracted files; if a match is found, the plugin is called with the directory that contains the extracted files so that it can re-assemble the .img files into something another plugin can then unpack in isolation.

In my opinion this makes things more modular and extensible, but I'd like your opinion on it. Are you open to this plan? If I implement this in a fork, will you consider a PR with this functionality? If this doesn't sound appealing or you have objections, I'd love to hear them and I'll adjust accordingly.

dorpvom commented 2 years ago

Hi, our current architecture makes this complicated to achieve, but I understand the appeal. I initially thought we might be able to solve this with some kind of callback, but that would have to happen in FACT_core, since this extractor only ever looks at a single file. A post-unpack plugin that looks at the extracted content and searches for such an occurrence would probably work, but I'm not sure there is a way to build that without completely going against our current architecture. Even looking at FACT_core, the unpack scheduler does not care which container a processed file comes from, so it has no easy way to check for dependencies between multiple files from the same source.

So: while I haven't encountered such a case myself, I'd be interested in supporting it. I think the next step should be a more specific sketch of where the new functionality would be added and how the interfacing components are affected. Then we can discuss it, and maybe we can even support you in implementing it.

Caesurus commented 2 years ago

My rough idea was to add a method to UnpackBase, register_post_plugin(search_pattern, unpacker_name_and_function), that would store the plugin just like register_plugin does, but keyed on a glob/rglob search pattern, e.g. rawprogram*.xml.

The anatomy of the post-extraction unpack plugin would be the same as any of the other plugins (name/version etc.).

Then modify the Unpacker.unpack() function to do something along these lines:


    extracted_files, meta_data = self.extract_files_from_file(file_path, tmp_dir.name)
    extracted_files, meta_data = self._do_fallback_if_necessary(extracted_files, meta_data, tmp_dir.name, file_path)

+   extracted_files, meta_data = self.post_extraction_plugins(extracted_files, meta_data, tmp_dir.name, file_path)

    extracted_files = self.move_extracted_files(extracted_files, Path(tmp_dir.name))

post_extraction_plugins() would basically just iterate through the registered patterns, try to match each one against the list of extracted_files, and call the corresponding function when a match is found. What I don't love about this is that it could add a nontrivial amount of time to the process if a lot of these patterns are registered.

I think I may still like the idea of adding a filter to the mix as well, so that, for example, a specific pattern is only scanned for if the top-level file's MIME type is within a given list...

This seems like a fairly minimal set of changes to implement, and would make for a cleaner implementation of these specific use cases...
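To make that sketch concrete, here is roughly what the registration and matching could look like; this is a hedged sketch of the proposal, not the actual UnpackBase code, and the names, signatures, and the optional MIME-filter parameter are hypothetical:

    from fnmatch import fnmatch
    from pathlib import Path


    class UnpackBase:  # sketch only, not the real UnpackBase implementation
        def __init__(self):
            self._post_plugins = []  # list of (glob_pattern, mime_whitelist, unpack_function)

        def register_post_plugin(self, search_pattern, unpack_function, mime_whitelist=None):
            # store the callback keyed on a glob pattern, e.g. 'rawprogram*.xml'
            self._post_plugins.append((search_pattern, mime_whitelist, unpack_function))

        def post_extraction_plugins(self, extracted_files, meta_data, tmp_dir, file_path):
            for pattern, mime_whitelist, unpack_function in self._post_plugins:
                if mime_whitelist and meta_data.get('mime') not in mime_whitelist:
                    continue  # optional filter: only run for whitelisted container MIME types
                matches = [f for f in extracted_files if fnmatch(Path(f).name, pattern)]
                if matches:
                    # the plugin sees the whole extraction directory, so it can combine sibling files
                    new_files, meta_data = unpack_function(matches, tmp_dir, meta_data)
                    extracted_files.extend(new_files)
            return extracted_files, meta_data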

Here are some use cases that I have encountered that rely on multiple files (usually there is a manifest file). Once these files are processed, the genericFS plugin can then extract files from the resulting images.

  1. Block-based OTA Android images: https://invisiblek.github.io/lineage_wiki/extracting_blobs_from_zips.html. These have a *transfer.list file that is used by sdat2img to unpack an image into a usable format.

  2. The Qualcomm QFIL utility, which splits files and keeps the offsets in the rawprogram*.xml/patch.xml files.

  3. Custom packed formats: .rfw is one example; it also has a manifest file that is used to re-assemble the files.

Caesurus commented 2 years ago

For the record: "fact_extractor is so neat, I use it every day" <-- Me... I just said that, and it's true.

dorpvom commented 2 years ago

What is the status of this by the way? Have you worked on a solution yet? I think we haven't on our side, though I see a possibility of trying out some ideas in the fall.

Caesurus commented 2 years ago

So... the short version... I haven't started implementing it.

The longer version: I have a system that uses fact_extractor to extract files from other files, and I recently added support for extraction deduplication. The assertion is:

Given a specific file hash and assuming the unpack plugins haven't changed... fact_extractor will extract exactly the same files each time.

So if there is an archive that occurs in multiple firmware images, we shouldn't spend resources extracting it every time we see it. There are some inherent complexities that come with this, because you need to store a bunch of information about which plugins were used, plus the complexity of handling archives within archives etc.... anyway, I'm going off on a tangent.

All this to say that I rely on the plugin name and plugin version to determine if a file needs to be re-processed because of a change in the plugins.
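As a hedged illustration of the kind of bookkeeping that implies (not fact_extractor code, just a sketch with hypothetical names), the re-use decision boils down to a cache key derived from the file hash plus the set of (plugin name, plugin version) pairs:

    import hashlib
    import json


    def extraction_cache_key(file_path, used_plugins):
        """Hypothetical cache key: if both the file hash and the (plugin name, version)
        pairs match a previous run, the stored extraction result can be reused."""
        with open(file_path, 'rb') as fp:
            file_hash = hashlib.sha256(fp.read()).hexdigest()
        plugin_state = json.dumps(sorted(used_plugins))  # e.g. [('7z', '0.8'), ('patool', '1.1')]
        return hashlib.sha256((file_hash + plugin_state).encode()).hexdigest()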

Now, if a post-unpack plugin can potentially run after any other plugin, it adds more complexity to that logic, and that needs some serious consideration to get right. So I've delayed working on this until it bothers me enough to implement.

Of course none of this is anything you, as maintainers, need to worry about. I'm encountering more instances where this will be useful though. It looks like Qualcomm really likes following this paradigm of having manifest files alongside binary files.

I will definitely ping you and provide an update if/when I implement something like this.