Caesurus opened this issue 3 years ago
Hi, our current architecture makes this complicated to achieve, but I understand the appeal. I initially thought we might be able to solve this with some kind of callback, but that would have to happen in FACT_core, since this extractor only ever looks at a single file. A post-unpack plugin that looks at the extracted content and searches for such an occurrence would probably work, but I'm not sure there is a way to build that without completely going against our current architecture. Even looking at FACT_core, the unpack scheduler does not care which container a processed file comes from, so it does not have an easy way to check for dependencies between multiple files from the same source.

So: while I haven't encountered such a case, I'd be interested in supporting it. I think the next step should be a more specific sketch of where the new functionality would be added and how the interfacing components are affected. Then we can discuss it, and maybe we can even support you in implementing it.
My rough idea was to add a method to `UnpackBase`, `register_post_plugin(search_pattern, unpacker_name_and_function)`, that would store its arguments just like `register_plugin` does, but would use a glob/rglob search pattern, e.g. `rawprogram*.xml`.
The anatomy of a post-extraction unpack plugin would be the same as that of any other plugin (name/version, etc.).
Then modify the `Unpacker.unpack()` function to do something along these lines:

```diff
  extracted_files, meta_data = self.extract_files_from_file(file_path, tmp_dir.name)
  extracted_files, meta_data = self._do_fallback_if_necessary(extracted_files, meta_data, tmp_dir.name, file_path)
+ extracted_files, meta_data = self.post_extraction_plugins(extracted_files, meta_data, tmp_dir.name, file_path)
  extracted_files = self.move_extracted_files(extracted_files, Path(tmp_dir.name))
```
`post_extraction_plugins()` would basically just iterate through the registered patterns, try to match each one against the list of `extracted_files`, and call the corresponding function whenever a match is found.
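A minimal sketch of what I have in mind (the names `register_post_plugin`, `_POST_PLUGINS`, and the plugin callback signature are all hypothetical here, not existing fact_extractor API):

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical registry: (glob pattern, callback) pairs, analogous to the
# suffix/MIME registry that register_plugin maintains today.
_POST_PLUGINS = []


def register_post_plugin(search_pattern, unpacker_function):
    """Store a (pattern, function) pair, just like register_plugin stores plugins."""
    _POST_PLUGINS.append((search_pattern, unpacker_function))


def post_extraction_plugins(extracted_files, meta_data, tmp_dir, file_path):
    """Run each registered post plugin whose pattern matches an extracted file.

    The callback receives the matching files and the extraction directory and
    returns (new_files, plugin_meta); results are merged into the existing lists.
    """
    for pattern, unpack_function in _POST_PLUGINS:
        matches = [f for f in extracted_files if fnmatch(Path(f).name, pattern)]
        if matches:
            new_files, plugin_meta = unpack_function(matches, tmp_dir)
            extracted_files.extend(new_files)
            meta_data.setdefault('post_plugins', []).append(plugin_meta)
    return extracted_files, meta_data
```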
What I don't love about this is that it could add a nontrivial amount of time to the process if a lot of these patterns are registered.
I think I may still like the idea of adding a filter to the mix as well. So, for example, only scan for a specific pattern if the top-level file's MIME type is within a given list...
This seems like a fairly minimal set of changes to implement, and would make for a better implementation for these specific use cases...
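The filter idea could be as simple as a pre-check before any glob matching happens. Purely illustrative (`plugin_applies` is not an existing function, and the whitelist semantics are an assumption):

```python
from fnmatch import fnmatch


def plugin_applies(container_mime, mime_whitelist, file_names, pattern):
    """Pre-filter for a post-extraction plugin.

    Skip the (potentially slow) pattern matching entirely unless the
    container's MIME type is in the plugin's whitelist; a whitelist of
    None means "run for every container type".
    """
    if mime_whitelist is not None and container_mime not in mime_whitelist:
        return False
    return any(fnmatch(name, pattern) for name in file_names)
```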
Here are some use cases that I have encountered that rely on multiple files (usually there is a manifest file). Once these files are processed, the `genericFS` plugin can then extract files from the resulting images.
- Block-based OTA Android images (https://invisiblek.github.io/lineage_wiki/extracting_blobs_from_zips.html). These have a `*transfer.list` file that is used by `sdat2img` to unpack an image into a usable format.
- The Qualcomm QFIL utility, which splits files and maintains offsets in `rawprogram*.xml`/`patch.xml` files.
- Custom packed formats. `.rfw` is one example; it also has a manifest file that's used to reassemble the files.
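For the first use case, the core of what `sdat2img` does with the manifest can be sketched roughly like this. This is only an approximation of the real tool and assumes the commonly documented `transfer.list` layout (version line, total-blocks line, two stash lines for version >= 2, then commands with rangesets like `new 2,0,512`):

```python
BLOCK_SIZE = 4096  # Android block-based OTA images use 4 KiB blocks


def parse_new_commands(transfer_list_text):
    """Parse the 'new' command rangesets from a transfer.list (sketch).

    Each rangeset is "count,begin1,end1,begin2,end2,..."; the (begin, end)
    block pairs say where in the output image the data blocks belong.
    """
    lines = transfer_list_text.splitlines()
    # version 1 has commands from line 3 on; version >= 2 adds two stash lines
    commands = lines[4:] if int(lines[0]) >= 2 else lines[2:]
    ranges = []
    for line in commands:
        cmd, _, args = line.partition(' ')
        if cmd == 'new':
            nums = [int(n) for n in args.split(',')]
            # nums[0] is the pair count; the rest are (begin, end) block pairs
            ranges.append(list(zip(nums[1::2], nums[2::2])))
    return ranges
```

Reassembly then just copies `BLOCK_SIZE`-sized chunks from the `.dat` file into the output image at each `begin * BLOCK_SIZE` offset.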
For the record: "_factextractor is so neat, I use it everyday" <-- Me... I just said that, and it's true.
What is the status of this by the way? Have you worked on a solution yet? I think we haven't on our side, though I see a possibility of trying out some ideas in the fall.
So... the short version... I haven't started implementing it.
The longer version: I have a system that uses `fact_extractor` to extract files from other files, and I recently added support for extraction deduplication. The assertion is:
Given a specific file hash, and assuming the unpack plugins haven't changed, `fact_extractor` will extract exactly the same files each time.
So if an archive occurs in multiple firmware images, we shouldn't spend resources extracting it every time we see it. There are some inherent complexities that come with this, because you need to store a bunch of information about which plugins were used, plus complexity around handling archives within archives, etc. Anyway, I'm going off on a tangent.
All this to say that I rely on the plugin name and plugin version to determine if a file needs to be re-processed because of a change in the plugins.
Now, if a post-unpack plugin can potentially be run after any other plugins, it introduces more complexity to that logic, and that needs some serious consideration to get right. So I delayed working on this until it bothers me enough to implement.
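The dedup logic I described boils down to something like the following sketch. The key layout and the `plugins_used` mapping are assumptions of my own system, not fact_extractor internals:

```python
import hashlib
import json


def extraction_cache_key(file_sha256, plugins_used):
    """Deduplication key for an extraction result (sketch).

    The same file processed by the same plugin versions should yield the
    same extracted files, so the key combines the file hash with a
    canonical (sorted, JSON-encoded) form of the {name: version} mapping.
    A post-unpack plugin that may fire after *any* plugin complicates
    knowing which names belong in plugins_used in the first place.
    """
    canonical = json.dumps(sorted(plugins_used.items()))
    return hashlib.sha256((file_sha256 + canonical).encode()).hexdigest()
```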
Of course none of this is anything you, as maintainers, need to worry about. I'm encountering more instances where this will be useful though. It looks like Qualcomm really likes following this paradigm of having manifest files alongside binary files.
I will definitely ping you and provide an update if/when I implement something like this.
What I'm finding in a bunch of cases is a need for unpack plugins to have access to multiple files at the same time in order to complete a full unpack. Here is an example: I ported the following Go code to Python: packsparseimg.go. The idea is that there is a `rawprogram*.xml` file that holds information about how to unpack the `.img` files that go with it. For instance, there can be several `userdata_x.img` files, and the XML file contains the offsets of where to write the individual files in order to reassemble the outer image.

This isn't a problem if the outer container format is known. For instance, if it's a `tar` file, I could add functionality to the `patool` plugin to check for `rawprogram*.xml` files and process them. But if the outer container is a `7z` file, then I have to duplicate that functionality in the `7z` plugin. I could put this functionality in a helper class that's available to several plugins, but that doesn't feel very extensible.

What I'd like to see is a different type of plugin that registers a file pattern to look for in extracted files. If a match is found, this plugin is called with the directory that contains the extracted files, so that it can reassemble the `.img` files into something another plugin can then unpack in isolation.
In my opinion this makes it more modular and extensible, but I would like your opinion on this. Are you open to this plan? If I implement this in a fork, will you consider a PR with this functionality? If this doesn't sound appealing or you have objections, I'd love to hear them and I'll adjust accordingly.