Preservation of file timestamps

Hello, thanks for accepting the job of maintaining the project

As you probably know, cpio files are handled as "special", using the plugins.cpio module rather than the configuration file:

https://github.com/OSPG/binwalk/blob/3dd13d7fd890f662ac6c75344906ebe50524200e/src/binwalk/plugins/cpio.py#L47

The extraction command is straightforward, but I've found the need to make a slight modification, specifically adding -m/--preserve-modification-time

I use that because it makes it easier to quickly get an idea of how old the files in the archive are

It's often especially helpful if you have, for example, a bundle of blobs with non-descriptive names, and you would like to identify the most recent

I don't think the current behavior is a bug and am therefore not sending a PR with this, unless there's some consensus about whether it's desirable or not

A couple of downsides of making this change that I can think of:

It may break someone's automated workflow, if said workflow depends on the mtimes being set to the time of unpacking (current behavior) rather than a time closer to the time of packing
The modification times in an archive may not always be correct or meaningful anyway, which could cause confusion

In my case, it's proven really useful, though

Thoughts on making this change (or accepting a very small PR for it)?

Thx

EDIT: I considered at one time having a runtime option in binwalk cli to influence this behavior for all archive types that supported timestamps, but ultimately decided it wasn't worth the effort for my one-off use-case, where I only needed it for cpio

Sounds like a good idea! It would have also the additional benefit of being reproducible. Extracting the same archive will always result in the same mtimes, which could be a benefit for some workflows. Also, even if the times are incorrect, it could give some hints to the user about how or when that archive was created.

If it's changed, though, I think it would be better to apply it to all type that supports it. I'm not so sure how to do it, however. Adding a global option to preserve mtime could be confusing for types that don't support it. Maybe defaulting to preserve mtime for all supported types and then adding an option to, forcefully, not preserve it could be less confusing. That would also mean that not preserving mtime for a type that supports it would be considered a bug.

If a workflow breaks because of the change to the defaults, there would be an option that would easily fix it (though it's unlikely that such a breakage occurs anyway).

Ideally, it would be great if we could gather more use cases and opinions. More realistically, my idea would be to work on a proposal, merge it and let it there for some time for users to test it. Eventually, if there are no complaints, a release will be made that incorporate those changes.

Sounds like a good idea! It would have also the additional benefit of being reproducible. Extracting the same archive will always result in the same mtimes, which could be a benefit for some workflows. Also, even if the times are incorrect, it could give some hints to the user about how or when that archive was created.

Great, happy you seem to agree with the potential utility/value

If it's changed, though, I think it would be better to apply it to all type that supports it. I'm not so sure how to do it, however. Adding a global option to preserve mtime could be confusing for types that don't support it.

Yeah, I agree. Though the global option is what I struggled with when I considered a "proper" option initially. Not the technical implementation per se, but identifying the cleanest least invasive way to pass the options along

I may be misremembering, but I seem to recall there being no scaffolding for this sort of thing in the code as it is now

And as for the general concept of influencing behavior of any (or all) extraction, it seems it was never intended. The only quick ways to do it right now that I saw were:

Editing the configuration file, for formats like tar, zip, etc.
Editing the format specific Python module (in this case, the cpio module)

You'll probably agree that the "configuration file" isn't really a configuration file, at least not in the traditional sense. It's not something a user should be editing often if at all, just an extension of the code (I'm referring to the file with the mapping of file types to the cascade of extraction commands)

Maybe defaulting to preserve mtime for all supported types and then adding an option to, forcefully, not preserve it could be less confusing. That would also mean that not preserving mtime for a type that supports it would be considered a bug.

I'm happy with that if you are. And if it breaks someone, they can "come out of the woodwork" and perhaps lend a hand 😊

If a workflow breaks because of the change to the defaults, there would be an option that would easily fix it (though it's unlikely that such a breakage occurs anyway).

To keep it quick/simple, this could be an environment variable in any early iteration (e.g. BINWALK_NO_MTIME=1 or similar). That avoids needing to deal with adding options to the entry-point script and plumb them along to the extraction functions

Ideally, it would be great if we could gather more use cases and opinions. More realistically, my idea would be to work on a proposal, merge it and let it there for some time for users to test it. Eventually, if there are no complaints, a release will be made that incorporate those changes.

Agree, and that thinking is consistent with my comment above regarding the figurative woodwork

I would love to say "I'll work on it" or "here's a PR" but my time is sapped right now. I do have enough time to participate in issues here and there if it's helpful, but I understand that's of limited practical value. It's hard finding time to sit down and write any code, no matter how simple

How much time do you have to dedicate to this project? I saw a flurry of development after the fork (the long overdue 2to3 effort, updates to deps.sh, etc) which was nice to see after years of the upstream neglect (no blame or criticism there, just an objective observation)

Looking forward to any improvements. I even appreciate the aesthetic stuff. Craig wrote really good C code (reaver, for example) and the Python code he wrote of course works- but he was clearly not a PEP8 nerd. We can forgive him for that, I think 😊

BTW, I have a few ideas about what could be done to improve the project- some are easier than others. One thing that comes to mind is starting a proactive effort to collect a corpus of different firmwares - for two reasons - one, for use in CI/test/cases and two, to identify poorly handled blobs and improve the handling of each over time. In my opinion it's most reasonable to make that a community effort (is "crowd-sourcing" still a buzzword?) but I'm not sure how to incentivize an effort like that. There's a lot less engagement from the "community" than there should be, in my opinion, considering how much value the tool provides. I also think it's fair to say that the majority of users would be completely lost without the tool- yet it seems to be taken for granted. End rant

To keep it quick/simple, this could be an environment variable in any early iteration (e.g. BINWALK_NO_MTIME=1 or similar).

I like the idea of having an env var. It will still look a bit hacky, but is probably the best we have without a major refactoring (that I'm not willing to make).

How much time do you have to dedicate to this project?

I either have too much time. I can dedicate some days writing code from time to time, but I'm more reactive than proactive. Everything that is not a bug is low priority for my part.

I have a few ideas about what could be done to improve the project- some are easier than others. One thing that comes to mind is starting a proactive effort to collect a corpus of different firmwares - for two reasons - one, for use in CI/test/cases and two, to identify poorly handled blobs and improve the handling of each over time. In my opinion it's most reasonable to make that a community effort (is "crowd-sourcing" still a buzzword?) but I'm not sure how to incentivize an effort like that.

What you propose is interesting. Given that there are some other tools in more active development (e.g: unblob), it could be useful to see how other projects approach it and, if they don't, try to find if there can be a collaborative effort between similar projects. If you want, feel free to open some issues about those ideas. Even if it's not code, I think it valuable. Maybe it enables a positive discussion about a topic, or it motivates someone to send a PR. At the very least, it helps me to know which parts need more attention. So don't hesitate to open issues :)

OSPG / binwalk

Preservation of file timestamps #21