Changes to the asar format? Magic number, checksum, filesize

bwin commented 9 years ago

Currently an asar archive consists of these parts:

HEADERSIZE HEADER FILES

I would like to add a magic number(/string) to the beginning like 'ASAR' or something similar. This would be backwards incompatible, but we could read it in the old format if it doesn't start with the magic string and show a deprecation notice for now. Why? Currently we would read any file, interpret the first 8 bytes as UINT64 and try to read that much bytes and try to interpret that as json. Although this works, IMHO a magic number (or file signature) would be a nicer way.

If you're (@zcbenz) ok with the idea, how about adding a checksum too? (When you do incompatible changes do them all at once.) In normal operation the checksum can be ignored, but it could be used to check the integrity of the file when someone wants to. It would also be possible to put the checksum into the json-header, but I don't like that.

I'm also in favor of #4, putting the filesize at the end. (This one doesn't break anything.)

BTW is there a reason that the header (json-filelist) gets pickled and not just written to the file? Is this faster or better in any way?

I'm also aware, that these changes need to be reflected in atom-shell to have any meaning at all.

It could look like this:

MAGIC HEADERSIZE CHECKSUM HEADER FILES FILESIZE

The position of HEADERSIZE and CHECKSUM could also be switched, I don't care.

bwin commented 9 years ago

@paulcbetts I don't know if you're using asar, but since I see you on nearly every atom-shell issue, maybe you got an opinion on this, too.

anaisbetts commented 9 years ago

We're not using Asar at the moment, we probably won't end up doing it (though it would probably improve install times quite a bit on Windows).

bwin commented 9 years ago

Ok, thanks.

Paul Betts notifications@github.com schrieb am So., 21. Dez. 2014 20:03:

We're not using Asar at the moment, we probably won't end up doing it (though it would probably improve install times quite a bit on Windows).

— Reply to this email directly or view it on GitHub https://github.com/atom/asar/issues/16#issuecomment-67780588.

zcbenz commented 9 years ago

I'm :+1: on adding magic number and checksum.

If you're (@zcbenz) ok with the idea, how about adding a checksum too? (When you do incompatible changes do them all at once.) In normal operation the checksum can be ignored, but it could be used to check the integrity of the file when someone wants to. It would also be possible to put the checksum into the json-header, but I don't like that.

I'm more willing to put the checksum in JSON header. I want the format itself to be as simple as possible, I think if a feature can be achieved without making the format more complicated, then we should keep the format unchanged.

BTW is there a reason that the header (json-filelist) gets pickled and not just written to the file? Is this faster or better in any way?

Pickle makes implementation simpler and safer, it takes care of compatibility between different platforms and checks whether you are reading the correct data, and it has a nice API to read/write strings.

bwin commented 9 years ago

@zcbenz Ok, just a quick question: What about ditching asar and using zip files instead? I think the answer is no, because of #673 (asar for runtime-mode). Is it?

bwin commented 9 years ago

I changed my mind (after playing around with bwin/asar-util) about a few things.

I would put the header behind the payload. This makes changing an archive easier/possible. Also appending is simple (overwrite current header with new payload and new header). Also this makes compressing files in one pass possible (because with the header in front, you need to know the compressed payload-file-size up front).
I would put a version field in the root of the json header. This is the file format version.
header-size should IMHO be an UINT32. I think json header > 4.2 GB is unrealistic. (otherwise UINT64)
max-file-size and max-archive-size is 1<<53 (max-header-size is 1<<32)
I'm not really sure about the checksum-part. You (Cheng) said, you want the checksum in json, but I don't like that. IMO generating this checksum is an uneccessary PITA because you have to null the checksum entry, to avoid including the checksum in generating the checksum. (I wrote that in a confusing way, admitted.) I don't really like my version either, because you're stuck with a fixed checksum size.

How to parse?

check magic
read headerSize and archiveSize at constant offset from end (depending on what happens with the checksum)
read header at constant offset from start (archiveSize - headerSize - sizeOfOtherFieldsAtEnd, avoids additional headerOffset value)
...

This is what it would look like asar-format

zcbenz commented 9 years ago

@zcbenz Ok, just a quick question: What about ditching asar and using zip files instead? I think the answer is no, because of #673 (asar for runtime-mode). Is it?

Using zip would make our code much more complicated, there are many other reasons but this is the biggest one.

I changed my mind (after playing around with bwin/asar-util) about a few things.

I really like your new design :+1:.

IMO generating this checksum is an uneccessary PITA because you have to null the checksum entry, to avoid including the checksum in generating the checksum.

You are right, I'm down to put the checksum as part of the format.

zcbenz commented 9 years ago

Hmm, why do we need a ARCHIVE-SIZE in header? Is it just the file size?

And did you mean size field will be UINT64 while keeping it less than 1<<53?

bwin commented 9 years ago

Thanks for the quick reply.

The archiveSize isn't in the header. It's at the end of the ~~file~~ archive (as in #4).

And yes, i would serialize the size as UINT64 while keeping it below 1<<53. It's still a valid UINT64, but we error out on the higher values.

zcbenz commented 9 years ago

The archiveSize isn't in the header. It's at the end of the file archive (as in #4).

Ah yeah, I forgot about that.

And yes, i would serialize the size as UINT64 while keeping it below 1<<53. It's still a valid UINT64, but we error out on the higher values.

:+1:

bwin commented 9 years ago

You are right, I'm down to put the checksum as part of the format.

I'm not 100% into my plan either. If we make it part of the format, we cannot change it to a longer checksum algo without breaking our format. Or we don't include it in our offset calculation (via a separate headerOffset), but we still don't know the length and algo used. So this info would go into the json header. Doesn't sound good to have it all over the place either.

I'm not sure about setting the checksum algo in stone. Or do we admit that it's just a basic check that the content isn't random garbage and say a MD5 (or SHA1 or whatever) will suffice forever (for what we want to provide)? What do you think?

zcbenz commented 9 years ago

A basic check is enough, let's just use the fixed length checksum, I don't think we need to worry too much about the future.

bwin commented 9 years ago

In asar-util I have implemented simple compression for files with gzip streams.

"some-file.txt": {
  offset: 1000,
  size: 5000, // decompressed size
  csize: 1234 // compressed size
  //comp: "gzip" or something to support different types of compression??
}

Are you willing to allow compression of file entries (on the other side)? Regarding file size this would make a lot of sense, considering the expected content (a lot of text). About slower loading when using compression. It would always be a choice (meaning you could always create an uncompressed asar). Or do you prefer compressing the asar for download (as zip archive or just through gzip) and keep the format without compression, trading file size for speed. Would be worth measuring if you're not sure if it's worth the trouble.

zcbenz commented 9 years ago

Are you willing to allow compression of file entries (on the other side)?

I prefer letting the users compress the asar archives themselves, e.g. don't allow compression of file entries.

As far as I can see, the asar archives can be used:

to be bundled with atom-shell when distributing apps
to be distributed alone when we have runtime mode of atom-shell
as a general format like tar

In the first case the whole app will be compressed when distributing so we don't need to compress asar archive alone. In the second case we can just compress the whole archive alone for download, and decompress it when installing, so no need to compress individual file entries. In the third case the archive can be compressed like tar, as asar.gz or asar.bz2.

bwin commented 9 years ago

I used numbers instead of strings for offset and size in json. Should work w/o problem, since we stay below 1<<53.
Is there a chance to switch this project to coffee-script? (I know you don't just go around asking that, but since this is a rewrite anyway and the other atom projects are also using coffee-script...)
I would prefer to serialize the header without chromium-pickle. It prepends 4 bytes (if I remember correctly), but since it's just text I don't really see why we would need that. We're only on little endian platforms anyway and I don't think we need chromium pickle at all.

A basic check is enough, let's just use the fixed length checksum

What should it be? MD5? SHA1? Something fancy?

About compression (examples from asar-util):

input: 815 files, 191 folders, 33.6MB (35.7MB on disk) (it's asar's node_modules)
asar: 33.7MB
compressed asar: 8.62MB The condition for compression was filesize > 500 or something like that.

In the third case the archive can be compressed like tar, as asar.gz or asar.bz2.

I don't see a use case for 3. But at first I liked the idea to keep my apps as .gz on disk. But this would (almost) always involve temporary files when running apps from. We could provide random access to all files in contrast to .gz or *.bz2 I think more of "after 2". If you have the runtime and (possibly a lot?) apps on your drive. It's still better than having to bundle the runtime with each, but dependencies can make an app quite big.

We should also think of a way to do efficient updates. In my app I used 3 different asar's for app, assets, dependencies to make updating more granular (which matters a lot, if you've changed just 2 lines of code and don't have to deploy 50MB of dependencies. I still do have to deploy the appcode or dependencies as a whole, so that isn't perfect either). But I see, that even this is not viable for the runtime version, it needs to be one file.

zcbenz commented 9 years ago

What should it be? MD5? SHA1? Something fancy?

I don't really have a preference, maybe just SHA1 since there is a very simple function in Chromium to compute SHA1 checksum.

But at first I liked the idea to keep my apps as .gz on disk. But this would (almost) always involve temporary files when running apps from. We could provide random access to all files in contrast to .gz or *.bz2

For downloaded apps I think we should always have decompressed archives, it is 2015 now the disk space should not be a problem. Even for an app as complicated as Atom it is still only 200~300mb decompressed.

As for the dependencies problem, I think we can have the runtime install and share the common dependencies instead of shipping the dependencies in apps.

zcbenz commented 9 years ago

Is there a chance to switch this project to coffee-script? (I know you don't just go around asking that, but since this is a rewrite anyway and the other atom projects are also using coffee-script...)

I'm fine with both JavaScript and CoffeeScript, I started with JavaScript because things were quite minimal, just go ahead if you want to convert everything to CoffeeScript.

I would prefer to serialize the header without chromium-pickle. It prepends 4 bytes (if I remember correctly), but since it's just text I don't really see why we would need that. We're only on little endian platforms anyway and I don't think we need chromium pickle at all.

We would have to write another C++ library for this if we don't use Pickle, because we have to read/write asar in both C++ and JavaScript. Pickle also has some nice features like simple type and length validation and String reading/writing, just a few bytes' cost is just fine in my mind.

bwin commented 9 years ago

I haven't really looked at this before, is atom-shell/atom/common/asar/* everything that needs to be changed "on the other side"?

zcbenz commented 9 years ago

I haven't really looked at this before, is atom-shell/atom/common/asar/* everything that needs to be changed "on the other side"?

It should be, some changes may also need moderations of:

atom-shell/atom/common/lib/asar.coffee
atom-shell/atom/browser/net/asar/*

YurySolovyov commented 9 years ago

I'm also in favor of replacing header size format and dropping Pickle dependency. Isn't node's buffer api enough to read a number even in c++? It has writeUInt32{BE,LE} and you can get endianness as require('os').endianness()

YurySolovyov commented 9 years ago

@zcbenz, idea is not to drop Pickle from atom-shell, (although it would be cool too), but form at least asar packager, so users don't need to compile native modules just to pack their apps

YurySolovyov commented 9 years ago

Proof of concept: gist All tests are passing and I was able to read my app's package, made by original packager with pickle. I can do clean-up and PR if you are ok with general approach. /cc @zcbenz

zcbenz commented 9 years ago

@YuriSolovyov I'm good replacing Pickle with a pure JavaScript implementation, I think a decent solution is to implement the same interface of chromium-pickle with JavaScript, and then replace chromium-pickle with it.

YurySolovyov commented 9 years ago

@zcbenz this one was for asar cli part, I'm not sure about atom-shell part.

zcbenz commented 9 years ago

@YuriSolovyov The JavaScript implementation should be compatible with the C++ Pickle, so we don't have to rewrite in atom-shel.

YurySolovyov commented 9 years ago

yeah, it is, at least test were passing and I was able to list package directory of .asar file created with "old" packager. can you please talke a look if it works for you? just replace one file that I gist'ed

question for better times: if we are about to make changes to packager, why not change format a little? is backwards compatibility that needed in this case? couple of bytes is not a big deal though.

YurySolovyov commented 9 years ago

So, after some trial and fail, we've got 2 things:

it IS possible to make packager in pure js with node apis
we still need access form c++ level in atom-shell to make some things work with asar.

One question so far: can we just use node/io apis to do c++ stuff? (this also allows to drop some deps) I mean hey, it is already there, why not use it?

zcbenz commented 9 years ago

@YuriSolovyov There are some C++ code also using asar in atom-shell (like tray module), so it is not possible to only use Node API.

@bwin I'm down to drop Pickle on both asar and atom-shell.

YurySolovyov commented 9 years ago

I assume tray needs it to display icon from asar package?

zcbenz commented 9 years ago

@YuriSolovyov Yeah, every API that needs an icon can read asar archives.

YurySolovyov commented 9 years ago

... and node fs module is patched only on js side?

zcbenz commented 9 years ago

Right.

YurySolovyov commented 9 years ago

Atm all that comes in mind will lead to code duplication of some sort:

C++:

You need to be able to decode archive headers and read files: basically this means you need to do same thing that js packager do, but in c++ -> get size -> get json string -> parse json -> cache header somewhere. (but cache it per session, so that if archive is changed and user restarts app, it needs to be refreshed) this is where duplication happens: this is just 1:1 of what packager does.
asar-aware apis need to be able to read files in archive by offset defined in archive header. not sure if there are api for that in chome, but it looks like libuv has some

JS: Most operations basically can be re-used from packager code, since it knows how to read from archive, this means asar packager could become dependency of atom-shell, just like other internal modules.

Thoughts?

zcbenz commented 9 years ago

The Chromium part and Node part of atom-shell are based on completely different C/C++ layers, so unfortunately we have to duplication code here.

YurySolovyov commented 9 years ago

Crazy idea: how about making atom executable work in packager mode if some flag is passed? like

atom.exe --create-package --path ./res --out app.asar

that would allow to write it once in atom-shell c++ layer and use both from c++ and js(via bindings)

joshuawarner32 commented 9 years ago

As a user of atom-shell, if I may bring up a few points:

Backwards compatibility is an issue. I don't think that should stop us from making backwards-incompatible changes to the format (I'm all for keeping cruft out of the design) - but it does mean that there should be very clear messaging about the change. Ideally, the formats will also be distinguishable (by magic number, for instance), so for at least a few releases, new versions of atom-shell will be able to read both the new and old format. You can drop support for the old format shortly thereafter, if you want. That way, we can manage the update process smoothly (we, as a company building apps on top of atom-shell). For background, we actually don't deploy the majority of our code in the binary installer package we give users; instead, that's downloaded at startup from an "update" server.
I really like @bwin's design - but I would include a two extra things in the magic number: a simple "format" version number ("ASAR2", rather than "ASAR"), and some character not in the 7-bit ASCII plane, to prevent asar files being accidentally transferred as text. Also, I'd use "\n" rather than "\r\n" for the same reason - to detect when some silly windows app decides to try to convert unix line endings to windows ones when processing the file. For inspiration, look at the png magic number (89 50 4e 47 0d 0a 1a 0a) (see: http://en.wikipedia.org/wiki/Portable_Network_Graphics)
Also regarding the new design, please don't perpetuate the use of MD5. I know this particular application is not really security-critical (well, probably...), but MD5 needs to die.
@YuriSolovyov, using atom.exe as a packager is fine, but I'd really want a stand-alone tool as well, so I don't have to manage access to atom-shell builds for both the build platform (linux) and the target platform (windows/mac).

YurySolovyov commented 9 years ago

@joshuawarner32 isn't asar format platform-independent? I think asar created on linux should work just fine on all rest platform as well, and vice versa... if it is not, then we should make it so.

joshuawarner32 commented 9 years ago

@YuriSolovyov well, technically (at least from reading the chromium-pickle code), it's not currently portable to big-endian architectures - but I don't think anybody cares, and that's not the issue here.

My issue is that, when building my app, I already manage download/extracting, modifying, and re-bundling of atom-shell builds for mac and windows. I don't want to add another layer of complexity in also managing downloading/extracting an atom-shell build for linux (my build platform). Also, while there are certainly workarounds, there actually isn't atom-shell build that works on debian wheezy, which represents the majority of our build slaves.

TL;DR: having a stand-alone tool to pack/unpack asars, independent of atom-shell, is really important to me. If you want to also integrate that functionality in atom.exe itself, far be it from me to stop you.

YurySolovyov commented 9 years ago

@joshuawarner32 well, that was just "crazy idea"

pombredanne commented 4 years ago

in this comment https://github.com/libarchive/libarchive/issues/1259#issuecomment-541777182 @kientzle rightly points that it is rather impractical to craft extraction of ASAR in libarchive that routes extraction based on content: there can be no magic without magic numbers ;) ... so is this really a dead ticket and proposal?

MarshallOfSound commented 1 month ago

Closing out this old issue as the ASAR format is now stable. Checksum / integrity support now exists. Prepending magic bytes to the format is something that can still be discussed, if an interested party finds this issue feel free to raise a new issue outlining what would be ideal

electron / asar

Changes to the asar format? Magic number, checksum, filesize #16