CybercentreCanada / assemblyline

AssemblyLine 4: File triage and malware analysis
https://cybercentrecanada.github.io/assemblyline4_docs/
MIT License
226 stars 14 forks

Support for ASAR Archives #210

Open kam193 opened 5 months ago

kam193 commented 5 months ago

Is your feature request related to a problem? Please describe. I came across ASAR archives, which are Electron app archives (https://www.electronjs.org/docs/latest/tutorial/asar-archives). It looks like they are currently only partially supported.

Describe the solution you'd like I would like stable support for ASAR in Extractor and clear file identification.

Describe alternatives you've considered A separate service could be created, although this really fits perfectly into Extractor. The trouble is that I don't see good non-JS extractors so far.

Additional context I came across two files, and AssemblyLine behaves a little strangely for them.

0c1ddd33e630f4ac684880f0e673dfa84919272494c11da0f1ec05fb4f919ce8 This file was once identified as document/email, and on a re-submit a few minutes later as archive/ar. The second time, Extractor tried to extract data but failed entirely.

abe19b0964daf24cd82c6db59212fd7a61c4c8335dd4a32b8e55c7c05c17220d This file was once identified as code/html, and then twice after resubmitting as archive/ar. On one re-submit, Extractor failed entirely with the "pre-empted" error; on the second try some files were extracted, although 7zip reported some errors (I didn't find the exact error in the logs).

As I understand it, archive/ar is not a fully correct identification, but it lets Extractor try. The wrong identifications happened when the file was downloaded from a URL (in both cases using my service - I had to set a specific User-Agent; can the service influence the identification?).

gdesmar commented 5 months ago

From identify's defaults:

    {"al_type": "archive/tar", "regex": r"^(GNU|POSIX) tar archive"},
    {"al_type": "archive/ar", "regex": r"ar archive"},
    {"al_type": "archive/vhd", "regex": r"^Microsoft Disk Image"},

The file's magic is "Electron ASAR archive, header length: 558428 bytes". That regex definitely needs to be anchored or deleted. I made a PR to only look for ^current ar archive and to add Electron's magic as archive/asar.
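To illustrate the false positive (assuming identify applies its magic patterns case-insensitively - my reading, since the lowercase pattern could not otherwise hit "ASAR"):

```python
import re

magic = "Electron ASAR archive, header length: 558428 bytes"

# The old, unanchored pattern finds "AR archive" inside "ASAR archive",
# so ASAR files were typed archive/ar:
assert re.search(r"ar archive", magic, re.IGNORECASE) is not None

# An anchored pattern along the lines of the fix no longer fires
# on the Electron magic string:
assert re.search(r"^current ar archive", magic, re.IGNORECASE) is None
```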

Regarding 7z's error, do you know in more detail how the ASAR archive is created? From their GitHub page, they claim it works like tar. From a quick test with abe19b0964daf24cd82c6db59212fd7a61c4c8335dd4a32b8e55c7c05c17220d (because I couldn't find 0c1ddd33e630f4ac684880f0e673dfa84919272494c11da0f1ec05fb4f919ce8 on VT or MB), 7z thinks the file is a gzip file containing a tar archive, similar to a .tgz. When Extract finds a single tar file inside an archive, it skips a step and extracts that tar file directly. It tells you about it in that result section in case you wanted the associated hashes. Out of that tar, it looks like it can extract about 132 files and two folders.

Using the (very old) pyasar library, it is able to extract about 2081 files in 451 folders. That is way more than we usually support in a submission. I do not know what the end goal is here, or which files are more important, but there are a lot of SVGs (1237), which can contain JavaScript. The pyasar library doesn't look very maintained, and some samples I tested are crashing, but for those it would be a very easy fix: replace line 219 with files = json.loads(header.strip('\x00')). (Alternative libraries?)
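For reference, my understanding of the on-disk layout is two Chromium pickles (size framing) followed by the JSON index, then the concatenated file contents. A minimal index reader under that assumption (a sketch, not pyasar's actual code):

```python
import json
import struct

def read_asar_index(path):
    """Read the JSON index of an ASAR archive.

    Assumed layout (Chromium-pickle framing):
      bytes  0-3   uint32  always 4 (size of the next field)
      bytes  4-7   uint32  size of the header pickle buffer
      bytes  8-11  uint32  payload size of the header pickle
      bytes 12-15  uint32  length of the JSON index string
      bytes 16-    the JSON index (padded to 4 bytes); file data follows.
    """
    with open(path, "rb") as f:
        _, header_buf_size, _, json_len = struct.unpack("<4I", f.read(16))
        index = json.loads(f.read(json_len))
    # File contents start right after the two pickles; each leaf entry's
    # "offset" (a string in the JSON) is relative to this base.
    content_base = 8 + header_buf_size
    return index, content_base
```

Each file entry in the index carries a "size" and a string "offset", so extraction is a seek to `content_base + int(offset)` and a read of `size` bytes.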

Regarding the wrong identification when your URL service is downloading that file: yes, you are totally right, and the service does influence the identification if it's running as a privileged service. I assume it is. When a privileged service extracts a file and re-uploads it, it writes directly to the file index. If you are unprivileged, you should go through service-server, which should already be up to date, so it would be a different problem. When you resubmit the file (by hash, I assume), it goes through the core, redoes the identification like it would for a new file, and updates the fileinfo in the file index with the new type. If you rebuild your service using the latest base, it should identify the file correctly the first time.

On a side note, my dev computer is using libmagic 5.39 while we are using libmagic 5.44 in our containers, and I do have some ASAR archives identified as document/email and code/html. If you can confirm the current libmagic in your URL downloader service, that would be great.
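One quick, best-effort way to check which libmagic a container ships is to ask the `file` CLI, which links against it (the helper name here is mine):

```python
import shutil
import subprocess

def libmagic_version():
    """Return the first line of `file --version` (e.g. "file-5.44"),
    or None if the `file` CLI is not installed."""
    if shutil.which("file") is None:
        return None
    out = subprocess.run(["file", "--version"], capture_output=True, text=True)
    return out.stdout.splitlines()[0] if out.stdout else None
```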

kam193 commented 5 months ago

Hey, thanks for the detailed analysis and fixing the identification!

First, I've uploaded the second file to VT. Both come from an info stealer which tried to replace a cryptocurrency wallet app with them. So, re: end goal - I just wanted to find out if I can easily see what the real actions are. But you're right, 2k files doesn't sound reasonable to extract. Unfortunately, I don't have deeper knowledge of the ASAR format (yet).

I think the best option currently is to stay with the identification update only. When I find time, I'll take a look at the format and craft a custom service for the extraction - most probably with the approach I use for bundled Python executables: trying to estimate where the interesting code usually is, and leaving full extraction optional.

Thanks for the clarification about identification. This is indeed the case; #167 is still hitting my setup (although I have to check again - I've recently fixed some networking issues (it's always DNS)). And indeed, the libmagic was outdated... I had already rebuilt the service for AL 4.5.0, but my configuration had an overridden container image with a hard-coded older version :joy:

kam193 commented 1 month ago

FYI: I've created a simple service to extract ASAR. So far, the default filtering just omits the node_modules directory. I suspect it may only be good for the cases I was analysing, but we'll see.
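The filtering idea can be sketched as a walk over the ASAR JSON index (hypothetical helper; assumes the nested "files" layout from the Electron docs, where directory nodes hold a "files" dict and leaves hold "size"/"offset"):

```python
def iter_extractable(index, prefix="", skip_dirs=("node_modules",)):
    """Yield (path, entry) for every file in an ASAR index,
    skipping any directory subtree named in skip_dirs."""
    for name, entry in index.get("files", {}).items():
        path = prefix + name
        if "files" in entry:  # directory node
            if name in skip_dirs:
                continue
            yield from iter_extractable(entry, path + "/", skip_dirs)
        else:  # file leaf
            yield path, entry
```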

gdesmar commented 1 month ago

Someone from the community has tried to add it to JsJaws (CybercentreCanada/assemblyline-service-jsjaws/pull/726), but we have given no follow-up on it (and a few things should be improved before merging). From my understanding, it would give about the same result by only extracting when isfiles is True, since node_modules is always a folder?

kam193 commented 1 month ago

I think so, although I don't know if node_modules is the only such directory in the archive; I think the format doesn't exclude additional directories 🤔 I also offer the possibility to extract everything (likely not useful 😆), and I have in mind another option: a regex/key to extract selected other files. I don't want to exclude extracting node modules completely, as I didn't find a way to verify that the node packages weren't tampered with.