CybercentreCanada / assemblyline

AssemblyLine 4: File triage and malware analysis
https://cybercentrecanada.github.io/assemblyline4_docs/
MIT License

EPUB identified as java/jar #203

Closed kam193 closed 8 months ago

kam193 commented 8 months ago

Describe the bug: The following file, file.epub.zip (source, pass: zippy), is identified as java/jar, although it is a valid EPUB file.

To Reproduce: Steps to reproduce the behavior:

  1. Upload the file and look at the identified file type

Expected behavior: The file is identified as archive/zip, or as a more specialized type that allows Extractor to unpack the data inside.

gdesmar commented 8 months ago

If you're in a hurry, you can set use_custom_safelisting to False in Extract to bypass the Java-specific (and other) safelisting until this is correctly fixed. I will add identification for epub files as document/epub, and handle it like a simple zip in Extract. I see that Characterize already gives most of the characteristics for it through exiftool. I'm thinking that a screenshot preview of the first page would be nice, but besides that, do you see any other need, or anything I'm missing regarding epubs?
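For readers wondering how an EPUB can be told apart from a JAR, here is a minimal sketch. It relies on the EPUB container convention that the first ZIP entry is a file named `mimetype` containing `application/epub+zip`. The function name `looks_like_epub` is hypothetical, and this is not the rule Assemblyline's Identify actually uses; it only illustrates the idea.

```python
import io
import zipfile

# Media type string that an EPUB container stores in its "mimetype" entry.
EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """Hypothetical helper: True if data is a ZIP whose first entry declares EPUB."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
            # The EPUB container format expects "mimetype" to be the first entry.
            if not names or names[0] != "mimetype":
                return False
            return archive.read("mimetype").strip() == EPUB_MIMETYPE
    except zipfile.BadZipFile:
        # Not a ZIP container at all.
        return False
```

A JAR built the usual way has no such `mimetype` entry, so a check like this would return False for it even though both formats are ZIP archives.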

kam193 commented 8 months ago

Don't worry, there is no hurry; this is just an improvement request. A preview of the first page sounds great, and I don't think anything more is needed at the moment. Extracting the content should allow analyzing whatever is inside. Would you have time to ensure other ebook formats (Mobi? azw3?) are also supported in the same way? Handling them is probably almost the same.

gdesmar commented 8 months ago

The unit tests/integration tests within Assemblyline are going to be on the light side, because I want to make absolutely sure we do not get into any trouble with books, but from my manual testing, it should all work.

Mobi files, and AZW3 files, which are based on Mobi, are going to be identified as document/mobi. Extract is going to be able to extract their content thanks to the https://github.com/iscc/mobi library. There is a chance that extracting a mobi file yields an epub file, which would go back to Extract and be extracted again. The system should use caching for duplicated images and files, but if you find it bothersome in the file tree or for some other reason, you're welcome to open another ticket for Extract. Instead of an epub, it may extract an html or pdf file, and I wasn't sure whether we still wanted it in those cases.
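As an illustration of why Mobi and AZW3 can share one type, here is a minimal sketch of a signature check. Both are Palm Database files whose header carries the type `BOOK` and creator `MOBI` at byte offsets 60 to 67. The function name `looks_like_mobi` is hypothetical, and this is an assumption about one workable check, not the rule Identify ships with.

```python
# Offset of the 4-byte type and 4-byte creator fields in a Palm Database header.
PALMDB_TYPE_OFFSET = 60
MOBI_SIGNATURE = b"BOOKMOBI"


def looks_like_mobi(data: bytes) -> bool:
    """Hypothetical helper: True if the Palm Database type/creator is BOOK/MOBI."""
    end = PALMDB_TYPE_OFFSET + len(MOBI_SIGNATURE)
    # Slicing past the end of short inputs just yields fewer bytes, so no error.
    return data[PALMDB_TYPE_OFFSET:end] == MOBI_SIGNATURE
```

A signature check like this says nothing about what the book contains, so whether extraction yields an epub, html, or pdf is only known after Extract has run.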

I very quickly looked into other ebook formats, and I am not certain how often people would need to analyze them, or how much of an attack vector they are. A few of them have not been updated for a long time, like DjVu since 2005 and FictionBook since 2008, and are very difficult to identify (FictionBook is just a big xml file). For those reasons, I'll wait for a direct request before supporting other ebook formats.

For now, we need a new minor core release/build for Identify to be updated. I don't plan on triggering it until we get a few more things in, but keep an eye out for 4.5.0.6 or higher, and then the next Extract build. :)

kam193 commented 8 months ago

Great, thank you! I believe EPUB/Mobi are the most important. When it comes to extracting data, I would extract everything because a) it prevents hiding content inside a book archive, and b) it triggers specialized services to look at the core files.

When it comes to other formats, not only books, I think the most essential thing is to extract the content if the file is any form of known archive. In that case, I wouldn't bother you about identification as long as the file had been identified as an archive and extracted.
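The "extract everything from any known archive" idea can be sketched for the ZIP case as follows. This is a minimal illustration using only the Python standard library; `extract_zip_members` is a hypothetical name, and it does not reflect how the Extract service is implemented (it ignores passwords, size limits, and nested archives).

```python
import io
import zipfile
from typing import Dict


def extract_zip_members(data: bytes) -> Dict[str, bytes]:
    """Hypothetical helper: return every file in a ZIP container, keyed by member name."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        # Not a ZIP container, whatever type it was identified as.
        return {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries, keep file contents
        }
```

Because the check looks at the container rather than the identified type, an EPUB, a JAR, and a plain ZIP would all be unpacked the same way, and each member could then be handed to more specialized analysis.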