airbnb / binaryalert

BinaryAlert: Serverless, Real-time & Retroactive Malware Detection.
https://binaryalert.io
Apache License 2.0

Error analyzing PDFs: `pdftotext` not found [JSONDecodeError] #92

Closed austinbyers closed 6 years ago

austinbyers commented 6 years ago

Background

yextend can parse PDFs and scan their individual components, which is awesome! Unfortunately, this relies on pdftotext, a program that is not available in Lambda. So when BinaryAlert scans a PDF, yextend returns an empty string, and parsing that empty output as JSON raises a JSONDecodeError.
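The failure can be reproduced without yextend at all. The sketch below assumes the scanner's stdout is handed straight to `json.loads`; the function names are hypothetical and are not taken from the BinaryAlert code.

```python
import json


def decode_scanner_output(raw: str):
    """Decode scanner stdout as JSON (hypothetical helper, not BinaryAlert's own)."""
    return json.loads(raw)


def fails_on_empty_output() -> bool:
    """Return True if decoding an empty string raises JSONDecodeError."""
    try:
        decode_scanner_output("")
    except json.JSONDecodeError:
        # An empty string is not valid JSON, so the decoder raises here.
        return True
    return False
```

Calling `fails_on_empty_output()` shows that an empty string is enough to trigger the error, regardless of what the PDF contained.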

Desired Change

  1. Add error handling around yextend: if it fails for any reason, we should still continue with the regular analysis.
  2. Bundle pdftotext in the Lambda dependencies (this may not happen in v1.1).
  3. Problems like this will be mitigated in the future, once yextend supports portable installation.
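As a rough illustration of the first item, here is a minimal sketch of a wrapper that never lets a scanner failure escape. It assumes the scanner is invoked as a subprocess that prints a JSON list to stdout; `run_archive_scanner` and the `command` argument are hypothetical names, and the actual yextend command line is left to the caller rather than guessed here.

```python
import json
import logging
import subprocess
from typing import Any, List, Sequence

LOGGER = logging.getLogger(__name__)


def run_archive_scanner(command: Sequence[str]) -> List[Any]:
    """Run an external scanner and return its JSON results, or [] on any failure.

    `command` is supplied by the caller (assumption: the tool prints a JSON
    list to stdout). A missing binary, a non-zero exit code, undecodable
    output, or invalid JSON all result in an empty list.
    """
    try:
        raw = subprocess.check_output(command, stderr=subprocess.STDOUT)
        results = json.loads(raw.decode("utf-8"))
    except (OSError, subprocess.CalledProcessError, ValueError) as error:
        # JSONDecodeError and UnicodeDecodeError are both subclasses of ValueError.
        LOGGER.warning("Scanner %r failed: %s", list(command), error)
        return []
    # Anything other than a list is treated as "no results".
    return results if isinstance(results, list) else []
```

Because the wrapper returns an empty list instead of raising, the caller can merge whatever it gets with the results of the regular analysis and carry on.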