GaloisInc / macaw

Open source binary analysis tools.
BSD 3-Clause "New" or "Revised" License
208 stars 21 forks source link

Simulation failure on AArch32: `ldm` semantics perform an unnecessary memory read #266

Open RyanGlScott opened 2 years ago

RyanGlScott commented 2 years ago

Consider this C program, appropriately decorated for use in macaw's test suite:

// bug.c
int __attribute__((noinline)) test_string_lit() {
  char x[5] = "ABCD";
  return x[0] == 'A';
}

void _start() {
  test_string_lit();
}

Compile this program with the an AArch32 version of musl-gcc:

$ armv7m-linux-musleabi-gcc -nostdlib -static -fno-stack-protector -no-pie bug.c -o bug.exe
$ armv7m-linux-musleabi-objdump -d bug.exe 

bug.exe:     file format elf32-littlearm

Disassembly of section .text:

00010074 <test_string_lit>:
   10074:       e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
   10078:       e28db000        add     fp, sp, #0
   1007c:       e24dd00c        sub     sp, sp, #12
   10080:       e59f2038        ldr     r2, [pc, #56]   ; 100c0 <test_string_lit+0x4c>
   10084:       e08f2002        add     r2, pc, r2
   10088:       e24b300c        sub     r3, fp, #12
   1008c:       e8920003        ldm     r2, {r0, r1}
   10090:       e5830000        str     r0, [r3]
   10094:       e2833004        add     r3, r3, #4
   10098:       e5c31000        strb    r1, [r3]
   1009c:       e55b300c        ldrb    r3, [fp, #-12]
   100a0:       e3530041        cmp     r3, #65 ; 0x41
   100a4:       03a03001        moveq   r3, #1
   100a8:       13a03000        movne   r3, #0
   100ac:       e20330ff        and     r3, r3, #255    ; 0xff
   100b0:       e1a00003        mov     r0, r3
   100b4:       e28bd000        add     sp, fp, #0
   100b8:       e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
   100bc:       e12fff1e        bx      lr
   100c0:       0000004c        .word   0x0000004c

000100c4 <_start>:
   100c4:       e92d4800        push    {fp, lr}
   100c8:       e28db004        add     fp, sp, #4
   100cc:       ebffffe8        bl      10074 <test_string_lit>
   100d0:       e1a00000        nop                     ; (mov r0, r0)
   100d4:       e8bd8800        pop     {fp, pc}

Now run this test harness on bug.exe (the code was extracted from macaw-aarch32-symbolic's test suite):

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE ImplicitParams #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeOperators #-}
module Main (main) where

import           Control.Lens ( (^.) )
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import qualified Data.ElfEdit as Elf
import qualified Data.Foldable as F
import qualified Data.Map as Map
import           Data.Maybe ( mapMaybe )
import qualified Data.Parameterized.Classes as PC
import qualified Data.Parameterized.Nonce as PN
import           Data.Parameterized.Some ( Some(..) )
import           Data.Proxy ( Proxy(..) )

import qualified Language.ASL.Globals as ASL
import qualified Data.Macaw.Architecture.Info as MAI

import           Data.Macaw.AArch32.Symbolic ()
import qualified Data.Macaw.ARM as MA
import qualified Data.Macaw.ARM.ARMReg as MAR
import qualified Data.Macaw.Discovery as M
import qualified Data.Macaw.Symbolic as MS
import qualified Data.Macaw.Symbolic.Testing as MST
import qualified What4.Config as WC
import qualified What4.Expr as WE
import qualified What4.Expr.Builder as WEB
import qualified What4.Interface as WI
import qualified What4.ProblemFeatures as WPF
import qualified What4.Solver as WS

import qualified Lang.Crucible.Backend as CB
import qualified Lang.Crucible.Backend.Online as CBO
import qualified Lang.Crucible.Simulator as CS
import qualified Lang.Crucible.LLVM.MemModel as LLVM

main :: IO ()
main = do
  -- These are pass/fail in that the assertions in the "pass" set are true (and
  -- the solver returns Unsat), while the assertions in the "fail" set are false
  -- (and the solver returns Sat).
  let passRes = MST.SimulationResult MST.Unsat
  mkSymExTest passRes "bug.exe"

hasTestPrefix :: Some (M.DiscoveryFunInfo arch) -> Maybe (BS8.ByteString, Some (M.DiscoveryFunInfo arch))
hasTestPrefix (Some dfi) = do
  bsname <- M.discoveredFunSymbol dfi
  if BS8.pack "test_" `BS8.isPrefixOf` bsname
    then return (bsname, Some dfi)
    else Nothing

mkSymExTest :: MST.SimulationResult -> FilePath -> IO ()
mkSymExTest expected exePath = do
  bytes <- BS.readFile exePath
  case Elf.decodeElfHeaderInfo bytes of
    Left (_, msg) -> fail ("Error parsing ELF header from file '" ++ show exePath ++ "': " ++ msg)
    Right (Elf.SomeElf ehi) -> do
      case Elf.headerClass (Elf.header ehi) of
        Elf.ELFCLASS32 ->
          symExTestSized expected ehi exePath MA.arm_linux_info
        Elf.ELFCLASS64 -> fail "64 bit ARM is not supported"

symExTestSized :: MST.SimulationResult
               -> Elf.ElfHeaderInfo 32
               -> FilePath
               -> MAI.ArchitectureInfo MA.ARM
               -> IO ()
symExTestSized expected ehi exePath archInfo = do
   binfo <- MST.runDiscovery ehi exePath MST.toAddrSymMap archInfo MA.armPLTStubInfo
   let funInfos = Map.elems (MST.binaryDiscState (MST.mainBinaryInfo binfo) ^. M.funInfo)
   let testEntryPoints = mapMaybe hasTestPrefix funInfos
   F.forM_ testEntryPoints $ \(name, Some dfi) -> do
     putStrLn ("Testing " ++ BS8.unpack name ++ " at " ++ show (M.discoveredFunAddr dfi))
     Some (gen :: PN.NonceGenerator IO t) <- PN.newIONonceGenerator
     sym <- WEB.newExprBuilder WEB.FloatRealRepr WE.EmptyExprBuilderState gen
     CBO.withYicesOnlineBackend sym CBO.NoUnsatFeatures WPF.noFeatures $ \bak -> do
       -- We are using the z3 backend to discharge proof obligations, so
       -- we need to add its options to the backend configuration
       let solver = WS.z3Adapter
       let backendConf = WI.getConfiguration sym
       WC.extendConfig (WS.solver_adapter_config_options solver) backendConf

       execFeatures <- MST.defaultExecFeatures (MST.SomeOnlineBackend bak)
       let Just archVals = MS.archVals (Proxy @MA.ARM) Nothing
       let extract = armResultExtractor archVals
       let ?memOpts = LLVM.defaultMemOptions
       simRes <- MST.simulateAndVerify solver WS.defaultLogData bak execFeatures archInfo archVals binfo MST.DefaultMemModel extract dfi
       if expected == simRes
         then putStrLn "Test passed"
         else fail $ "Expected " ++ show expected ++ ", but got " ++ show simRes

-- | ARM functions with a single scalar return value return it in %r0
--
-- Since all test functions must return a value to assert as true, this is
-- straightforward to extract
armResultExtractor :: ( CB.IsSymInterface sym
                      )
                   => MS.ArchVals MA.ARM
                   -> MST.ResultExtractor sym MA.ARM
armResultExtractor archVals = MST.ResultExtractor $ \regs _mem k -> do
  let re = MS.lookupReg archVals regs (MAR.ARMGlobalBV (ASL.knownGlobalRef @"_R0"))
  k PC.knownRepr (CS.regValue re)

If you run this, it will unexpectedly fail:

$ runghc Bug.hs
Testing test_string_lit at 0x10074
Bug.hs: user error (Expected SimulationResult Unsat, but got SimulationResult Sat)

Note that if you change the array to be of a smaller length, e.g.,

  char x[2] = "A";

Then the test will pass.

@travitch suspects that the issue may arise from macaw-symbolic's interval logic here.

RyanGlScott commented 2 years ago

After quite a bit of investigation, it seems likely that the issue lies in the way macaw-semmc converts the translated What4 expressions into Macaw. In particular, the semantics for the LDM instruction always add this ReadMem assignment, which is later passed to a Mux:

  val_aVCFX <- let
                 memRep_aVCFV
                   = (MC.BVMemRepr (S.knownNat :: S.NatRepr 4)) MC.LittleEndian
                 assignGen_aVCFW
                   = GHC.Base.join
                       (Data.Macaw.SemMC.Generator.addAssignment
                          <$>
                            ((MC.ReadMem <$> pure val_aVCFU)
                               <*> pure memRep_aVCFV))
               in (MC.AssignedValue <$> assignGen_aVCFW)
  val_aVCFY <- ((Data.Macaw.ARM.Semantics.ARMSemantics._df_df_getSlice_4N_32length_1_norm
                   val_aVCFX)
                  lval_aVCEi)
                 lval_aVCFf
  val_aVCFZ <- Data.Macaw.SemMC.TH.addApp
                 ((MC.Eq lval_aVCEh) val_aVCFY)
  let lval_aVCG0
        = (Data.Macaw.SemMC.TH.bvValue (S.knownNat :: S.NatRepr 1)) 0
  val_aVCG1 <- Data.Macaw.SemMC.TH.addApp
                 ((((MC.Mux (MT.BVTypeRepr (S.knownNat :: S.NatRepr 1)))
                      val_aVCFZ)
                     lval_aVCEh)
                    lval_aVCG0)

If you go far back enough in the generated code, you'll find that this particular Mux's condition concretely evaluates to false, so in theory, the ReadMem shouldn't be needed. But the ReadMem assignment is added anyway, since the addAssignment function is eagerly invoked using a monadic bind. Ugh.

I've tried lots of different tweaks in the macaw-semmc code generator to make things lazier, but nothing quite seems to work.