VirusTotal / yara

The pattern matching swiss knife
https://virustotal.github.io/yara/
BSD 3-Clause "New" or "Revised" License

Question on block lifetime requirements for incremental scanning #1994

Closed mavam closed 8 months ago

mavam commented 8 months ago

I'm in the middle of implementing a streaming engine with the YARA C API and would like to better understand the memory block lifetime invariants. The mental model I have is as follows:

  1. I can construct a YR_SCANNER and provide it a memory block iterator, such that periodic calls to yr_scanner_scan_mem_blocks "consume" zero or more blocks.
  2. By "consuming" I mean buffering blocks until the first instance of the YR_MEMORY_BLOCK_ITERATOR returns ERROR_SUCCESS.
  3. Returning ERROR_SUCCESS in the iterator is equivalent to triggering a scan. Returning ERROR_BLOCK_NOT_READY merely feeds the blocks to the scanner but doesn't kick off any scanning.

First question

Are my assumptions correct? If so, the next question below goes into a concrete scenario. 👇

Second Question

Say I have memory blocks i, i+1, and i+2, and let my YR_MEMORY_BLOCK_ITERATOR return for block i and i+1 the status code ERROR_BLOCK_NOT_READY, and for i+2 ERROR_SUCCESS. (I call yr_scanner_scan_mem_blocks for every new block.) I assume that I need to keep the corresponding YR_MEMORY_BLOCK for i and i+1 valid/allocated until the call to yr_scanner_scan_mem_blocks accessing the iterator on block i+2 returns ERROR_SUCCESS.

Can I free the blocks i, i+1, i+2 after I called yr_scanner_scan_mem_blocks? I'd like to keep reusing the scanner for subsequent scans, that is, for blocks i+3, i+4, etc. But I'm not sure if reusing the same scanner requires any form of cross-block random access, which would require that all blocks of a scanner remain in memory.

What I noticed during testing is that a rule that spans over two memory blocks successfully matches with the scanner, e.g., $foo and $bar where $foo matches on block i and $bar on block i+1. This led me to infer that the "scope" of a rule match is equivalent to the block sequence corresponding to the block iterator returning ERROR_BLOCK_NOT_READY up to the first instance of ERROR_SUCCESS.

plusvic commented 8 months ago

Are my assumptions correct? If so, the next question below goes into a concrete scenario.

They are mostly correct, but YARA doesn't buffer the blocks at all. With every call to yr_scanner_scan_mem_blocks, YARA asks the iterator for one block at a time, and the iterator is responsible for returning the blocks sequentially and in order. YARA keeps asking for more blocks until the last block is reached, or until one of the blocks is not ready to be passed to YARA, which the iterator indicates with an ERROR_BLOCK_NOT_READY error.

When the iterator returns ERROR_BLOCK_NOT_READY, YARA exits from yr_scanner_scan_mem_blocks, but you can resume the scan by calling yr_scanner_scan_mem_blocks again. The first thing YARA does in this second call to yr_scanner_scan_mem_blocks is to ask the iterator for the block that it could not provide before.
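The resume behaviour described above can be sketched with a small self-contained model. Note that this does not use the YARA headers: sim_stream, sim_scan, and the SIM_* codes are hypothetical stand-ins invented for illustration, and the real iterator interface differs in its details. The sketch only shows the control flow: blocks are pulled strictly in order, a call stops early when the next block has not arrived, and the following call asks for that same block again.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in status codes; these are NOT YARA's real constants. */
enum sim_status { SIM_SUCCESS = 0, SIM_BLOCK_NOT_READY = 1 };

/* Hypothetical stream state: `total` blocks exist, `ready` have arrived. */
typedef struct {
  size_t total;   /* number of blocks in the whole stream */
  size_t ready;   /* number of blocks that have arrived so far */
  size_t cursor;  /* index of the next block to hand to the scanner */
  size_t scanned; /* number of blocks the simulated scanner consumed */
} sim_stream;

/* Models one resumable scan call: consume blocks in order until the
   stream ends (SIM_SUCCESS) or the next block is missing
   (SIM_BLOCK_NOT_READY). The cursor is left on the missing block, so
   the next call starts by requesting that same block again. */
enum sim_status sim_scan(sim_stream* s)
{
  assert(s->ready <= s->total);

  while (s->cursor < s->total) {
    if (s->cursor >= s->ready)
      return SIM_BLOCK_NOT_READY;

    s->scanned++;
    s->cursor++;
  }

  return SIM_SUCCESS;
}
```

With three blocks in total and only one available, the first sim_scan call consumes block 0 and reports SIM_BLOCK_NOT_READY. Once the remaining blocks have arrived, a second call picks up at block 1 and finishes with SIM_SUCCESS.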

Say I have memory blocks i, i+1, and i+2, and let my YR_MEMORY_BLOCK_ITERATOR return for block i and i+1 the status code ERROR_BLOCK_NOT_READY, and for i+2 ERROR_SUCCESS. (I call yr_scanner_scan_mem_blocks for every new block.) I assume that I need to keep the corresponding YR_MEMORY_BLOCK for i and i+1 valid/allocated until the call to yr_scanner_scan_mem_blocks accessing the iterator on block i+2 returns ERROR_SUCCESS.

If your memory block iterator returns ERROR_BLOCK_NOT_READY for block i, then it can't return block i + 1 or i + 2. As described before, your iterator must return the blocks in order, so YARA will keep trying to get block i until it succeeds, and your iterator should not return block i + 1 until block i has been successfully returned.

Can I free the blocks i, i+1, i+2 after I called yr_scanner_scan_mem_blocks? I'd like to keep reusing the scanner for subsequent scans, that is, for blocks i+3, i+4, etc. But I'm not sure if reusing the same scanner requires any form of cross-block random access, which would require that all blocks of a scanner remain in memory.

When YARA asks the iterator for block i + 1, block i won't be used anymore within that pass. However, one thing must be kept in mind: YARA can perform multiple iterations over the blocks. That is, YARA can ask your iterator to provide all the blocks (from the first one to the last one) multiple times. So the iterator should keep all the blocks in memory until yr_scanner_scan_mem_blocks completes the scan (the scan is considered complete when yr_scanner_scan_mem_blocks returns with any error code except ERROR_BLOCK_NOT_READY). In fact, your iterator must guarantee that after the first iteration, it won't return ERROR_BLOCK_NOT_READY for any block.

Let me know if you have any further questions.
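One way to satisfy this lifetime requirement is to retain every block that has been handed out until the scan completes, so that a second pass can replay them all. The following is a minimal sketch under that assumption. block_pool and its functions are hypothetical helpers, not part of the YARA API, and the fixed POOL_MAX capacity is only there to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POOL_MAX 8 /* arbitrary capacity for this sketch */

/* Hypothetical store that owns a copy of every block of the current scan. */
typedef struct {
  unsigned char* data[POOL_MAX];
  size_t size[POOL_MAX];
  size_t count;  /* number of blocks retained */
  size_t cursor; /* next block to hand out in the current pass */
} block_pool;

/* Copies a newly arrived block into the pool. Returns 0 on success and
   -1 if the pool is full, the block is empty, or allocation fails. */
int pool_push(block_pool* p, const void* bytes, size_t n)
{
  unsigned char* copy;

  if (p->count == POOL_MAX || n == 0)
    return -1;

  copy = malloc(n);
  if (copy == NULL)
    return -1;

  memcpy(copy, bytes, n);
  p->data[p->count] = copy;
  p->size[p->count] = n;
  p->count++;
  return 0;
}

/* Starts a new pass over the retained blocks, as a first() callback would. */
void pool_rewind(block_pool* p)
{
  p->cursor = 0;
}

/* Returns the next retained block and stores its size in *n, or returns
   NULL once every block of this pass has been handed out. */
const unsigned char* pool_next(block_pool* p, size_t* n)
{
  assert(p->cursor <= p->count);

  if (p->cursor == p->count)
    return NULL;

  *n = p->size[p->cursor];
  return p->data[p->cursor++];
}

/* Frees all blocks. Call this only after the scan has completed. */
void pool_release(block_pool* p)
{
  size_t i;

  for (i = 0; i < p->count; i++)
    free(p->data[i]);

  p->count = 0;
  p->cursor = 0;
}
```

The trade-off is memory: the pool grows with the stream until the scan completes, which is the price of being able to serve a later pass without ever reporting a block as not ready.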

mavam commented 8 months ago

Thanks, that helped a lot. I summarized this here again visually:

image

In my tests, I found that I can write a rule that spans multiple blocks. In the above rule match spanning 3 blocks, I will get the match result only after next() for the third block returns ERROR_SUCCESS. So far so good. Now that I understand that the scanner can perform multiple passes, I wonder the following: is it safe to assume that a call to first() is always followed by zero or more calls to next()?

Let's expand the scenario to this:

image

Here, the match is in the first two blocks, but I get a bunch more blocks afterwards. Is it possible to perform a cumulative scan to yield the match as soon as possible? In this case, already at block 1? AFAICT, if I simply return ERROR_SUCCESS with every block, this won't work because then the blocks are not considered a contiguous sequence anymore.

plusvic commented 8 months ago

In general, YARA doesn't guarantee that it will find matches that span multiple blocks. Such matches can happen in certain cases, for example with hex patterns that contain very large jumps, like {01 02 03 04 [-] 05 06 07 08}. This is because such patterns are actually handled as independent ones: here {01 02 03 04} and {05 06 07 08} are searched independently, and YARA checks that the former appears before the latter once both are found. But for most patterns, YARA won't match them if they cross a block boundary.
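The difference between the two cases can be illustrated with a toy model. This is not YARA's matching engine: find_from, match_whole_per_block, and match_split_per_block are hypothetical helpers that use a naive byte search. They only mirror the idea that a contiguous pattern is looked up inside one block at a time, while a pattern with an unbounded jump is treated as two sub-patterns that merely have to appear in order.

```c
#include <stddef.h>
#include <string.h>

/* Offset of the first occurrence of pat in buf at or after `from`,
   or -1 if there is none. */
static long find_from(const unsigned char* buf, size_t n, size_t from,
                      const unsigned char* pat, size_t m)
{
  size_t i;

  if (m == 0)
    return -1;

  for (i = from; i + m <= n; i++)
    if (memcmp(buf + i, pat, m) == 0)
      return (long) i;

  return -1;
}

/* Contiguous pattern: each block is searched on its own, so an
   occurrence that straddles a block boundary is never seen. */
int match_whole_per_block(const unsigned char* const* blocks,
                          const size_t* sizes, size_t count,
                          const unsigned char* pat, size_t m)
{
  size_t b;

  for (b = 0; b < count; b++)
    if (find_from(blocks[b], sizes[b], 0, pat, m) >= 0)
      return 1;

  return 0;
}

/* Pattern with an unbounded jump, modelled as head + tail: succeeds if
   the head is found and the tail starts anywhere after the head ends,
   including in a later block. */
int match_split_per_block(const unsigned char* const* blocks,
                          const size_t* sizes, size_t count,
                          const unsigned char* head, size_t hm,
                          const unsigned char* tail, size_t tm)
{
  int head_seen = 0;
  size_t b;

  for (b = 0; b < count; b++) {
    size_t from = 0;

    if (!head_seen) {
      long at = find_from(blocks[b], sizes[b], 0, head, hm);
      if (at < 0)
        continue;
      head_seen = 1;
      from = (size_t) at + hm; /* tail must start after the head */
    }

    if (find_from(blocks[b], sizes[b], from, tail, tm) >= 0)
      return 1;
  }

  return 0;
}
```

With block 0 holding 01 02 03 04 and block 1 holding 05 06 07 08, the eight-byte contiguous pattern is missed by match_whole_per_block, whereas match_split_per_block finds the head in block 0 and the tail in block 1.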

mavam commented 8 months ago

But for most patterns YARA won't match them if they cross a block boundary.

So then my assumption was wrong. It sounds like the only safe way to avoid false negatives would be to merge all blocks before handing them to a scanner and then operate on that single block. (For file-based input, memory-mapping would have the same effect.)
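Merging could look roughly like the sketch below, assuming all blocks of the stream are available at once and fit in memory. merge_blocks is a hypothetical helper, not a YARA function. The resulting contiguous buffer could then be handed to a single-buffer scan call such as yr_scanner_scan_mem, which is not shown here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Concatenates `count` blocks into one newly allocated buffer and
   stores the total length in *out_size. Returns NULL if the input is
   empty, the total size would overflow, or allocation fails. The
   caller owns the result and must free() it. */
unsigned char* merge_blocks(const unsigned char* const* blocks,
                            const size_t* sizes, size_t count,
                            size_t* out_size)
{
  unsigned char* merged;
  size_t total = 0;
  size_t offset = 0;
  size_t i;

  for (i = 0; i < count; i++) {
    if (sizes[i] > SIZE_MAX - total)
      return NULL; /* total size would overflow size_t */
    total += sizes[i];
  }

  if (total == 0)
    return NULL;

  merged = malloc(total);
  if (merged == NULL)
    return NULL;

  for (i = 0; i < count; i++) {
    if (sizes[i] > 0) {
      memcpy(merged + offset, blocks[i], sizes[i]);
      offset += sizes[i];
    }
  }

  *out_size = total;
  return merged;
}
```

The cost is one extra copy of the data plus peak memory equal to the full input, so for large inputs that live in a file, memory-mapping as mentioned above avoids the copy.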

Feel free to close this if you think there's nothing to add. I hope this analysis will be useful for future reference.