Velocidex / go-ntfs

An NTFS file parser in Go
Apache License 2.0

Inode notation insufficient to uniquely identify data streams #78

Closed ydkhatri closed 1 year ago

ydkhatri commented 1 year ago

The current paradigm of accessing data streams via their attribute IDs is incorrect and results in wrong data being fetched in certain scenarios. The library relies on the inode notation (MftEntry-AttributeType-AttributeID) to uniquely identify data streams, e.g. 38-128-3.
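For readers unfamiliar with the notation, here is a minimal sketch of how such a string breaks down into its three numeric parts. The `InodeRef` type and `parseInode` function are hypothetical names used for illustration only; they are not taken from the go-ntfs API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// InodeRef is a hypothetical holder for the three parts of the notation.
// It is not a type from go-ntfs.
type InodeRef struct {
	MftEntry uint64
	AttrType uint64
	AttrID   uint64
}

// parseInode splits "entry-type-id" into its numeric components.
func parseInode(s string) (InodeRef, error) {
	parts := strings.Split(s, "-")
	if len(parts) != 3 {
		return InodeRef{}, fmt.Errorf("expected 3 components in %q, got %d", s, len(parts))
	}
	var nums [3]uint64
	for i, p := range parts {
		n, err := strconv.ParseUint(p, 10, 64)
		if err != nil {
			return InodeRef{}, fmt.Errorf("component %d of %q: %w", i, s, err)
		}
		nums[i] = n
	}
	return InodeRef{MftEntry: nums[0], AttrType: nums[1], AttrID: nums[2]}, nil
}

func main() {
	ref, err := parseInode("38-128-3")
	if err != nil {
		panic(err)
	}
	// 128 (0x80) is the NTFS attribute type code for $DATA.
	fmt.Printf("entry=%d type=%d id=%d\n", ref.MftEntry, ref.AttrType, ref.AttrID)
}
```

Nothing in these three numbers records which MFT entry physically holds the attribute, or what the stream is called.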

However, in complex MFT entries where an attribute belonging to the base MFT entry is stored in another entry, the attribute ID is no longer unique: it is likely to be zero in the second entry (while in the base entry, ID 0 would always be STDINFO). If there is more than one stream, there is no way to distinguish them. This scenario is commonly seen in large fragmented files, most commonly $USNJRNL:$J.

This is not easily noticed as it only occurs if the file is very large and highly fragmented like the $MFT and $USNJRNL:$J. The current implementation partially works, but gives incorrect results that are otherwise difficult to validate without a lot of advanced testing (creation of large and fragmented files!).

The attached disk image charlie.zip demonstrates this problem via a handcrafted MFT entry that simulates the same fragmented behaviour by adding an ATTRIBUTE_LIST and storing some $DATA attributes in other MFT entries. The file /Nine.txt has the following streams:

| Filename | Info |
|---|---|
| Nine.txt | Default $DATA stream, Non-resident |
| Nine.txt:111 | Non-resident Alternate $DATA stream |
| Nine.txt:222 | Resident Alternate $DATA stream |
| Nine.txt:333 | Non-resident Alternate $DATA stream |

The picture below shows the MFT layout of this file.

[Image: MFT layout of Nine.txt]

It is not possible to access the streams 111 and 333 via the inode notation. To access alternate data streams, we need to address them by name. In NTFS, no other unique identifier exists.
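The collision can be sketched with a small lookup table. The entry numbers and attribute IDs below are made up for illustration (the real values are in the disk image, not reproduced here), and the `attr`, `idKey`, and `nameKey` types are hypothetical, not go-ntfs types. The point is only that keying on (type, ID) loses a stream when two extension entries both use ID 0, while keying on (type, name) does not.

```go
package main

import "fmt"

// attr is a hypothetical, simplified view of one attribute of a file.
type attr struct {
	OwnerEntry int    // MFT entry that physically stores the attribute
	Type       int    // 128 = $DATA
	ID         int    // attribute ID as recorded in the owning entry
	Name       string // stream name, "" for the default stream
}

type idKey struct{ Type, ID int }

type nameKey struct {
	Type int
	Name string
}

// indexByID mimics a lookup keyed on type and attribute ID only.
func indexByID(attrs []attr) map[idKey]attr {
	m := make(map[idKey]attr)
	for _, a := range attrs {
		m[idKey{a.Type, a.ID}] = a // a later duplicate silently overwrites
	}
	return m
}

// indexByName keys on type and stream name instead.
func indexByName(attrs []attr) map[nameKey]attr {
	m := make(map[nameKey]attr)
	for _, a := range attrs {
		m[nameKey{a.Type, a.Name}] = a
	}
	return m
}

func main() {
	// Assumed layout: two streams in the base entry, two in extension entries.
	attrs := []attr{
		{OwnerEntry: 38, Type: 128, ID: 3, Name: ""},
		{OwnerEntry: 38, Type: 128, ID: 4, Name: "222"},
		{OwnerEntry: 40, Type: 128, ID: 0, Name: "111"},
		{OwnerEntry: 41, Type: 128, ID: 0, Name: "333"},
	}
	fmt.Println(len(indexByID(attrs)), len(indexByName(attrs))) // prints "3 4"
}
```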

Bug demo

Below is a listing of the contents and file sizes of the streams in Nine.txt.

| Filename | Size | Content |
|---|---|---|
| Nine.txt | 5000 | 9999999999999999...<snipped> |
| Nine.txt:111 | 5005 | "111111111111111111...<snipped> |
| Nine.txt:222 | 56 | "222222222222222...<snipped> |
| Nine.txt:333 | 6005 | "33333333333333...<snipped> |

To demonstrate the issue, we run the compiled exe against this image on the entry Nine.txt.

C:\go-ntfs>ntfs.exe cat "C:\temp\vr\tests\images\charlie_edited.dd" Nine.txt
"111111111111111111...<snipped>

scudette commented 1 year ago

Thanks for this detailed report! I identified the part in TSK that assigns an ID to the attribute:

https://github.com/sleuthkit/sleuthkit/blob/820b18589f1d86de6f33affd935cabe88b94580f/tsk/fs/ntfs.c#L1899

It looks like it just makes up an ID and stores it in a map to ensure the ID is unique. We could do the same thing to fix this issue.
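As a rough sketch of that idea (not a port of the TSK code, whose details are not reproduced here), an allocator could keep the on-disk ID when it is still free and otherwise hand out the next unused number, remembering the choice per stream. The names `streamKey`, `idAllocator`, and `idFor` are hypothetical, and the input IDs are assumed values.

```go
package main

import "fmt"

// streamKey identifies a stream by attribute type and name (hypothetical type).
type streamKey struct {
	Type uint32
	Name string
}

// idAllocator hands out IDs that are unique within one file.
type idAllocator struct {
	assigned map[streamKey]uint16 // stream -> ID already handed out
	used     map[uint16]bool      // IDs taken so far
	next     uint16               // next candidate for a made-up ID
}

func newIDAllocator() *idAllocator {
	return &idAllocator{
		assigned: make(map[streamKey]uint16),
		used:     make(map[uint16]bool),
	}
}

// idFor returns the on-disk ID if it is still free, otherwise a made-up one.
// Sketch only: it does not guard against exhausting all 65536 IDs.
func (a *idAllocator) idFor(k streamKey, onDisk uint16) uint16 {
	if id, ok := a.assigned[k]; ok {
		return id
	}
	id := onDisk
	for a.used[id] {
		id = a.next
		a.next++
	}
	a.used[id] = true
	a.assigned[k] = id
	return id
}

func main() {
	alloc := newIDAllocator()
	fmt.Println(
		alloc.idFor(streamKey{128, ""}, 3),
		alloc.idFor(streamKey{128, "222"}, 4),
		alloc.idFor(streamKey{128, "111"}, 0),
		alloc.idFor(streamKey{128, "333"}, 0), // collides with "111", gets a new ID
	) // prints "3 4 0 1"
}
```

The result depends on the order in which attributes are visited, so a stable traversal order would be needed for the IDs to be reproducible between runs.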

There are two options:

  1. expand the API as you did to include the stream name:

    • pro: The ID shown is what the disk actually says. This is more forensically sound: because the TSK ID can be randomly assigned, someone who goes back to a hex editor might be surprised.
    • con: More complex API. It also leaks into the VQL, because we now need to include the stream name in the inode description.
  2. Emulate the way TSK does it

    • pro: Keep a simpler API, maybe compatible with the randomly assigned IDs that TSK uses.
    • con: Since the attribute ID is randomly assigned, it is not consistent with the disk bytes, which can be surprising.

Unlike in TSK, the VQL "inode" notation is actually a free-form string, and I think we really only care about how the VQL interacts with the library. We have no external API stability requirement, so it may not be terrible to extend the API as needed, as long as we pass the "inode" string back with sufficient information to uniquely identify the stream.

In this PR we extend the API to include the stream name, so we could return an inode of the form 38-128-0-111 or 38-128-0-333. The problem with this approach is that we now have filename encoding issues in the inode string, e.g. 38-128-0-this is a long name. Maybe it is not a big deal?
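One way the extended string might be parsed, sketched under the assumption that the name is simply everything after the third dash: limiting the split to four parts keeps spaces and dashes inside the stream name intact. `streamRef` and `parseStreamRef` are hypothetical names, not the functions added by the PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// streamRef is a hypothetical parsed form of "entry-type-id[-name]".
type streamRef struct {
	MftEntry, AttrType, AttrID uint64
	Name                       string
}

func parseStreamRef(s string) (streamRef, error) {
	// Split into at most 4 parts so dashes inside the name are preserved.
	parts := strings.SplitN(s, "-", 4)
	if len(parts) < 3 {
		return streamRef{}, fmt.Errorf("expected at least 3 components in %q", s)
	}
	var nums [3]uint64
	for i := 0; i < 3; i++ {
		n, err := strconv.ParseUint(parts[i], 10, 64)
		if err != nil {
			return streamRef{}, fmt.Errorf("component %d of %q: %w", i, s, err)
		}
		nums[i] = n
	}
	ref := streamRef{MftEntry: nums[0], AttrType: nums[1], AttrID: nums[2]}
	if len(parts) == 4 {
		ref.Name = parts[3] // optional stream name, taken verbatim
	}
	return ref, nil
}

func main() {
	inputs := []string{
		"38-128-3",
		"38-128-0-111",
		"38-128-0-this is a long name",
		"38-128-0-a-b",
	}
	for _, s := range inputs {
		ref, err := parseStreamRef(s)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q -> name %q\n", s, ref.Name)
	}
}
```

Because the name is always the last component, it needs no escaping for this parser, though other consumers of the string may still care about its encoding.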

Alternatively, we can do what TSK does and come up with a constant identifier (I would tend to use the stream offset or count rather than a randomly assigned number), so something like 38-128-0-345, making it clear that it is a different ID 0 stream from 38-128-0-232, for example.