Velocidex / go-ntfs

An NTFS file parser in Go
Apache License 2.0

Add Attribute name to reference ADS explicitly #79

Closed ydkhatri closed 1 year ago

ydkhatri commented 1 year ago

This should fix #78 (problems accessing ADS correctly). The main change is in the function OpenStream(..): it now accepts an attr_name argument in addition to the attribute type and id, so we can explicitly state which stream name to focus on.

This adds the ability to target a specific ADS by name in the upper-level functions that use OpenStream. It also fixes another bug that caused issues when reading $USNJRNL:$J and other SPARSE files.

scudette commented 1 year ago

Thanks for the thorough testing and analysis of this bug.

My understanding is that the issue is that we identify the id of the stream based on the extended ID, but really we need to identify it based on the reconstructed id, right?

In your test image I can open the 3 file with icat using icat charlie.dd 38-128-12, so this looks like how the inode is interpreted in TSK. Could we interpret it the same way too?

I don't know if extending the API to include the stream name is going to work, because we also need to open other streams which are not $DATA, for example $EA and others. Do they always have a $NAME field?

ydkhatri commented 1 year ago

I have not researched $EA, so I'm not sure how they work or whether they have similar issues to ADS. Also, I'd have to study TSK to see how it implements the inode notation.

ydkhatri commented 1 year ago

That last commit is a slightly different bug fix, related to the parsing of SPARSE streams, which led to wrong data (usually less data) being pulled back from $USNJRNL-like streams.
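For readers unfamiliar with the SPARSE case mentioned above: NTFS stores a non-resident stream as a list of data runs, and a sparse run is one whose header declares a zero-size offset field. Such a run has no clusters on disk but still occupies VCNs, so dropping it shortens the data that comes back. The following is a minimal sketch of decoding a run list under that rule. It is not the go-ntfs implementation and does not claim to show where the actual bug was; `Run`, `readLE` and `decodeRuns` are hypothetical names, and the bytes in `main` are made-up sample data.

```go
package main

import "fmt"

// Run is a hypothetical type for this sketch, not a go-ntfs type.
type Run struct {
	StartLCN int64 // absolute starting cluster on disk; unused when Sparse
	Length   int64 // length in clusters
	Sparse   bool  // true when the run has no on-disk clusters (reads as zeros)
}

// readLE reads a little-endian integer of len(b) bytes, sign-extending
// the result when signed is true.
func readLE(b []byte, signed bool) int64 {
	var v int64
	for i := len(b) - 1; i >= 0; i-- {
		v = v<<8 | int64(b[i])
	}
	if signed && len(b) > 0 && len(b) < 8 && b[len(b)-1]&0x80 != 0 {
		v -= int64(1) << (8 * uint(len(b)))
	}
	return v
}

// decodeRuns walks a run list until the 0x00 terminator. Each run starts
// with a header byte: low nibble = size of the length field, high nibble =
// size of the offset field. An offset size of zero marks a sparse run.
func decodeRuns(data []byte) ([]Run, error) {
	var runs []Run
	var lcn int64 // offsets are deltas from the previous non-sparse run
	for pos := 0; pos < len(data) && data[pos] != 0; {
		lenSize := int(data[pos] & 0x0f)
		offSize := int(data[pos] >> 4)
		pos++
		if lenSize == 0 || lenSize > 8 || offSize > 8 || pos+lenSize+offSize > len(data) {
			return nil, fmt.Errorf("malformed run header at byte %d", pos-1)
		}
		length := readLE(data[pos:pos+lenSize], false)
		pos += lenSize
		if offSize == 0 {
			// Sparse run: keep it, because it still advances the VCN.
			runs = append(runs, Run{Length: length, Sparse: true})
			continue
		}
		lcn += readLE(data[pos:pos+offSize], true)
		pos += offSize
		runs = append(runs, Run{StartLCN: lcn, Length: length})
	}
	return runs, nil
}

func main() {
	// Made-up run list: a normal run, a sparse run, then a run with a
	// negative relative offset.
	data := []byte{0x21, 0x18, 0x34, 0x56, 0x01, 0x10, 0x11, 0x08, 0xF0, 0x00}
	runs, err := decodeRuns(data)
	if err != nil {
		panic(err)
	}
	var total int64
	for _, r := range runs {
		total += r.Length
		fmt.Printf("%+v\n", r)
	}
	fmt.Println("total clusters including sparse:", total)
}
```

A reader built on top of such a list would return zeros for the sparse ranges rather than skipping them, so that file offsets after a sparse region stay correct.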

scudette commented 1 year ago

I updated the code and added the test for the image you sent. There is still something that is not quite right:

I usually see an ID of 0 in data streams which are split across multiple VCNs, and in your image the two ADS have an id of 0. Usually, smaller files that are not split into multiple VCNs do not have an id of 0. But when the stream becomes larger, it gets split into multiple streams with the same mft id, type and id = 0, and probably also the same ADS name. So this is still not unique and we probably need a better algorithm to combine those.

The difficulty is that we also use a required_id of 0 to mean the first available stream (for example, when we just cat by the mft id the attribute ID is assumed to be 0). In that case we generally want to read the first non-ADS stream. But in this image the first data stream has an id of 0 and an ADS name of 111.

So in this image, if we do a cat on inode "38" we will get 38-128-0:111 and not 38-128-3 (which is the first stream without an ADS name). This might be an acceptable confusion if it is unlikely that an ADS stream will have an id of 0 (or if the ADS is very fragmented). In that case the user will need to specify the full inode (including the stream name) to get the correct stream.
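The inode strings used in this thread ("38", "38-128-12", "38-128-0:111") follow the pattern mft[-type[-id]][:name]. A minimal sketch of parsing that notation is shown here. `InodeSpec` and `parseInode` are hypothetical names, not go-ntfs API, and the sketch records which parts were present instead of choosing a sentinel value for "not given", since the right sentinel is exactly what is being debated.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// InodeSpec is a hypothetical parsed form of "mft[-type[-id]][:name]".
type InodeSpec struct {
	MFT     uint64
	Type    uint64
	ID      uint64
	Name    string
	HasType bool // false when the caller gave only the MFT id
	HasID   bool // false means "any attribute id"
	HasName bool // false means no ":name" suffix was given
}

func parseInode(s string) (InodeSpec, error) {
	var spec InodeSpec
	// Split off the optional ":name" suffix first, because a stream
	// name may itself contain '-'.
	if head, name, found := strings.Cut(s, ":"); found {
		s, spec.Name, spec.HasName = head, name, true
	}
	parts := strings.Split(s, "-")
	if len(parts) > 3 {
		return spec, fmt.Errorf("too many components in %q", s)
	}
	fields := []*uint64{&spec.MFT, &spec.Type, &spec.ID}
	for i, p := range parts {
		v, err := strconv.ParseUint(p, 10, 64)
		if err != nil {
			return spec, fmt.Errorf("bad component %q: %w", p, err)
		}
		*fields[i] = v
	}
	spec.HasType = len(parts) >= 2
	spec.HasID = len(parts) == 3
	return spec, nil
}

func main() {
	for _, s := range []string{"38", "38-128-12", "38-128-0:111"} {
		spec, err := parseInode(s)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-14s -> %+v\n", s, spec)
	}
}
```

With explicit HasID and HasName flags, "38-128-0:111" (id 0 requested) and "38" (no id requested) stay distinguishable, which is the ambiguity described above.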

How did you generate the sample image? Is it reflective of real-world images?

Maybe we need to change the meaning of the "default attr ID" from 0 to something else (say -1) if 0 is a real stream id?

ydkhatri commented 1 year ago

The MFT attribute id number is just a sequentially increasing number in NTFS, and has no bearing on anything else. The order in which streams are created can influence this, as can their shifting to new MFT entries. The id is not meant to represent unique data streams and hence trying to artificially make it do so creates issues. We can't make any assumptions about what the id can or would be.

As suggested by yourself, it's better to use a default id of -1 instead of 0. That would be the correct approach and clear all confusion. I'd prefer this.

Also, in my suggested fix, GetAllVCNs() in easy.go filters for the matching stream name when one is passed in, and otherwise filters out all named streams. So the behaviour is either to fetch VCNs for the named stream only or, if no name is specified, to get VCNs for the unnamed (single default) data stream only.
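The filtering behaviour described above can be sketched as follows. This is not the actual GetAllVCNs() code; `Attr` and `filterByName` are hypothetical, the comparison is an exact string match (whether names should be compared case-insensitively is left open), and the second ADS name in the sample data is invented for illustration.

```go
package main

import "fmt"

// Attr is a hypothetical, simplified view of one $DATA attribute record.
type Attr struct {
	ID       uint16
	Name     string // empty for the unnamed default stream
	StartVCN uint64
}

// filterByName keeps only the records belonging to the requested stream.
// An empty name selects the unnamed stream and therefore excludes every
// named stream (ADS).
func filterByName(attrs []Attr, name string) []Attr {
	var out []Attr
	for _, a := range attrs {
		if a.Name == name {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// Loosely modelled on inode 38 from the thread: the unnamed stream
	// has id 3 and the ADS records have id 0.
	attrs := []Attr{
		{ID: 0, Name: "111"},
		{ID: 3, Name: ""},
		{ID: 0, Name: "222"}, // invented second ADS name
	}
	fmt.Println("no name given:", filterByName(attrs, ""))
	fmt.Println("name 111:     ", filterByName(attrs, "111"))
}
```

Note that the attribute id plays no part in the filter, which matches the point that the id cannot be relied on to identify a stream.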

Edit: I just tested "cat 38" on my code and it correctly exhibits the above behaviour, printing out the correct data.

ydkhatri commented 1 year ago

Regarding the generated file, yes, it is indicative of real world entries, except as you've said, it only naturally happens for large files (allocation of attribute_lists). I've studied many such large fragmented files, then just created this entry based on that. This one is hand crafted to simulate the same on a very small file, but a perfectly valid one, and NTFS can read and write to it with no issues.

ydkhatri commented 1 year ago

But thanks for quickly looking into this. :)

scudette commented 1 year ago

This is the issue: if required_data_attr_name is not specified, it should return the stream without a name if there is such a stream, or otherwise the first named stream. This happens, for example, for 9-128-8 ($Secure:$SDS), which has no unnamed stream at all. So if the user asks to open inode 9, it should find that.

On the other hand, when the user asks to open inode 38 (meaning wildcard id and wildcard name), it should return 38-128-3 (the non-ADS stream).

scudette commented 1 year ago

OK, I think I have implemented a more complete algorithm. A stream ID of 0 is no longer considered special. The algorithm is as follows:

  1. First, select the attribute we want based on type, id and name (id and name can be wildcards).
  2. Once the first attribute is found, combine the other attributes with the same id into a VCN set.
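The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the code that was committed: `wildcardID`, `wildcardName`, `Attr`, `selectFirst` and `collect` are all hypothetical, the wildcard-name case prefers the unnamed stream and falls back to the first named one (the behaviour asked for earlier in the thread), and step 2 also requires a matching type and name, which goes slightly beyond "the same id".

```go
package main

import (
	"fmt"
	"sort"
)

const (
	wildcardID   = -1  // hypothetical sentinel: any attribute id
	wildcardName = "*" // hypothetical sentinel: any stream name
)

// Attr is a hypothetical, simplified attribute record.
type Attr struct {
	Type     uint64
	ID       int
	Name     string
	StartVCN uint64
}

// selectFirst is step 1: pick the attribute matching type, id and name.
// With a wildcard name it prefers the unnamed stream and otherwise
// returns the first named stream it saw.
func selectFirst(attrs []Attr, typ uint64, id int, name string) (Attr, bool) {
	var fallback *Attr
	for i := range attrs {
		a := &attrs[i]
		if a.Type != typ || (id != wildcardID && a.ID != id) {
			continue
		}
		if name != wildcardName {
			if a.Name == name {
				return *a, true
			}
			continue
		}
		if a.Name == "" {
			return *a, true
		}
		if fallback == nil {
			fallback = a
		}
	}
	if fallback != nil {
		return *fallback, true
	}
	return Attr{}, false
}

// collect is step 2: gather every record with the same type, id and name
// as the selected one, ordered by starting VCN.
func collect(attrs []Attr, first Attr) []Attr {
	var set []Attr
	for _, a := range attrs {
		if a.Type == first.Type && a.ID == first.ID && a.Name == first.Name {
			set = append(set, a)
		}
	}
	sort.Slice(set, func(i, j int) bool { return set[i].StartVCN < set[j].StartVCN })
	return set
}

func main() {
	// Loosely modelled on inode 38; the VCN values are invented.
	attrs := []Attr{
		{Type: 128, ID: 0, Name: "111", StartVCN: 0},
		{Type: 128, ID: 3, Name: "", StartVCN: 0},
		{Type: 128, ID: 0, Name: "111", StartVCN: 16},
	}
	if first, ok := selectFirst(attrs, 128, wildcardID, wildcardName); ok {
		fmt.Println("cat 38           ->", collect(attrs, first))
	}
	if first, ok := selectFirst(attrs, 128, 0, "111"); ok {
		fmt.Println("cat 38-128-0:111 ->", collect(attrs, first))
	}
}
```

Because the wildcard id is -1 rather than 0, a request for id 0 is an ordinary exact match and no longer collides with "first available stream".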

I will do more testing now and see how it goes

scudette commented 1 year ago

One of my other images, which I had forgotten about, has a fragmented MFT that looks like this:

$FILE_NAME Attribute Values:
Flags: Hidden, System
Name: $MFT
Parent MFT Entry: 5     Sequence: 5
Allocated Size: 16384       Actual Size: 16384
Created:    2022-08-21 13:59:17.434687300 (Pacific Daylight Time)
File Modified:  2022-08-21 13:59:17.434687300 (Pacific Daylight Time)
MFT Modified:   2022-08-21 13:59:17.434687300 (Pacific Daylight Time)
Accessed:   2022-08-21 13:59:17.434687300 (Pacific Daylight Time)

$ATTRIBUTE_LIST Attribute Values:
Type: 16-0  MFT Entry: 0    VCN: 0
Type: 48-3  MFT Entry: 0    VCN: 0
Type: 128-6     MFT Entry: 0    VCN: 0
Type: 128-0     MFT Entry: 15   VCN: 1604054
Type: 176-0     MFT Entry: 16   VCN: 0
Type: 176-0     MFT Entry: 17   VCN: 192

The weird thing about this one is that it starts with 5-128-6 for the first VCN but then goes to 5-128-0 for the second VCN (i.e. the second one has an id of 0 but is still part of the first VCN set). I think it might be more reliable to associate those using the VCN values instead of the id.
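Associating fragments by VCN rather than by id could look like the sketch below. It assumes each fragment exposes its first and last VCN and links a fragment to the chain when its first VCN is exactly one past the previous fragment's last VCN, ignoring the id entirely. `Fragment` and `chainByVCN` are hypothetical names. The start VCNs 0 and 1604054 come from the $ATTRIBUTE_LIST output above, 1604053 is inferred from them, and the final last VCN is invented.

```go
package main

import "fmt"

// Fragment is a hypothetical view of one non-resident attribute record.
type Fragment struct {
	Type     uint64
	ID       uint16
	Name     string
	FirstVCN uint64
	LastVCN  uint64
}

// chainByVCN starts at VCN 0 and repeatedly looks for the fragment of the
// given type and name whose FirstVCN continues the chain. The attribute
// id is deliberately not consulted.
func chainByVCN(frags []Fragment, typ uint64, name string) []Fragment {
	var chain []Fragment
	next := uint64(0)
	for {
		found := false
		for _, f := range frags {
			// Skip inverted or overflowing ranges so that next always
			// increases and the loop terminates.
			if f.LastVCN < f.FirstVCN || f.LastVCN == ^uint64(0) {
				continue
			}
			if f.Type == typ && f.Name == name && f.FirstVCN == next {
				chain = append(chain, f)
				next = f.LastVCN + 1
				found = true
				break
			}
		}
		if !found {
			return chain
		}
	}
}

func main() {
	frags := []Fragment{
		{Type: 128, ID: 6, FirstVCN: 0, LastVCN: 1604053},
		{Type: 128, ID: 0, FirstVCN: 1604054, LastVCN: 1604153}, // last VCN invented
	}
	for _, f := range chainByVCN(frags, 128, "") {
		fmt.Printf("id=%d VCN %d..%d\n", f.ID, f.FirstVCN, f.LastVCN)
	}
}
```

In this sketch the records with ids 6 and 0 end up in the same chain because their VCN ranges are contiguous, which is the association the listing suggests.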

ydkhatri commented 1 year ago

> This is the issue: if required_data_attr_name is not specified, it should return the stream without a name if there is such a stream, or otherwise the first named stream. This happens, for example, for 9-128-8 ($Secure:$SDS), which has no unnamed stream at all. So if the user asks to open inode 9, it should find that.
>
> On the other hand, when the user asks to open inode 38 (meaning wildcard id and wildcard name), it should return 38-128-3 (the non-ADS stream).

For this instance, I believe you should return nothing if no stream name is specified. So $Secure (i.e. just asking for inode 9) should return nothing, but $Secure:$SDS should return the associated stream.

I don't think there is a need to support returning an arbitrary (first encountered) stream when a file has no default unnamed stream. Named streams have a special purpose and the caller must specify which one they need. There is no benefit to returning an arbitrary stream, in my opinion. For example, one may ask for $USNJRNL and get either the $Max or the $J stream returned, depending on which stream happens to occur first. Users should be discouraged from making assumptions about which stream occurs first, as this is not always the same. I'd rather return no data if no unnamed stream is present (and no name was specified), which is more closely aligned with NTFS's default behaviour. This also keeps the implementation simpler and less error-prone.

ydkhatri commented 1 year ago

> One of my other images, which I had forgotten about, has a fragmented MFT that looks like this:
>
> $FILE_NAME Attribute Values:
> Flags: Hidden, System
> Name: $MFT
> Parent MFT Entry: 5     Sequence: 5
> Allocated Size: 16384       Actual Size: 16384
> Created:    2022-08-21 13:59:17.434687300 (Pacific Daylight Time)
> File Modified:  2022-08-21 13:59:17.434687300 (Pacific Daylight Time)
> MFT Modified:   2022-08-21 13:59:17.434687300 (Pacific Daylight Time)
> Accessed:   2022-08-21 13:59:17.434687300 (Pacific Daylight Time)
>
> $ATTRIBUTE_LIST Attribute Values:
> Type: 16-0  MFT Entry: 0    VCN: 0
> Type: 48-3  MFT Entry: 0    VCN: 0
> Type: 128-6     MFT Entry: 0    VCN: 0
> Type: 128-0     MFT Entry: 15   VCN: 1604054
> Type: 176-0     MFT Entry: 16   VCN: 0
> Type: 176-0     MFT Entry: 17   VCN: 192
>
> The weird thing about this one is that it starts with 5-128-6 for the first VCN but then goes to 5-128-0 for the second VCN (i.e. the second one has an id of 0 but is still part of the first VCN set). I think it might be more reliable to associate those using the VCN values instead of the id.

Yes, this is not surprising, and it is the point I was trying to make earlier. The ID (or its progression) is not something that can be used to reliably track a DATA stream.

scudette commented 1 year ago

Thanks. In the current iteration of the code the VCNs are matched exactly, so I think it's more reliable for this case.

Do you have some more images you can use for testing these edge cases?