VirusTotal / yara

The pattern matching swiss knife
https://virustotal.github.io/yara/
BSD 3-Clause "New" or "Revised" License
8.08k stars 1.43k forks source link

LNK Module #1957

Closed BitsOfBinary closed 1 year ago

BitsOfBinary commented 1 year ago

Reopening to fix tests from previous PR: https://github.com/VirusTotal/yara/pull/1732

BitsOfBinary commented 1 year ago

@plusvic I can see in the builds that it doesn't look like the bigendian one is actually compiling the LNK module. I wonder if that's because the bigendian tests only run on the main branch of YARA? https://github.com/VirusTotal/yara/blob/c41906bb13930137dd6f40ff8167faf5772ca452/.github/workflows/build.yml#L143C17-L143C17

BitsOfBinary commented 1 year ago

The following build shows that all tests now pass as expected for big endian: https://github.com/VirusTotal/yara/actions/runs/5967961527/job/16190739595?pr=1957

I have reverted the changes back from the build YAML, and expect all tests to pass as expected now. Re-requesting review from @plusvic , thank you for your patience!

plusvic commented 11 months ago

@BitsOfBinary while porting the LNK module to Rust I've found that the data exposed by this module is a bit cumbersome and hard to use. For instance, there are fields that expose the size of internal data structures that in most files should have the same value. That's the case of lnk.link_info.size and lnk.link_info.volume_id.size for example.

The same occurs with offset fields, which are useful while parsing the file for locating other structures and strings, but are probably not very relevant from the standpoint of a YARA rule creator. That's the case of:

..etc

Let's use lnk.link_info.volume_id.volume_label_offset as an example. This offset is not relative to the start of the file, it's relative to the start of the VolumeID structure, which can lead to confusion, as most users will think that volume_label_offset is an offset relative to the start of the scanned file. The same happens with all offset fields, as most of them are relative to the structure that contains them. That's not very useful when creating a YARA rule.

Moreover, some strings that may be very useful while creating YARA rules are exposed as an array of raw bytes, which often includes the null-terminator. For example, in the snippet below you can see that the value for the common_path_suffix field is "\x00". In reality this is an empty string, it's just that the null-terminator is included as part of the string, but it would be more user-friendly if this string would have the value "" instead of "\x00". Something similar happens with link_info.volume_id.data, which also have the value "\x00". This link_info.volume_id.data field is also hard to use. It contain raw data which may be either an ASCII or unicode string, and the data actually corresponds to the volume label, but that's not clear enough because of the data name.

    link_info
        common_path_suffix_unicode = YR_UNDEFINED
        local_base_path_unicode = YR_UNDEFINED
        common_path_suffix = "\x00"
        common_network_relative_link
            device_name_unicode = YR_UNDEFINED
            net_name_unicode = YR_UNDEFINED
            device_name = YR_UNDEFINED
            net_name = YR_UNDEFINED
            device_name_offset_unicode = YR_UNDEFINED
            net_name_offset_unicode = YR_UNDEFINED
            network_provider_type = YR_UNDEFINED
            device_name_offset = YR_UNDEFINED
            net_name_offset = YR_UNDEFINED
            flags = YR_UNDEFINED
            size = YR_UNDEFINED
        has_common_network_relative_link = YR_UNDEFINED
        local_base_path = "C:\test\a.txt"
        volume_id
            data = "\x00"
            volume_label_offset_unicode = YR_UNDEFINED
            volume_label_offset = 16
            drive_serial_number = 813337217
            drive_type = 3
            size = 17
        has_volume_id = 1
        common_path_suffix_offset_unicode = YR_UNDEFINED
        local_base_path_offset_unicode = YR_UNDEFINED
        common_path_suffix_offset = 59
        common_network_relative_link_offset = 0
        local_base_path_offset = 45
        volume_id_offset = 28
        flags = 1
        header_size = 28
        size = 60

Overall, my feeling is that this module is a low-level representation of a .lnk file, which exposes too many details that are not really useful for creating YARA rules, while the really relevant information is hard to access and work with. I like the idea of having a lnk module, but I think this requires some redesign and simplification before being published.

I would like to get some feedback from people who are already using the module or are interested in creating YARA rules for .lnk files.

plusvic commented 10 months ago

The deeper I delve into this matter, the more convinced I become that postponing the release of this module and waiting for the Rust implementation in YARA-X is the wiser choice. The intricacies of the LNK format far surpass initial expectations, presenting significant challenges for proper implementation in plain C. Notably, one of the complexities arises from the fact that nearly every string field in an LNK file can be presented in either ANSI or UTF-16 format. This has led to the creation of pairs of fields such as local_base_path and local_base_path_unicode, darwin_data_ansi and darwin_data_unicode, among others.

Both local_base_path and local_base_path_unicode essentially convey the same information, with the only difference being the encoding format. The local base path may exist either as an ANSI or Unicode string. Consequently, if someone wishes to create a rule that searches for a specific local base path, regardless of whether the LNK file uses Unicode or not, they would be compelled to construct a condition like this:

local_base_path == "foo" or
local_base_path_unicode == "foo"

To compound matters further, the value in local_base_path_unicode would be the raw representation of the UTF-16 string, leading to a condition more like this:

local_base_path == "foo" or
local_base_path_unicode == "f\x00o\x00o\x00"

If your condition is simply local_base_path == "foo", but the local base path is in Unicode format, your rule will fail to match. This intricate interplay between the two string variants creates a usability nightmare, making the module more challenging to use and cluttered with numerous fields.

To enhance the user-friendliness of the module, we should consider concealing this complexity from the end user by presenting a single local_base_path field that consistently contains a UTF-8 string representation. Regardless of whether the original data is in ANSI or Unicode format, this field will consistently offer the UTF-8 representation of the string. While achieving this in Rust, which has inherent support for handling UTF-16 strings, is relatively straightforward, doing so in plain C proves to be a more daunting task.

BitsOfBinary commented 10 months ago

Thank you for the detailed comments @plusvic . I agree on all the sentiments about wanting to make this module as user-friendly as possible.

From your above comments, there are several things that can be done to achieve this, which is to:

That last point is probably the most complex out of these objectives, as you highlight. But perhaps the next draft of the module that can satisfy the above points will make it still worthy of inclusion in YARA itself vs. only applying it to YARA-X.

I'm really glad you're doing a Rust implentation, as I'm convinced that a LNK module in YARA will be very beneficial (threat actors continue to use the file format, and imo the sooner we have a better way of parsing these files in YARA the better) - but I equally think you're right in making sure this module is as easy to use as possible.

I will follow whatever you suggest when it comes to postponing the module or not, but given the time gap in months between previous pre-releases and official releases, I think we can potentially still get this module in a good state before v4.4.0 is released if you agree.

plusvic commented 10 months ago

After carefully weighing the pros and cons, I have made the decision to revert this PR and exclude the lnk module from the upcoming 4.4.0 release. Allow me to summarize the key factors that have led me to this conclusion:

Rework Challenges: The lnk module necessitates significant revisions, as pointed out in previous comments. However, some of these changes are not straightforward within the C codebase and may demand a substantial amount of effort, particularly with regard to handling UTF-16 strings correctly.

Compatibility Concerns: Maintaining compatibility between the Rust and C versions is anticipated to be a considerable challenge. This challenge extends beyond situations where an LNK file is successfully parsed; it also applies when parsing fails due to malformed files. Ideally, we would extract as much information as possible from the LNK file while bypassing corrupted structures. Nevertheless, ensuring that both versions behave consistently in such scenarios is exceptionally difficult and may not warrant the effort, potentially introducing some level of incompatibility between the two implementations.

Adoption Considerations: Unlike other existing modules, this particular module has not gained significant adoption. Therefore, by postponing its publication and releasing only the Rust version, we can prevent potential compatibility issues in the future.

Parsing Complexity: Parsing binary formats, especially in C, is a complex task. Based on my experience, the initial implementation almost always contains bugs that can cause YARA to crash when parsing corrupted files. If we were to publish the C version of the lnk module, we would commit to maintaining it in the future. However, I am hesitant to make that commitment at this time, as my primary focus is on the Rust implementation.

I want to express that this decision was not made lightly. I genuinely appreciate the efforts you have invested in implementing the lnk module. Your work has not gone unnoticed, and it has played a crucial role in motivating me to port the module to Rust. Also, reverting the decision of not publishing the lnk module is going to be easier that reverting the decision of publishing it once it starts being used.

Thank you for your understanding, and please know that your contributions are valued and you can keep contributing with the Rust version of the module if you wish.