bergercookie / asm-lsp

Language server for NASM/GAS/GO Assembly
https://crates.io/crates/asm-lsp
BSD 2-Clause "Simplified" License
272 stars 18 forks source link

get_str and get_string Are Unsound Due to Unchecked UTF-8 Operations #187

Closed lwz23 closed 1 day ago

lwz23 commented 2 days ago

Description The provided get_str and get_string functions use unsafe blocks to convert byte slices (&[u8]) and vectors (Vec) into Rust strings without validating that the input is valid UTF-8. While the documentation states that performance is critical and inputs are "mostly ASCII," this assumption does not guarantee safety. As a result, these functions are unsound because they can invoke undefined behavior (UB) if the input contains invalid UTF-8. https://github.com/bergercookie/asm-lsp/blob/4b8bc91aedf7498980aea4667a5c70c5d9f81f83/asm-lsp/ustr.rs#L12

#[must_use]
pub const fn get_str(v: &[u8]) -> &str {
    unsafe { str::from_utf8_unchecked(v) }
}

#[must_use]
pub fn get_string(v: Vec<u8>) -> String {
    unsafe { String::from_utf8_unchecked(v) }
}

Problems: Unchecked Assumptions:

Both str::from_utf8_unchecked and String::from_utf8_unchecked require that the input is valid UTF-8. If this requirement is violated, the program may invoke undefined behavior. The assumption that inputs are "mostly ASCII" is not sufficient to ensure safety. Even a single invalid UTF-8 byte can cause UB. Undefined Behavior:

Passing invalid UTF-8 to these functions leads to undefined behavior. This can result in data corruption, crashes, or security vulnerabilities, especially if inputs come from external sources. Misleading Safety:

Declaring these functions as pub but not marking them as unsafe suggests that they are safe to use. This hides the requirement that the caller must ensure the input is valid UTF-8, which is a critical safety contract. Potential Exploitation:

These functions can easily be misused by other developers who might assume they are safe for any input, especially in environments where inputs are user-controlled. Suggested Fix: To align with Rust's safety guarantees while respecting the performance concerns mentioned in the documentation, the following changes are recommended:

Mark Functions as unsafe: Add the unsafe keyword to both get_str and get_string to make it explicit that these functions rely on the caller to uphold safety invariants (i.e., ensuring the input is valid UTF-8).

/// Converts a byte slice to a string slice without checking for valid UTF-8.
///
/// # Safety
/// The caller must ensure that the input slice is valid UTF-8. Passing invalid UTF-8
/// to this function invokes undefined behavior.
#[must_use]
pub const unsafe fn get_str(v: &[u8]) -> &str {
    str::from_utf8_unchecked(v)
}

/// Converts a byte vector to a string without checking for valid UTF-8.
///
/// # Safety
/// The caller must ensure that the input vector is valid UTF-8. Passing invalid UTF-8
/// to this function invokes undefined behavior.
#[must_use]
pub unsafe fn get_string(v: Vec<u8>) -> String {
    String::from_utf8_unchecked(v)
}

Additional Context: Using unsafe for performance-critical operations is valid, but it must be paired with clear and explicit contracts to ensure that users of the function understand and uphold the necessary invariants. Marking get_str and get_string as unsafe reflects the true nature of their requirements and helps prevent misuse, while still allowing for high performance in controlled environments. Here is the definition of soundness in Rust: Soundness is a type system concept (actually originating from the study of logics) and means that the type system is "correct" in the sense that well-typed programs actually have the desired properties. For Rust, this means well-typed programs cannot cause Undefined Behavior. This promise only extends to safe code however; for unsafe code, it is up to the programmer to uphold this contract.

Accordingly, we say that a library (or an individual function) is sound if it is impossible for safe code to cause Undefined Behavior using its public API. Conversely, the library/function is unsound if safe code can cause Undefined Behavior.

lwz23 commented 2 days ago

I understand that efficiency is important for these two functions, but I strongly recommend declaring them unsafe for security reasons :)

WillLillis commented 1 day ago

These functions are intended for internal use only. get_str is used only on a set of pre-validated input files, and get_string is used when interacting with output from the host operating system. I've removed their public status and added your suggested doc comments in https://github.com/bergercookie/asm-lsp/pull/188 so that no one mistakenly uses these functions. I don't think marking them as unsafe is necessary for internal-only use, though