BurntNail / Souris

A tiny `no_std` database that optimises for stored size at the cost of runtime performance.
MIT License

Add LZ77 Compression #3

Closed · BurntNail closed this 3 weeks ago

BurntNail commented 3 weeks ago


coderabbitai[bot] commented 3 weeks ago

[!WARNING]

Rate limit exceeded

@BurntNail has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 12 minutes and 12 seconds before requesting another review.

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered using the `@coderabbitai review` command as a PR comment. Alternatively, push new commits to this PR. We recommend that you space out your commits to avoid hitting the rate limit.
🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our [FAQ](https://coderabbit.ai/docs/faq) for further information.
📥 Commits

Reviewing files that changed from the base of the PR and between b003c75d00d966014a0f09ac234c55a51dbec970 and 97e7ec237c37fd7e86e98aa9e9303ca01a4e7a9e.
📝 Walkthrough
## Walkthrough

The pull request introduces several updates across various files in the `sourisdb` crate. Key changes include modifications to benchmarking functions, the introduction of new compression algorithms (Lempel-Ziv and Run-Length Encoding), and updates to serialization and deserialization methods for various types. The `Cargo.toml` file has been updated to include new dependencies and benchmarks. Additionally, improvements in error handling and documentation have been made, particularly for network address types. Overall, the changes focus on enhancing the clarity, modularity, and functionality of the codebase.

## Changes

| File Path | Change Summary |
|-----------|----------------|
| `crates/sourisdb/benches/utilities.rs` | Updated benchmarking functions to use `EXAMPLE_DATA`; removed the `rle_and_un_rle` function. |
| `crates/sourisdb/src/axum.rs` | Added a TODO comment for Huffman compression in the `IntoResponse` implementation for `Value`. |
| `crates/sourisdb/src/types/binary.rs` | Introduced `lz` and `rle` modules, updated the `BinaryCompression` enum, modified `BinaryData` methods for the new compression logic, removed the test module. |
| `crates/sourisdb/src/types/binary/lz.rs` | Implemented Lempel-Ziv compression and decompression functions; added a test module. |
| `crates/sourisdb/src/types/binary/rle.rs` | Implemented Run-Length Encoding compression and decompression functions; added a test module. |
| `crates/sourisdb/src/types/integer.rs` | Expanded test cases for `Integer` struct serialization/deserialization. |
| `crates/sourisd/src/v1_routes/state.rs` | Refactored `SourisState` methods for clarity, added an `Entry` import, improved cache consistency logic. |
| `crates/sourisdb/Cargo.toml` | Added a new benchmark for compression and a new dependency, `lz4_flex`. |
| `crates/sourisdb/benches/compression.rs` | Introduced benchmarks for RLE and LZ serialization/deserialization using Criterion. |
| `crates/sourisdb/benches/database.rs` | Updated benchmarks to transition from binary to JSON-based serialization/deserialization for `Store`. |
| `crates/sourisdb/src/store.rs` | Organised imports; minor adjustments to `ser` and `deser` methods; no functional changes. |
| `crates/sourisdb/src/utilities/bits.rs` | Minor whitespace changes in documentation comments for the `deser` method; no functional changes. |
| `crates/sourisdb/src/values.rs` | Enhanced `convert_to_json` and `deser` methods to support `Ipv4Addr` and `Ipv6Addr`; improved error handling. |
| `crates/sourisdb/proptest-regressions/types/binary/lz.txt` | Added a new text file for Proptest failure-case seeds. |
| `crates/sourisdb/src/utilities/huffman.rs` | Refactored the `encode` method for improved readability; no changes to the `decode` method. |

## Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Benchmark
    participant Compression
    participant Store
    User->>Benchmark: Run Compression Benchmark
    Benchmark->>Compression: Call RLE Serialization
    Compression-->>Benchmark: Return RLE Serialized Data
    Benchmark->>Compression: Call LZ Serialization
    Compression-->>Benchmark: Return LZ Serialized Data
    Benchmark->>Store: Deserialize RLE Data
    Store-->>Benchmark: Return RLE Object
    Benchmark->>Store: Deserialize LZ Data
    Store-->>Benchmark: Return LZ Object
    Benchmark-->>User: Return Benchmark Results
```
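The walkthrough mentions new `rle` and `lz` modules under `types/binary`. As a rough illustration of what a byte-oriented run-length codec looks like in general (a hypothetical sketch, not the actual `sourisdb` implementation, whose byte layout may differ):

```rust
// Hypothetical sketch of a byte-oriented run-length codec; the actual
// implementation in sourisdb's types/binary/rle.rs may use a different layout.
pub fn rle_compress(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut iter = bytes.iter().copied().peekable();
    while let Some(byte) = iter.next() {
        let mut run: u8 = 1;
        // Extend the run while the next byte matches, capping at u8::MAX.
        while run < u8::MAX && iter.peek() == Some(&byte) {
            iter.next();
            run += 1;
        }
        // Emit (count, byte) pairs.
        out.push(run);
        out.push(byte);
    }
    out
}

pub fn rle_decompress(bytes: &[u8]) -> Vec<u8> {
    // Each pair is (count, byte); any trailing odd byte is ignored.
    bytes
        .chunks_exact(2)
        .flat_map(|pair| core::iter::repeat(pair[1]).take(pair[0] as usize))
        .collect()
}
```

The round-trip invariant (`rle_decompress(&rle_compress(data)) == data`) is the kind of property the crate's test modules and Proptest seeds would exercise.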

BurntNail commented 3 weeks ago

@coderabbitai review

coderabbitai[bot] commented 3 weeks ago
:white_check_mark: Actions performed

Review triggered.

> Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
BurntNail commented 3 weeks ago

@coderabbitai - any thoughts on optimising the lz function?

coderabbitai[bot] commented 3 weeks ago

[!TIP] For best results, initiate chat on the files or code changes.

@BurntNail, after reviewing the lz function, I have some suggestions for optimizing it:

  1. Optimize the subslice_in_slice Function:

    • The current implementation of subslice_in_slice uses a naive approach with slice.windows(subslice.len()).position(|window| window == subslice), which has a time complexity of O(n*m) for each call, where n is the length of the slice and m is the length of the subslice.
    • Suggestion: Implement a more efficient search algorithm, such as Knuth-Morris-Pratt (KMP), or use a crate that provides a fast byte-substring search. Note that byte slices have no find method in the standard library, so the call below assumes a helper from a crate, e.g. memchr::memmem::find (the twoway crate offers a similar find_bytes):

    fn subslice_in_slice(subslice: &[u8], slice: &[u8]) -> Option<usize> {
        memchr::memmem::find(slice, subslice)
    }
  2. Use a Hash Map for Faster Lookups:

    • Maintain a hash map that stores sequences of bytes along with their positions within the sliding window. This allows for constant-time lookups when searching for matches.
    use std::collections::HashMap;
    
    let mut sequence_map: HashMap<&[u8], usize> = HashMap::new();
    // Update the map as you progress through the data
  3. Limit the Length of Matches:

    • Set a maximum match length to prevent the algorithm from spending too much time searching for very long matches, which have diminishing returns on compression ratio.
    const MAX_MATCH_LENGTH: usize = 258; // Commonly used in implementations like DEFLATE
  4. Refactor the Main Loop:

    • The logic within the for loop is quite complex, which can impact readability and performance.
    • Suggestion: Refactor the loop into smaller functions or use a while loop to have better control over the indices.
  5. Implement a More Efficient Data Structure:

    • Use a sliding window buffer with an efficient data structure like a trie or suffix tree for pattern searching.
    • Suggestion: Libraries like bstr or fst can be helpful for fast substring searching.
  6. Switch to LZSS Compression:

    • The comment in the code mentions using LZSS instead of LZ77. LZSS outputs literals and pointers, which can result in better compression and performance.
    • Suggestion: Modify the algorithm to output a literal when no adequate match is found, reducing the overhead of storing unnecessary tokens.
  7. Avoid Redundant Calls to subslice_in_slice:

    • Cache the results of previous searches when possible to prevent redundant computations.
  8. Optimize Memory Usage:

    • Ensure that you're not making unnecessary allocations or clones of data.
    • For example, avoid using to_vec() in output.extend(output[start..end].to_vec()); and instead use output.extend_from_slice(&output[start..end]);.
  9. Parallel Processing:

    • If dealing with large datasets, consider parallelizing the compression process using Rust's concurrency primitives or libraries like rayon.
  10. Benchmark and Profile:

    • Use benchmarking tools like cargo bench to identify bottlenecks.
    • Profile the code to see where most of the time is spent and focus optimization efforts there.
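The parallel-processing suggestion in point 9 can be sketched without external dependencies using `std::thread::scope`; `rayon`'s `par_chunks` would express the same idea more concisely. Here `compress_chunk` is a hypothetical stand-in for the real `lz` function, and note that compressing chunks independently sacrifices matches that span chunk boundaries:

```rust
use std::thread;

// Hypothetical stand-in for the real `lz` compression function.
fn compress_chunk(chunk: &[u8]) -> Vec<u8> {
    chunk.to_vec()
}

// Compress fixed-size chunks of the input on separate scoped threads.
// Chunk-local compression misses matches that cross chunk boundaries.
pub fn compress_parallel(bytes: &[u8], chunk_size: usize) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        let handles: Vec<_> = bytes
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || compress_chunk(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Whether this pays off depends on input size; for small payloads the thread-spawn overhead will dominate, which is exactly what the `cargo bench` suggestion in point 10 would reveal.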

Example Refactored Function:

Here's a simplified version of the lz function using some of the suggestions:

```rust
use std::collections::HashMap;

const MAX_MATCH_LENGTH: usize = 258;
const MIN_MATCH_LENGTH: usize = 3;

pub fn lz_optimized(bytes: &[u8]) -> Vec<u8> {
    let mut output = Vec::new();
    let mut pos = 0;
    // Maps each MIN_MATCH_LENGTH-byte prefix to the most recent position it occurred at.
    let mut window: HashMap<&[u8], usize> = HashMap::new();

    while pos < bytes.len() {
        let mut match_length = 0;
        let mut match_distance = 0;

        // Constant-time prefix lookup, then greedy extension of the match.
        if pos + MIN_MATCH_LENGTH <= bytes.len() {
            if let Some(&prev_pos) = window.get(&bytes[pos..pos + MIN_MATCH_LENGTH]) {
                let limit = MAX_MATCH_LENGTH.min(bytes.len() - pos);
                let mut length = MIN_MATCH_LENGTH;
                while length < limit && bytes[prev_pos + length] == bytes[pos + length] {
                    length += 1;
                }
                match_length = length;
                match_distance = pos - prev_pos;
            }
        }

        let consumed = if match_length >= MIN_MATCH_LENGTH {
            // Output a back-reference token: marker byte, then distance and length.
            output.push(0x01);
            output.extend_from_slice(&(match_distance as u32).to_le_bytes());
            output.extend_from_slice(&(match_length as u32).to_le_bytes());
            match_length
        } else {
            // Output a literal: marker byte, then the byte itself.
            output.push(0x00);
            output.push(bytes[pos]);
            1
        };

        // Record the prefix starting at each position we just consumed.
        for i in pos..pos + consumed {
            if i + MIN_MATCH_LENGTH <= bytes.len() {
                window.insert(&bytes[i..i + MIN_MATCH_LENGTH], i);
            }
        }
        pos += consumed;
    }

    output
}
```

Note: The above code is a rough example to illustrate the optimization concepts. It may require further adjustments to integrate with your existing codebase and ensure correctness.
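For reference, the naive `windows`-based search that point 1 criticises can be written in std-only Rust as follows; it is useful as a correctness baseline when swapping in a faster algorithm (the function name here is illustrative):

```rust
// Naive O(n*m) subslice search, the baseline that point 1 suggests replacing.
fn subslice_in_slice_naive(subslice: &[u8], slice: &[u8]) -> Option<usize> {
    if subslice.is_empty() {
        // windows(0) would panic; define the empty needle as matching at 0.
        return Some(0);
    }
    if subslice.len() > slice.len() {
        return None;
    }
    slice
        .windows(subslice.len())
        .position(|window| window == subslice)
}
```

Any faster replacement should agree with this function on every input, which makes it a natural oracle for a property-based test.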

Let me know if you have any questions or need further assistance with these optimizations!

🔗 Analysis chain
🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Display the contents of the lz.rs file
cat crates/sourisdb/src/types/binary/lz.rs
```

Length of output: 5675
BurntNail commented 3 weeks ago

@coderabbitai review

edit: slay this got ratelimited :)))))

coderabbitai[bot] commented 3 weeks ago
:white_check_mark: Actions performed

Review triggered.

> Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
BurntNail commented 3 weeks ago

@coderabbitai review

coderabbitai[bot] commented 3 weeks ago
:white_check_mark: Actions performed

Review triggered.

> Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
coderabbitai[bot] commented 3 weeks ago

@BurntNail, I've triggered a review of the changes.

:white_check_mark: Actions performed

Review triggered.

> Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
BurntNail commented 3 weeks ago

@coderabbitai summary

coderabbitai[bot] commented 3 weeks ago
:white_check_mark: Actions performed

Summary regeneration triggered.
BurntNail commented 3 weeks ago

@coderabbitai full review

coderabbitai[bot] commented 3 weeks ago
:white_check_mark: Actions performed

Full review triggered.