Bug when changing from platform tools v1.37 to v1.39

LucasSte commented 5 months ago

Problem

When building Openbook-V2 with the tools version v1.37, the following steps work correctly, but fail in platform tools version v1.39.

Steps to reproduce:

git clone https://github.com/openbook-dex/openbook-v2
cargo build-sbf --features enable-gpl
solana program deploy ./target/deply/openbook_v2.so
yarn run ts-node test.ts

import { Connection, Keypair, LAMPORTS_PER_SOL, PublicKey, Transaction } from '@solana/web3.js';
import * as openbook from '@openbook-dex/openbook-v2';
import { Program, BN, AnchorProvider, Wallet } from '@coral-xyz/anchor';
import * as splToken from "@solana/spl-token";

import * as fs from 'fs';

async function createMint(connection : Connection,  authority: Keypair, nb_decimals = 6) : Promise<PublicKey> {
    const kp = Keypair.generate();
    return await splToken.createMint(connection, 
        authority, 
        authority.publicKey, 
        authority.publicKey, 
        nb_decimals,
        kp)
}

export async function main() {

    const secretKey = JSON.parse(fs.readFileSync("/home/galactus/.config/solana/id.json", "utf-8"));
    const keypair = Keypair.fromSecretKey(new Uint8Array(secretKey));
    const authority = keypair;
    const payer = authority;
    const connection = new Connection("https://api.testnet.rpcpool.com/dfeb84a5-7fe8-4783-baf9-60cca0babbc7", "processed");

    let airdrop_sig = await connection.requestAirdrop(authority.publicKey, 2 * LAMPORTS_PER_SOL);
    await connection.confirmTransaction(airdrop_sig);

    let baseMint = await createMint(connection, authority, 6);
    let quoteMint = await createMint(connection, authority, 6);

    const quoteLotSize = new BN(1000000);
    const baseLotSize = new BN(1000000000);
    const makerFee = new BN(0);
    const takerFee = new BN(0);
    const timeExpiry = new BN(0);

    const wallet = new Wallet(authority);

    const programId = new PublicKey("AiqQtnazKRRUkn9enZ9SLUy35FS5aT38QrBET3qCiPqF");
    const provider = new AnchorProvider(connection, wallet, {});
    let client = new openbook.OpenBookV2Client( provider, programId);

    // Add your test here.
    const [[bidIx, askIx, eventHeapIx, ix], [market, bidsKeypair, askKeypair, eventHeapKeypair]] = await client.createMarketIx(
      authority.publicKey,
      "Market Name",
      quoteMint,
      baseMint,
      quoteLotSize,
      baseLotSize,
      makerFee,
      takerFee,
      timeExpiry,
      null, // oracleA
      null, // oracleB
      null, // openOrdersAdmin
      null, // consumeEventsAdmin
      null, // closeMarketAdmin
    );
    console.log("sending tx");

    let tx = new Transaction();
    tx.add(bidIx);
    tx.add(askIx);
    tx.add(eventHeapIx);
    tx.add(ix);
    tx.recentBlockhash = (await connection.getLatestBlockhash()).blockhash;
    // Send transaction
    let sig = await connection.sendTransaction(tx, [authority, market, bidsKeypair, askKeypair, eventHeapKeypair], {
        skipPreflight: false
    });
    console.log('Your transaction signature', sig);
    await connection.confirmTransaction(sig);

    console.log("Market initialized successfully");
    console.log("Market account:", market.publicKey.toBase58());
    console.log("Bids account:", bidsKeypair.publicKey.toBase58());
    console.log("Asks account:", askKeypair.publicKey.toBase58());
    console.log("Event heap account:", eventHeapKeypair.publicKey.toBase58());
    // console.log("Market authority:", market.authority.toBase58());
    console.log("Quote mint:", quoteMint.toBase58());
    console.log("Base mint:", baseMint.toBase58());
    console.log("Quote lot size:", quoteLotSize.toString());
    console.log("Base lot size:", baseLotSize.toString());
}

main().then(x => {
    console.log('finished sucessfully')
}).catch(e => {
    console.log('caught an error : ' + e)
})

Error in v1.39:

Program 9QJrVWzEaZBjao31iqBNaGqmXUNim7tmdb9kgczqGQXD failed: Access violation in unknown section at address 0x0 of size 8'

Proposed Solution

Investigating the problem

acheroncrypto commented 5 months ago

Anchor has a bunch of tests that fail after upgrading to 1.18 CLI, with the main difference coming from platform-tools v1.37 vs v1.39 (https://github.com/coral-xyz/anchor/pull/2795#issuecomment-1925670686).

The tests work as long as the program is built using an earlier version than v1.39, independent of solana-cli, test-validator or solana-program version used.

LucasSte commented 5 months ago

As an update for this issue, the pyth tests failures mentioned in https://github.com/coral-xyz/anchor/pull/2795#issuecomment-1925670686 by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in https://github.com/anza-xyz/rust/pull/90. I ran the test with the fix and everything turned out green.

The OpenBook-V2 failure has been consuming more time, as it is a large contract that also depends on Anchor's code generation. I discovered the anchor expand command to generate a single file with the Rust code passed to the compiler and I've been ridding it of the code portions that do not influence the error.

I haven't yet pinpointed the problem, but I suspect something has changed in Rust's data structures that interferes with function calls and stack variables.

godmodegalactus commented 5 months ago

I guess we can create a simpler example then if we could pinpoint where the issue comes from. Or we can try to test an anchor example.

acheroncrypto commented 5 months ago

As an update for this issue, the pyth tests failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.

Nice! Have you checked any of the other failures too?

The OpenBook-V2 failure has been consuming more time, as it is a large contract that also depends on Anchor's code generation. I discovered the anchor expand command to generate a single file with the Rust code passed to the compiler and I've been ridding it of the code portions that do not influence the error.

Here is a much shorter example that is likely related: https://beta.solpg.io/65cbb30bcffcf4b13384cf5b (run locally)

I haven't yet pinpointed the problem, but I suspect something has changed in Rust's data structures that interferes with function calls and stack variables.

I think we might be using more memory somehow. The behavior on the example I've shared is very weird too.

LucasSte commented 5 months ago

I've found the cause for the problem in OpenBook. Your example @acheroncrypto is likely to be hitting the same problem, as I simplified the OpenBook contract so much that it looked like the code you showed.

What is the problem?

SBFv1 functions have a limited frame size of 4096 bytes (4 kb), so using too many stack variables risks overwriting the frame of the caller function. In the OpenBook example, the anchor-generated function try_accounts (this one) deserializes instructions and accounts, and performs all accounts check, with heavy stack use. Such a function can get quite big when an instruction utilizes many accounts, as it is the case for OpenBook.

In the example, try_accounts is writing a value in frame of create_market, which had stored on its stack a pointer address. It reads a wrong pointer value from the stack and tries to access it, leading to a memory access violation, because the address it had stored in the stack now contains gibberish.

SBFv2 introduces dynamic stack frames, so this problem won't exist anymore once we migrate to the new runtime.

Why wasn't this a problem in v1.37?

Algoside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't had any modification, the LLVM target independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks down structs in its individual values, had an update and is breaking down structs in different places in the code, using more stack space than before.

In v1.37, try_accounts utilizes exactly 4096 bytes of the stack, so a couple more allocations were needed for us to break the code. These extra allocations come from the new SROA pass.

Any solution?

Although we can disable the SROA pass, such a measure won't make try_accounts impervious to future optimization changes or overflowing its frame in case a contract utilizes too many accounts. A suggestion would be to break down that method in smaller ones, decreasing stack usage.

LucasSte commented 5 months ago

As an update for this issue, the pyth tests failures mentioned in coral-xyz/anchor#2795 (comment) by @acheroncrypto are caused by the change in the minimum size for enums in Rust. I've fixed this bug in anza-xyz/rust#90. I ran the test with the fix and everything turned out green.

Nice! Have you checked any of the other failures too?

I had a look at ido-pool, but the problem I've found is the same one as the one in OpenBook. We'll back-port the enum size bug fix to v1.18. Please, @acheroncrypto let us know you need anything else to get your PR merged.

acheroncrypto commented 5 months ago

Algoside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't had any modification, the LLVM target independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks down structs in its individual values, had an update and is breaking down structs in different places in the code, using more stack space than before.

Thanks for the explanation! I'll note that this looks like a major regression for our case because not only does the 10 account example I've given work on v1.37, but you can also add many more accounts to the instruction until you hit the transaction size limit (1232 bytes). It doesn't run into any stack issues even with many more accounts used compared to v1.39.

Although we can disable the SROA pass, such a measure won't make try_accounts impervious to future optimization changes or overflowing its frame in case a contract utilizes too many accounts. A suggestion would be to break down that method in smaller ones, decreasing stack usage.

The issue is that we can fix these problems in our tests, but it's likely that many of the production programs will also hit this problem once they start using solana-cli 1.18.

I had a look at ido-pool, but the problem I've found is the same one as the one in OpenBook. We'll back-port the enum size bug fix to v1.18. Please, @acheroncrypto let us know you need anything else to get your PR merged.

Thanks, we'll first need a new release that has the fixes to get the PR merged.

We also have some token 2022 tests failing, which I haven't yet debugged, but they are most likely not related to platform-tools.

LucasSte commented 5 months ago

Algoside Rust updates, the LLVM backend is also updated. Although the SBF code generation hasn't had any modification, the LLVM target independent code generation is constantly updated. This time, the SROA (Scalar Replacement of Aggregates) pass, an optimization that breaks down structs in its individual values, had an update and is breaking down structs in different places in the code, using more stack space than before.

Thanks for the explanation! I'll note that this looks like a major regression for our case because not only does the 10 account example I've given work on v1.37, but you can also add many more accounts to the instruction until you hit the transaction size limit (1232 bytes). It doesn't run into any stack issues even with many more accounts used compared to v1.39.

Those regressions are a concern, but I have good news. I've tested platform-tools version v1.41 in this Openbook-v2 issue, in Anchor's tests tests/pyth and tests/ido-pool and the new LLVM version made everything work again. We'll back-port v1.41 to Solana v1.18.

acheroncrypto commented 5 months ago

Thanks @LucasSte! The memory issues we had are fixed in the 1.18.8 release.

LucasSte commented 5 months ago

Thanks @LucasSte! The memory issues we had are fixed in the 1.18.8 release.

Thanks for the feedback. Can we close this issue?

acheroncrypto commented 5 months ago

I think so, yes.

anza-xyz / agave

Bug when changing from platform tools v1.37 to v1.39 #252

Problem

Proposed Solution