algbio / ggcat

Compacted and colored de Bruijn graph construction and querying
MIT License

GGCAT API crash when building many graphs in memory from the same instance #40

Closed (tmaklin closed 6 months ago)

tmaklin commented 7 months ago

Link to files + code that reproduce the crash on my system (Fedora 39 Linux 6.6.13-200.fc39.x86_64) at the end.

Description

The GGCAT API appears to have a bug: building many (> 100) graphs from the same instance, initialized with prefer_memory: true, eventually causes a panic with the error message:

thread 'main' panicked at /home/temaklin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parallel-processor-0.1.13/src/memory_fs/file/internal.rs:248:26:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I looked into this further by making the panicking function (create_writing_underlying_file in parallel-processor) print the file it is attempting to access. The panic seems to be caused by the instance entering a state where it thinks it has run out of memory after building some of the graphs. The first graphs are built normally in memory (no temporary files are created), but after a while the building seems to switch entirely to disk. This eventually causes a crash with the error:

    create_writing_underlying_file: tmp/build_graph_95c7a77f-d9b1-4028-989f-f5676fdf4417/result.997
    create_writing_underlying_file: tmp/build_graph_95c7a77f-d9b1-4028-989f-f5676fdf4417/result.998
    create_writing_underlying_file: tmp/build_graph_95c7a77f-d9b1-4028-989f-f5676fdf4417/result.999
    create_writing_underlying_file: tmp/build_graph_02d33ca3-550c-42c5-b3f2-993a06afe332/maximal-links.207
thread 'main' panicked at /home/temaklin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parallel-processor-0.1.13/src/memory_fs/file/internal.rs:249:26:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

because the file tmp/build_graph_02d33ca3-550c-42c5-b3f2-993a06afe332/maximal-links.207 doesn't exist in the temporary directory.

I also tried calling run_assembler directly, but it results in the same crash, so the API layer itself doesn't seem to be the issue.
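The failure mode can be shown in isolation: creating a file under a directory that has already been removed returns a NotFound error (os error 2), and calling unwrap() on that result gives the panic above. The following is a minimal sketch using only the standard library. The helper name create_in_removed_dir and the scratch directory name are hypothetical; this is not the parallel-processor code.

```rust
use std::fs::{self, File};
use std::io;
use std::path::PathBuf;

/// Hypothetical helper: creates a scratch directory, removes it again, and
/// then tries to create a file inside it, returning the resulting error kind.
fn create_in_removed_dir(name: &str) -> io::ErrorKind {
    let dir: PathBuf = std::env::temp_dir().join(name);
    fs::create_dir_all(&dir).expect("could not create scratch directory");
    fs::remove_dir_all(&dir).expect("could not remove scratch directory");

    // The parent directory is gone, so the create call fails.
    match File::create(dir.join("maximal-links.207")) {
        Ok(_) => panic!("file creation unexpectedly succeeded"),
        Err(e) => e.kind(),
    }
}

fn main() {
    let kind = create_in_removed_dir("ggcat-issue-sketch");
    // Report the error kind instead of unwrapping, so the cause stays visible.
    println!("creating a file in a removed directory failed with: {:?}", kind);
}
```

Matching on the error instead of unwrapping it also makes it easy to include the offending path in the message, which is what the extra print in create_writing_underlying_file was added for.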

Code

use std::collections::HashMap;
use std::path::PathBuf;

fn build_pangenome_graph(input_seq_names: &[String], prefix: &str, instance: &ggcat_api::GGCATInstance) {
    println!("Building graph {} from {} sequences:", prefix, input_seq_names.len());
    input_seq_names.iter().for_each(|x| println!("\t{}", x));

    let graph_file = PathBuf::from(prefix);
    let ggcat_inputs: Vec<ggcat_api::GeneralSequenceBlockData> = input_seq_names
        .iter()
        .map(|x| ggcat_api::GeneralSequenceBlockData::FASTA((PathBuf::from(x), None)))
        .collect();

    instance.build_graph(
        ggcat_inputs,
        graph_file,
        Some(input_seq_names),
        51,
        4,
        false,
        None,
        false, // No colors
        1,
        ggcat_api::ExtraElaboration::GreedyMatchtigs,
    );
}

fn main() {
    // Read in the inputs
    let f = std::fs::File::open("clusters_morethanone.tsv").unwrap();
    let mut reader = csv::ReaderBuilder::new()
        .delimiter(b'\t')
        .has_headers(false)
        .from_reader(f);
    let mut seqs_to_clusters: HashMap<String, Vec<String>> = HashMap::new();
    for line in reader.records() {
        let record = line.unwrap();
        // Column 0 is the key later used as the graph prefix,
        // column 1 is one input sequence belonging to that key.
        let key = record[0].to_string();
        let val = record[1].to_string();

        // Append to the existing list, creating it on first use.
        seqs_to_clusters.entry(key).or_default().push(val);
    }

    let config = ggcat_api::GGCATConfig {
        temp_dir: Some(PathBuf::from("tmp")),
        memory: 2.0,
        prefer_memory: true,
        total_threads_count: 4,
        intermediate_compression_level: None,
        stats_file: None,
    };

    let instance = ggcat_api::GGCATInstance::create(config);

    // Build 170 graphs with > 1 genomes each
    seqs_to_clusters
        .iter()
        .for_each(|x| build_pangenome_graph(x.1, x.0, &instance));
}

Reproducing

Download the files from https://drive.google.com/file/d/11wj5h6D40zgQcncmCbNRhBT73HeFiAec/view?usp=sharing, then build and run with cargo build --release && target/release/ggcat-tmpfiles-crash.

Guilucand commented 6 months ago

Hi! I fixed the problem. It was caused by some files remaining in the memory cache after a build task had completed; after some time they were offloaded to disk, into a directory that had already been deleted. Now I remove the files associated with a task every time it finishes.
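A rough sketch of the idea behind the fix, not the actual change in parallel-processor: cached files are keyed by path, and when a task finishes, every entry under that task's directory is dropped so nothing can later be offloaded into a deleted directory. The MemoryCache type and the remove_task_files method are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Toy stand-in for an in-memory file cache (hypothetical, not the
/// parallel-processor implementation): path -> buffered bytes.
#[derive(Default)]
struct MemoryCache {
    files: HashMap<PathBuf, Vec<u8>>,
}

impl MemoryCache {
    /// Appends data to the cached file at `path`, creating it if needed.
    fn write(&mut self, path: impl Into<PathBuf>, data: &[u8]) {
        self.files.entry(path.into()).or_default().extend_from_slice(data);
    }

    /// Drops every cached file that lives under `task_dir`, so nothing can be
    /// offloaded later into a directory that no longer exists.
    /// Returns the number of entries removed.
    fn remove_task_files(&mut self, task_dir: &Path) -> usize {
        let before = self.files.len();
        self.files.retain(|path, _| !path.starts_with(task_dir));
        before - self.files.len()
    }
}

fn main() {
    let mut cache = MemoryCache::default();
    cache.write("tmp/build_graph_a/result.0", b"ACGT");
    cache.write("tmp/build_graph_a/maximal-links.0", b"ACGT");
    cache.write("tmp/build_graph_b/result.0", b"ACGT");

    // Task "a" finished: purge its entries before its directory is deleted.
    let removed = cache.remove_task_files(Path::new("tmp/build_graph_a"));
    println!("removed {} entries, {} left", removed, cache.files.len());
}
```

Path::starts_with compares whole path components, so tmp/build_graph_a does not accidentally match a sibling such as tmp/build_graph_ab.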

tmaklin commented 6 months ago

thanks!!