algbio / matchtigs

Minimum plain text representation of kmer sets
BSD 2-Clause "Simplified" License
17 stars 5 forks source link

`std:bad_alloc`: How much memory do I need to compute matchtigs? #5

Closed enricorox closed 1 year ago

enricorox commented 1 year ago

Hi! I'm having trouble processing a bcalm file with 12,585,517 nodes and 14,600,062 arcs (sequence SRR061958_1, 21-mer minimum count 1). I don't have enough RAM even if I've allocated 330GB for this job. How can I estimate how much memory I need?

The command and output are below

$ matchtigs --k 21 -t 16 --bcalm-in SRR061958_1.fasta.unitigs.fa --matchtigs-fa-out SRR061958_1.fasta.matchtigs.fa 10:31:30 [INFO] Logging initialised successfully 10:31:30 [INFO] Reading bcalm2 fa as edge centric bigraph with k = 21 from "SRR061958_1.fasta-nc.unitigs.fa" 10:31:52 [INFO] Loading took 21.8 seconds 10:31:52 [INFO] The sequences take a total 65MiB of memory 10:31:52 [INFO] k = 21 10:31:52 [INFO] Graph has 12585517 nodes and 14600062 edges 10:31:52 [INFO] Computing edge weights for shortest path queries 10:31:52 [INFO] Computing matchtigs 10:31:52 [INFO] Collecting nodes with missing incoming or outgoing edges 10:31:53 [INFO] Found 5999952 nodes with missing outgoing edges 10:31:53 [INFO] Found 5999952 nodes with missing incoming edges 10:31:53 [INFO] Of those there are 412 self-mirrors 10:31:53 [INFO] Computing shortest paths between nodes with missing outgoing and nodes with missing incoming edges 10:31:53 [INFO] Using 16 threads to run ~5999952 dijkstras 10:31:54 [INFO] Creating dijkstra threads 10:31:54 [INFO] Waiting for dijkstra threads to finish 10:32:34 [INFO] Found 776546395 shortest paths 10:37:19 [INFO] Took 326.082641s for computing paths and getting edges, of this 40.782990s are from dijkstra 10:37:19 [INFO] Found 3107725 nodes and 727576937 edges 10:37:19 [INFO] Matching problem contains 179 edges that originate from 1385 mirror biedges 10:37:19 [INFO] Computing WCCs of graph 10:37:23 [INFO] Filtering irrelevant WCCs 10:37:23 [INFO] Found 14098 relevant WCCs 10:37:23 [INFO] Assigning extra nodes to nodes in the matching instance 10:37:23 [INFO] Outputting matching problem to "SRR061958_1.fasta.matchtigs.fa.minimalperfectmatching" 10:43:41 [INFO] Running matcher at "blossom5" thread 'main' panicked at 'Matcher was unsuccessful: signal: 6 (SIGABRT) (core dumped) stderr: terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc ', src/implementation/matchtigs/mod.rs:746:9 stack backtrace: 0: rust_begin_unwind at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5 1: core::panicking::panic_fmt at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14 2: matchtigs::implementation::matchtigs::compute_matchtigs at /opt/conda/conda-bld/matchtigs_1670488620267/work/src/implementation/matchtigs/mod.rs:746:9 3: matchtigs::implementation::matchtigs::compute_matchtigs_choose_node_weight_array_type at /opt/conda/conda-bld/matchtigs_1670488620267/work/src/implementation/matchtigs/mod.rs:127:13 4: matchtigs::implementation::matchtigs::compute_matchtigs_choose_heap_type at /opt/conda/conda-bld/matchtigs_1670488620267/work/src/implementation/matchtigs/mod.rs:93:13 5: <matchtigs::implementation::matchtigs::MatchtigAlgorithm as matchtigs::implementation::TigAlgorithm>::compute_tigs at /opt/conda/conda-bld/matchtigs_1670488620267/work/src/implementation/matchtigs/mod.rs:70:9 6: matchtigs::main at /opt/conda/conda-bld/matchtigs_1670488620267/work/src/bin.rs:1135:25 7: core::ops::function::FnOnce::call_once at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/ops/function.rs:248:5 note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.

enricorox commented 1 year ago

Ok, the exception was due to blossom5 and not matchtigs. Maybe a warning about the memory requirement can be helpful since it requires O(|V|^2) if I'm correct.

sebschmi commented 1 year ago

Hi, thank you for trying out matchtigs! You are right, that matchtigs themselves are not suitable for large graphs and dense graphs, as they require a lot of memory. Use the greedy heuristic instead, it is much more memory-friendly.


From: Enrico Rossignolo @.> Sent: 22 June 2023 20:34 To: algbio/matchtigs @.> Cc: Subscribed @.***> Subject: Re: [algbio/matchtigs] std:bad_alloc: How much memory do I need to compute matchtigs? (Issue #5)

Ok, the exception was due to blossom5 and not matchtigs. Maybe a warning about the memory requirement can be helpful since it requires O(|V|^2) if I'm correct.

— Reply to this email directly, view it on GitHubhttps://github.com/algbio/matchtigs/issues/5#issuecomment-1603060227, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AORQGR6BM262YWYQ4VK7JNLXMR6ZFANCNFSM6AAAAAAZQBPWPU. You are receiving this because you are subscribed to this thread.Message ID: @.***>

enricorox commented 1 year ago

Thanks for the reply! I was just wondering how much better is the optimum solution wrt the one greedily computed 👍🏽

sebschmi commented 1 year ago

You can check the paper for some sample experiments. It is hard to estimate the amount of memory just from the number of nodes and edges. I have at least updated the documentation to warn about the high memory requirements for optimal matchtigs.

In general, also a large graph may be feasible to compute, it all depends on how many paths of length <= k (or possibly k-1, check the paper if needed ;) ) between nodes there are.