dib-lab / 2020-paper-sourmash-gather

Here we describe an extension of MinHash that permits accurate compositional analysis of metagenomes with low memory and disk requirements.
https://dib-lab.github.io/2020-paper-sourmash-gather

sourmash benchmarking #47

Open ctb opened 2 years ago

ctb commented 2 years ago

Searching 1,216,187 GenBank genomes, as of March 28, 2022:

sourmash gather:

| Sample | Time | Memory | Num overlap | Min set cov |
|---|---|---|---|---|
| p8808mo11 | 3h 24m | 95.1 GB | 671,092 | 157 |
| SRR12324253 | 2h 39m | 100.3 GB | 723,079 | 24 |
| SRR1976948 | 3h 22m | 58.3 GB | 178,121 | 208 |
| SRR606249 | 2h 49m | 57.5 GB | 172,247 | 157 |

prefetch:

| Sample | Time | Memory | Num results |
|---|---|---|---|
| p8808mo11 | 3h 43m | 40.2 GB | 671,092 |
| SRR12324253 | 3h 02m | 40.2 GB | 723,079 |
| SRR1976948 | 3h 34m | 40.2 GB | 178,120 |
| SRR606249 | 2h 56m | 40.2 GB | 172,247 |

gather with prefetch picklist:

| Sample | Time | Memory | Num results |
|---|---|---|---|
| p8808mo11 | 2h 51m | 89.5 GB | 157 |
| SRR12324253 | 1h 55m | 95.7 GB | 24 |
| SRR1976948 | 0h 56m | 32.7 GB | 208 |
| SRR606249 | 0h 43m | 31.2 GB | 86 |
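
To compare the all-in-one `gather` run against the two-step workflow (prefetch, then gather with the prefetch picklist), the wall-clock times in the three tables above can be combined. A minimal Python sketch, with the times transcribed by hand from the tables into minutes; the names `times_min` and `two_step_total` are illustrative and not part of sourmash:

```python
# Wall-clock minutes transcribed from the tables above, per sample:
# (gather alone, prefetch, gather with prefetch picklist)
times_min = {
    "p8808mo11":   (3 * 60 + 24, 3 * 60 + 43, 2 * 60 + 51),
    "SRR12324253": (2 * 60 + 39, 3 * 60 + 2, 1 * 60 + 55),
    "SRR1976948":  (3 * 60 + 22, 3 * 60 + 34, 56),
    "SRR606249":   (2 * 60 + 49, 2 * 60 + 56, 43),
}


def two_step_total(sample):
    """Minutes for prefetch plus picklist gather for one sample."""
    _, prefetch, picklist = times_min[sample]
    return prefetch + picklist


for sample, (alone, prefetch, picklist) in times_min.items():
    total = two_step_total(sample)
    print(f"{sample}: gather alone {alone} min, two-step total {total} min, "
          f"picklist gather step saves {alone - picklist} min")
```

In these runs the two-step total is longer than the all-in-one run for every sample, while the picklist gather step on its own is shorter. For SRR1976948 and SRR606249 it also peaks at 32.7 GB and 31.2 GB, versus 58.3 GB and 57.5 GB for gather alone.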

p8808mo11

    Command being timed: "sourmash prefetch -k 31 p8808mo11.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o p8808mo11.prefetch.csv"
    User time (seconds): 13251.99
    System time (seconds): 85.70
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:43:24
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 40203608
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 2097
    Minor (reclaiming a frame) page faults: 43474606
    Voluntary context switches: 14849
    Involuntary context switches: 1695569
    Swaps: 0
    File system inputs: 83537784
    File system outputs: 375344
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
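
The Time and Memory columns in the summary tables appear to come from the `Elapsed (wall clock) time` and `Maximum resident set size` fields of logs like the one above. A minimal Python sketch for pulling those two fields out of such a log follows. The function names `parse_elapsed` and `parse_time_log` are hypothetical, and dividing kbytes by 10^6 to report "GB" is an assumption, chosen because it reproduces the 40.2 GB in the prefetch table.

```python
def parse_elapsed(value):
    """Convert 'h:mm:ss' or 'm:ss.ff' into seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_time_log(text):
    """Extract wall-clock seconds and peak memory from a `time -v`-style log."""
    result = {}
    for line in text.splitlines():
        # Field labels contain colons too, so split on the last ": ".
        label, sep, value = line.strip().rpartition(": ")
        if not sep:
            continue
        if label.startswith("Elapsed (wall clock) time"):
            result["elapsed_s"] = parse_elapsed(value)
        elif label.startswith("Maximum resident set size"):
            # Assumption: kbytes / 1e6 is reported as "GB", as in the tables.
            result["max_rss_gb"] = int(value) / 1e6
    return result


log = """\
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:43:24
    Maximum resident set size (kbytes): 40203608
"""
stats = parse_time_log(log)
print(round(stats["elapsed_s"] / 60), "min,", round(stats["max_rss_gb"], 1), "GB")
```

For the prefetch log above this gives 223 minutes and 40.2 GB, which matches the 3h 43m and 40.2 GB entries in the prefetch table.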

and

    Command being timed: "sourmash gather -k 31 --picklist p8808mo11.prefetch.csv::prefetch p8808mo11.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o p8808mo11.gather.csv"
    User time (seconds): 9691.07
    System time (seconds): 531.89
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:50:54
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 89530484
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 14237
    Minor (reclaiming a frame) page faults: 562213908
    Voluntary context switches: 18586
    Involuntary context switches: 1894971
    Swaps: 0
    File system inputs: 12891664
    File system outputs: 1456
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
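
The two-step workflow timed above (a `prefetch` run whose CSV output is handed to `gather` through `--picklist ...::prefetch`) can be scripted per sample. The sketch below only builds the argument lists, using the flags that appear in the commands above. The helper name `two_step_commands` is hypothetical, and prefixing `/usr/bin/time -v` is an assumption about how the logs were produced (their format matches GNU time's verbose output).

```python
import shlex


def two_step_commands(sample, databases, ksize=31, timed=True):
    """Return (prefetch_cmd, gather_cmd) argument lists for one sample."""
    sig = f"{sample}.abundtrim.sig.gz"
    prefetch_csv = f"{sample}.prefetch.csv"
    # Assumption: GNU time in verbose mode produced the logs in this thread.
    prefix = ["/usr/bin/time", "-v"] if timed else []
    prefetch = prefix + ["sourmash", "prefetch", "-k", str(ksize), sig,
                         *databases, "-o", prefetch_csv]
    # The picklist restricts gather to the matches that prefetch reported.
    gather = prefix + ["sourmash", "gather", "-k", str(ksize),
                       "--picklist", f"{prefetch_csv}::prefetch", sig,
                       *databases, "-o", f"{sample}.gather.csv"]
    return prefetch, gather


dbs = [f"genbank-2022.03/genbank-2022.03-{group}-k31.zip"
       for group in ("archaea", "bacteria", "fungi", "protozoa", "viral")]
for cmd in two_step_commands("p8808mo11", dbs):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`, running the prefetch command first so that its CSV exists when gather reads the picklist.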

and all-in-one prefetch+gather:

    Command being timed: "sourmash gather -k 31 p8808mo11.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o p8808mo11.gather.alone.csv"
    User time (seconds): 11815.18
    System time (seconds): 386.17
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:23:55
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 95106616
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 2057
    Minor (reclaiming a frame) page faults: 703070966

ctb commented 2 years ago

SRR12324253

        Command being timed: "sourmash prefetch -k 31 SRR12324253.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR12324253.prefetch.csv"
        User time (seconds): 10815.54
        System time (seconds): 58.50
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01:57
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 40201292
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 2510
        Minor (reclaiming a frame) page faults: 32519901
        Voluntary context switches: 17903
        Involuntary context switches: 1691148
        Swaps: 0
        File system inputs: 83594008
        File system outputs: 404088
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

        Command being timed: "sourmash gather -k 31 --picklist SRR12324253.prefetch.csv::prefetch SRR12324253.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR12324253.gather.csv"
        User time (seconds): 6719.18
        System time (seconds): 170.06
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:53
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 95725764
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 2
        Minor (reclaiming a frame) page faults: 336997207
        Voluntary context switches: 2100
        Involuntary context switches: 1327653
        Swaps: 0
        File system inputs: 7784
        File system outputs: 232
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

        Command being timed: "sourmash gather -k 31 SRR12324253.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR12324253.gather.alone.csv"
        User time (seconds): 9349.63
        System time (seconds): 183.31
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:38:59
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 100313252
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 340268298
        Voluntary context switches: 2103
        Involuntary context switches: 1765414
        Swaps: 0
        File system inputs: 7352
        File system outputs: 240
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

SRR1976948

        Command being timed: "sourmash prefetch -k 31 SRR1976948.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR1976948.prefetch.csv"
        User time (seconds): 12797.64
        System time (seconds): 38.73
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:34:12
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 40245220
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 42615281
        Voluntary context switches: 2190
        Involuntary context switches: 1967380
        Swaps: 0
        File system inputs: 30224
        File system outputs: 98920
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

        Command being timed: "sourmash gather -k 31 --picklist SRR1976948.prefetch.csv::prefetch SRR1976948.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR1976948.gather.csv"
        User time (seconds): 3327.80
        System time (seconds): 70.64
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 56:43.53
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 32661224
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 128883947
        Voluntary context switches: 2116
        Involuntary context switches: 987203
        Swaps: 0
        File system inputs: 30192
        File system outputs: 488
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

        Command being timed: "sourmash gather -k 31 SRR1976948.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR1976948.gather.alone.csv"
        User time (seconds): 11999.73
        System time (seconds): 125.85
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:22:29
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 58319908
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 242282095
        Voluntary context switches: 2128
        Involuntary context switches: 1477330
        Swaps: 0
        File system inputs: 30192
        File system outputs: 512
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ctb commented 2 years ago

SRR606249

        Command being timed: "sourmash prefetch -k 31 SRR606249.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR606249.prefetch.csv"
        User time (seconds): 10519.05
        System time (seconds): 36.37
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:56:04
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 40231448
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 31396954
        Voluntary context switches: 2205
        Involuntary context switches: 1459688
        Swaps: 0
        File system inputs: 18584
        File system outputs: 94536
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

        Command being timed: "sourmash gather -k 31 --picklist SRR606249.prefetch.csv::prefetch SRR606249.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR606249.gather.csv"
        User time (seconds): 2536.77
        System time (seconds): 50.03
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 43:08.91
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 31155660
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 109382702
        Voluntary context switches: 2104
        Involuntary context switches: 851265
        Swaps: 0
        File system inputs: 18552
        File system outputs: 336
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

        Command being timed: "sourmash gather -k 31 SRR606249.abundtrim.sig.gz genbank-2022.03/genbank-2022.03-archaea-k31.zip genbank-2022.03/genbank-2022.03-bacteria-k31.zip genbank-2022.03/genbank-2022.03-fungi-k31.zip genbank-2022.03/genbank-2022.03-protozoa-k31.zip genbank-2022.03/genbank-2022.03-viral-k31.zip -o SRR606249.gather.alone.csv"
        User time (seconds): 10030.44
        System time (seconds): 100.26
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:49:00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 57475324
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 204563557
        Voluntary context switches: 2117
        Involuntary context switches: 1596718
        Swaps: 0
        File system inputs: 18552
        File system outputs: 784
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ctb commented 2 years ago

The most interesting biological thing here is the increase in the number of gather matches, especially for SRR1976948 (the oil well sample), which went from 135 to 208. I wouldn't necessarily have expected that! Cool that really new genomes/organisms/etc. keep being entered into the databases!