dib-lab / kProcessor

kProcessor: kmers processing framework.
https://kprocessor.readthedocs.io
BSD 3-Clause "New" or "Revised" License

DiffExp usecase issue #75

Closed mr-eyes closed 3 years ago

mr-eyes commented 3 years ago

Note: The time taken for the two KMC DBs to be loaded and counted is 2.5 hrs.

Reproduce:

# Create data dir
mkdir data && cd data

# Download and extract the human protein-coding transcripts
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.pc_transcripts.fa.gz
gunzip gencode.v37.pc_transcripts.fa.gz

# Generate names file by gene
grep ">" gencode.v37.pc_transcripts.fa | cut -c2- |  awk -F'|' '{print $0"\t"$2}' > gencode.v37.pc_transcripts.fa.names

# Download SRA samples
prefetch --progress --resume yes --verify yes --output-directory ./ DRR252191
prefetch --progress --resume yes --verify yes --output-directory ./ DRR252181

# Fastq dumping
fastq-dump  DRR252191.sra --skip-technical
fastq-dump  DRR252181.sra --skip-technical

# KMC DB
kmc -ci1 -t6 -k31  DRR252191.fastq DRR252191.kmc ./
kmc -ci1 -t6 -k31  DRR252181.fastq DRR252181.kmc ./

cd ..

# Run diff exp
/usr/bin/time -v ./kDifferentialExpression/kDifferntialExpression -g data/gencode.v37.pc_transcripts.fa -s data/DRR252191.kmc -c data/DRR252181.kmc -o test_out
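
For readers unfamiliar with the names-file step in the script above, here is a minimal sketch of what that grep/cut/awk pipeline produces for a single record. The header is a made-up placeholder that only mimics a pipe-delimited `transcript|gene|...` layout (an assumption, not a line copied from the downloaded FASTA). The variable names `header` and `names_line` are likewise only for illustration.

```shell
# Hypothetical header in an assumed "transcript|gene|..." pipe-delimited layout
header='>ENST00000000001.1|ENSG00000000001.1|-|-|GENE1-201|GENE1|1000|'

# Same pipeline as in the reproduce script:
#   grep keeps header lines, cut drops the leading '>',
#   awk appends the second '|'-separated field (the gene ID) after a tab
names_line=$(printf '%s\n' "$header" | grep ">" | cut -c2- | awk -F'|' '{print $0"\t"$2}')

printf '%s\n' "$names_line"
```

Each line of the resulting file is therefore the full header (without `>`), a tab, and the gene ID, so all transcripts of one gene share the same second column.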

Output

Load data/DRR252191.kmc kmers: 1131031196
Total count = 9.6342e+08
Load data/DRR252181.kmc kmers: 1454212964
Total count = 1.15643e+09
Command terminated by signal 11
        Command being timed: "./kDifferentialExpression/kDifferntialExpression -g data/gencode.v37.pc_transcripts.fa -s data/DRR252191.kmc -c data/DRR252181.kmc -o test_out"
        User time (seconds): 13502.19
        System time (seconds): 116.82
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:47:33
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 12059576
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 6695172
        Voluntary context switches: 1568
        Involuntary context switches: 1303440
        Swaps: 0
        File system inputs: 4219424
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

shokrof commented 3 years ago

Fixed on the V2 branch (7daad822). The whole use case now takes 33 minutes, not just the loading. This is the output of time:

        Command being timed: "./kDifferntialExpression -g data/gencode.v37.pc_transcripts.fa -s data/DRR252181.kmc -c data/DRR252191.kmc -o data/out"
        User time (seconds): 1982.93
        System time (seconds): 53.16
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 33:57.70
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 18770860
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 10925536
        Voluntary context switches: 41
        Involuntary context switches: 199930
        Swaps: 0
        File system inputs: 0
        File system outputs: 896
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
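
For a rough side-by-side of the two `/usr/bin/time` reports in this thread, the following sketch converts the elapsed times to seconds and the peak resident set sizes to GiB. The four input values are copied from the reports. The helper name `to_seconds` and the variable names are made up for this example. Note that the first run was terminated by signal 11, so its elapsed time does not describe a completed run.

```shell
# Values copied from the two /usr/bin/time reports in this thread
old_wall="3:47:33"      # h:mm:ss, run that ended with signal 11
new_wall="33:57.70"     # m:ss, run on the V2 branch
old_rss_kb=12059576     # Maximum resident set size (kbytes), first run
new_rss_kb=18770860     # Maximum resident set size (kbytes), V2 run

# Convert "h:mm:ss" or "m:ss(.ff)" to seconds (hypothetical helper)
to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}

old_s=$(to_seconds "$old_wall")
new_s=$(to_seconds "$new_wall")

# Print the wall-clock ratio and the peak memory of both runs (1 GiB = 1048576 kB)
awk -v o="$old_s" -v n="$new_s" -v orss="$old_rss_kb" -v nrss="$new_rss_kb" 'BEGIN {
    printf "wall clock: %.0f s -> %.0f s (%.1fx shorter)\n", o, n, o / n
    printf "peak RSS:   %.1f GiB -> %.1f GiB\n", orss / 1048576, nrss / 1048576
}'
```

With the numbers above this works out to roughly a 6.7x shorter wall-clock time, while peak resident memory rises from about 11.5 GiB to about 17.9 GiB.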