Xinglab / espresso

Other
48 stars 4 forks source link

Estimate time cost for ESPRESSO_C #29

Open chenxran opened 10 months ago

chenxran commented 10 months ago

Hi there, I am running espresso with 1M simulated long reads and I found that the ESPRESSO_C step is particularly time-consuming and just cannot finish. I've notice there are other people also commented on the duration of this step. Is it possible to provide a estimation of the time cost for the C step (e.g., it may take hours/days for 1M long reads)? Thank you!

EricKutschera commented 10 months ago

I just ran ESPRESSO on a single input SAM file with 1 million reads. I ran with 2 threads and it took 3 hours to finish the C step

The time it takes depends on the dataset and the number of C step jobs and threads. Increasing the number of threads should reduce the time. The snakemake can also split up the reads into small jobs if you have a compute cluster: https://github.com/Xinglab/espresso/blob/v1.3.2/snakemake/snakemake_config.yaml#L28

If the C step is taking much longer than expected then there might be something about your dataset that ESPRESSO is not handling well. If that's the case, I can try looking into the dataset if you post your inputs

chenxran commented 10 months ago

Hi Eric, thank you for your response, and sorry for the late update.

My command for running espresso is shown as follows:

perl ESPRESSO_S.pl \
    -L $sample_file \
    -F $genome_file \
    -A $gtf_file \
    -O $work_dir

perl ESPRESSO_C.pl \
    -I $work_dir \
    -F $genome_file \
    -T 20 \
    -X 0

My current running log looks like this:

Building a new DB, current time: 08/19/2023 10:21:18
New DB name:   /scratch/kinfai_root/kinfai0/chenxran/LRQuant/data/espresso-s-output/cDNA-ONT/1M/0/blast_7416/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 8 sequences in 0.00215292 seconds.

3780    0
3781    0
3782    0
3783    11

Building a new DB, current time: 08/19/2023 10:21:21
New DB name:   /scratch/kinfai_root/kinfai0/chenxran/LRQuant/data/espresso-s-output/cDNA-ONT/1M/0/blast_3783/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 11 sequences in 0.00165606 seconds.

212 8

Building a new DB, current time: 08/19/2023 10:21:22
New DB name:   /scratch/kinfai_root/kinfai0/chenxran/LRQuant/data/espresso-s-output/cDNA-ONT/1M/0/blast_212/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 3 sequences in 0.00317502 seconds.

Logs in similar formats repeat for about 10 hours and continued until the time limit was reached.

EricKutschera commented 10 months ago

The commands and log you posted look fine. Can you post the actual input files?

chenxran commented 10 months ago

By input file you mean the work dir I input for step-C, or the sample file I used for the first step?

For the sample file, my sample file looks like this:

/scratch/kinfai_root/kinfai0/chenxran/LRQuant/data/Generate_sim_data/cDNA-ONT/1M/simulated_aligned_reads.sorted.bam /scratch/kinfai_root/kinfai0/chenxran/LRQuant/data/espresso-s-output/cDNA-ONT/1M/1M

The work dir currently has the following files/folders:

drwxrwsr-x 12937 chenxran kinfai_root 512K Aug 19 10:12 0
-rw-rw-r--     1 chenxran kinfai_root  202 Aug 16 16:14 cDNA-ONT-1M.tsv.updated
-rw-rw-r--     1 chenxran kinfai_root 632K Aug 16 16:26 chr10_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 853K Aug 16 16:26 chr11_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 842K Aug 16 16:26 chr12_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 290K Aug 16 16:26 chr13_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 504K Aug 16 16:26 chr14_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 505K Aug 16 16:26 chr15_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 697K Aug 16 16:26 chr16_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 894K Aug 16 16:26 chr17_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 205K Aug 16 16:26 chr18_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 983K Aug 16 16:26 chr19_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 1.5M Aug 16 16:26 chr1_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 411K Aug 16 16:26 chr20_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 176K Aug 16 16:26 chr21_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 391K Aug 16 16:26 chr22_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 1.1M Aug 16 16:26 chr2_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 951K Aug 16 16:26 chr3_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 551K Aug 16 16:26 chr4_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 681K Aug 16 16:26 chr5_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 711K Aug 16 16:26 chr6_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 742K Aug 16 16:26 chr7_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 512K Aug 16 16:26 chr8_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 675K Aug 16 16:26 chr9_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root 516K Aug 16 16:26 chrX_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root  21K Aug 16 16:26 chrY_SJ_simplified.list
-rw-rw-r--     1 chenxran kinfai_root  13M Aug 16 16:26 SJ_group_all.fa

Here 0 is the only folder and it seems to save thousands of folders named "blast_{numbers}" generated from Step C.

Btw, I am curious if I failed in running step C, do I need to re-run step S as it seems that step C also writes something into the work dir?

Thanks again for your kindly help.

EricKutschera commented 10 months ago

By input I mean the .bam file, $genome_file, and $gtf_file. If you post links to those files then I can download them and run ESPRESSO to see if anything about those inputs causes ESPRESSO to be slow

You do not need to run the S step again if the C step fails. The code removes any old files before running a new C step: https://github.com/Xinglab/espresso/blob/v1.3.2/src/ESPRESSO_C.pl#L135

chenxran commented 10 months ago

Thank you again for your effort in helping solve my problem. I have wrapped all files into a zip file and uploaded it to Google Drive. I will send you the link via email.

EricKutschera commented 10 months ago

I was able to run using the files you uploaded. The C step took 1.5 hours with 5 threads. All the ESPRESSO steps together took 2 hours

I re-ran using -T 20 threads to match the command you posted and it took 15 hours. From watching the processes in htop it looks like with 20 threads the different threads spend most of the time blocking. I think the reason is that the C step involves running blastn and reading and writing temporary files to pass the data between ESPRESSO and blastn. The 20 threads end up competing for the filesystem and actually make the C step slower than with 5 threads

I usually run ESPRESSO with the snakemake which can be configured to split the reads over many small C step jobs where each C step job can run on a different machine. Running on different machines seems to avoid the issue of threads competing for the filesystem

For your data, I think running with -T 5 should let the C step finish in a few hours

chenxran commented 10 months ago

Great thanks! I will have a try for -T 5 right now. Hope that will also work for me.

chenxran commented 10 months ago

Hi Eric, I tried setting -T 5 while the program still got killed due to the time limit. I am wondering what the log file looks like for the C step in your runs. My log file only has about 10k lines for about 10-hour running. I am trying to further reduce -T 5 to -T 2 right now.

EricKutschera commented 10 months ago

The finished log file is about 190k lines. It starts out like:

[Mon Aug 28 10:22:45 2023] Loading splice junction info
[Mon Aug 28 10:24:27 2023] Requesting system to split SAMLIST into 5 pieces
 Divided SAM(LIST) sizes:
 sam.list3aa        280245027
 sam.list3ab        280245027
...
[Mon Aug 28 10:24:35 2023] Loading references
[Mon Aug 28 10:24:59 2023] Scanning SAMLIST by 2 workers
 Worker 1 begins to scan sam.list3aa.

Almost all of the file is lines like:

Building a new DB, current time: 08/28/2023 10:25:00
New DB name:   /scr1/users/kutscherae/espresso/snakemake/espresso_out/c_work_dir/0/0/blast_0/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 7 sequences in 0.0221341 seconds.

At the end it has:

Worker 2 finished reporting.
Worker 3 finished reporting.
Worker 1 finished reporting.
Worker 5 finished reporting.
Worker 4 finished reporting.
[Mon Aug 28 11:50:18 2023] ESPRESSO_C finished its work.

Since there are only 10k lines after 10 hours something is preventing the processes from getting work done. You could try watching the processes in htop to see if they're waiting for disk or if memory or CPUs on the machine are maxed out

chenxran commented 10 months ago

Hi Eric, I tried downgrading BLAST from 2.14.0 to 2.8.1, and it seems to work fine and speed up a lot. Maybe the latest version of BLAST will lead to such an issue. Thanks for your help!

EricKutschera commented 10 months ago

I tried running a 100k read dataset with a few BLAST versions. I tested with these BLAST versions: 2.10.1, 2.11.0, 2.12.0, 2.13.0, 2.14.0, 2.14.1

The ESPRESSO C steps with 2.10.1, 2.12.0, and 2.14.1 all finished quickly with 2.14.1 having the lowest ESPRESSO runtime. 2.13.0 took about 3 times as long to finish. 2.11.0 and 2.14.0 took about 6 times as long to finish

It's not clear from the BLAST release notes (https://www.ncbi.nlm.nih.gov/books/NBK131777/) what would cause the slowdown in any of those versions, but the release notes do mention a "major restructuring of the module that reads the BLAST databases"

chenxran commented 10 months ago

Not sure which issue caused this slowdown. From my observation under 2.14.0, suppose I set up -T 2, then the log output will look like this

Building a new DB, current time: 08/31/2023 14:27:58
New DB name:   1M/0/blast_0/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 7 sequences in 0.00545502 seconds.

8715    9

Building a new DB, current time: 08/31/2023 14:28:00
New DB name:   1M/0/blast_8715/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 9 sequences in 0.00442195 seconds.

1   12

Building a new DB, current time: 08/31/2023 14:31:00
New DB name:   1M/0/blast_1/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 11 sequences in 0.00478601 seconds.

8716    0
8717    24

Building a new DB, current time: 08/31/2023 14:31:02
New DB name:   1M/0/blast_8717/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 12 sequences in 0.00386095 seconds.

2   0
3   0
4   0
5   11

Building a new DB, current time: 08/31/2023 14:34:03
New DB name:   1M/0/blast_5/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 8 sequences in 0.00404501 seconds.

Building a new DB, current time: 08/31/2023 14:34:04
New DB name:   1M/0/blast_8717/current_db
New DB title:  current_db
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named 1M/0/blast_8717/current_db
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 7 sequences in 0.002249 seconds.

6   0
7   3

Building a new DB, current time: 08/31/2023 14:37:05
New DB name:   1M/0/blast_8717/current_db
New DB title:  current_db
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named 1M/0/blast_8717/current_db
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 5 sequences in 0.00215101 seconds.

Building a new DB, current time: 08/31/2023 14:37:06
New DB name:   1M/0/blast_7/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 3 sequences in 0.00331497 seconds.

8   0
9   2
10  34

Building a new DB, current time: 08/31/2023 14:40:08
New DB name:   1M/0/blast_10/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 16 sequences in 0.00232792 seconds.

8718    0
8719    6

Building a new DB, current time: 08/31/2023 14:40:11
New DB name:   1M/0/blast_8719/current_db
New DB title:  current_db
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 2 sequences in 0.00334597 seconds.

Every time the program completes the "adding sequences" operation for the number of thread times, it will always sleep for almost exactly 3min for an unknown reason. It just looks like there is a sleep(180) in the code.