Time taken by ESPRESSO_C.pl

ashokpatowary commented 1 year ago

Is there a way to monitor the progress of ESPRESSO_C.pl by examining the partial output file?

Thanks

EricKutschera commented 1 year ago

Each thread in the C step writes a temporary file as it works. The files are in an all/ temporary directory within the numbered directory that the C job is working in. For example, one file could be work_dir/0/all/0_thread_chr1_read_final.txt. Those files essentially hold the corrected alignment information for each input read. Each read can have multiple lines in the file, but you can see how many reads have been processed so far with a command like:

cut -f 1 work_dir/0/all/0_thread_chr1_read_final.txt | uniq | wc -l

You can get the total number of reads for that C job with: wc -l work_dir/0/sam.list3
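The two commands above can be combined into a small progress check. The following is only a sketch, not part of ESPRESSO: the function names are hypothetical, the glob pattern *_read_final.txt is an assumption generalized from the single example file name above, and it assumes sam.list3 has one line per read as the wc -l command implies.

```python
import glob
import os


def count_reads_in_file(path):
    """Count runs of identical first-column values, like `cut -f 1 | uniq | wc -l`."""
    count = 0
    previous = None
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            read_id = line.rstrip("\n").split("\t", 1)[0]
            if read_id != previous:
                count += 1
                previous = read_id
    return count


def count_lines(path):
    """Equivalent of `wc -l` for a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def c_step_progress(job_dir):
    """Return (reads seen so far, total reads) for one numbered C job directory.

    ASSUMPTION: per-thread temporary files match all/*_read_final.txt,
    based only on the example path work_dir/0/all/0_thread_chr1_read_final.txt.
    """
    pattern = os.path.join(job_dir, "all", "*_read_final.txt")
    processed = sum(count_reads_in_file(p) for p in glob.glob(pattern))
    total = count_lines(os.path.join(job_dir, "sam.list3"))
    return processed, total
```

Calling c_step_progress("work_dir/0") returns a pair such as (processed, total) that can be printed periodically. Because the files are still being written, the last read in each file may be incomplete, so the processed count is an estimate.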

Having the code print the number of finished reads would be a good feature to add in a future version.

The C step can take a long time. If you use the snakemake workflow, it can split the reads into many jobs so that no single job takes too long. Having access to multiple machines to run those jobs can further reduce the overall time: https://github.com/Xinglab/espresso/blob/v1.3.1/snakemake/Snakefile#L454

ashokpatowary commented 1 year ago

@EricKutschera a follow-up query.

Can we split the SAM file by chromosome and run ESPRESSO_C.pl independently on each per-chromosome SAM file? We are planning to generate over 500 million reads from multiple samples and are thinking about ways to run it.

Another query: can we keep adding new samples in the list (-L) while running ESPRESSO_Q.pl?

Thanks

EricKutschera commented 1 year ago

Yes, you could split each SAM file by chromosome and then run ESPRESSO_C.pl for each file after the split. If you follow https://github.com/Xinglab/espresso/tree/v1.3.2#basic-usage then there is one C step for each input file. You can have multiple input files for the same sample name in the samples.tsv file, and the results from inputs with the same sample name will be aggregated in the Q step. In this way the samples.tsv file lets you manually split the reads in a sample over C step jobs.
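As an illustration of the manual split described above, here is a minimal sketch that divides a plain-text SAM file by reference name and lists the pieces under one sample name. It is not an ESPRESSO script: the function names and output file names are hypothetical, and the two-column "path, tab, sample name" layout of samples.tsv is an assumption that should be checked against the README. It relies only on standard SAM conventions (header lines start with @, the reference name is the third column, and * marks an unmapped read).

```python
import os


def split_sam_by_chromosome(sam_path, out_dir):
    """Write one SAM file per reference name and return {chromosome: path}.

    The full header is copied into every output file. Unmapped reads
    (RNAME == "*") are skipped. One file handle stays open per reference,
    so a genome with very many contigs may hit the open-file limit.
    """
    os.makedirs(out_dir, exist_ok=True)
    header = []
    handles = {}
    paths = {}
    try:
        with open(sam_path) as sam:
            for line in sam:
                if line.startswith("@"):
                    header.append(line)  # header always precedes alignments
                    continue
                fields = line.split("\t")
                if len(fields) < 3 or fields[2] == "*":
                    continue  # malformed or unmapped record
                chrom = fields[2]
                if chrom not in handles:
                    # Hypothetical naming scheme: <out_dir>/<chrom>.sam
                    paths[chrom] = os.path.join(out_dir, f"{chrom}.sam")
                    handles[chrom] = open(paths[chrom], "w")
                    handles[chrom].writelines(header)
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths


def write_samples_tsv(sam_paths, sample_name, tsv_path):
    """List every split file under the same sample name.

    ASSUMPTION: samples.tsv has two tab-separated columns,
    the SAM path followed by the sample name.
    """
    with open(tsv_path, "w") as tsv:
        for path in sam_paths:
            tsv.write(f"{path}\t{sample_name}\n")
```

A possible use is paths = split_sam_by_chromosome("sample1.sam", "split/") followed by write_samples_tsv(sorted(paths.values()), "sample1", "samples.tsv"). Splitting by chromosome gives uneven job sizes because chromosomes differ in read count, which is the motivation for the read-based split scripts mentioned in the next paragraph.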

https://github.com/Xinglab/espresso/blob/v1.3.2/snakemake/scripts/split_espresso_s_output_for_c.py is used by the snakemake workflow to split the reads in a sample so that the C step jobs will be efficient. If you can use the snakemake workflow, or call the split_espresso_s_output_for_c.py and combine_espresso_c_output_for_q.py scripts directly, then you may get a more even distribution of reads over C step jobs.

> can we keep adding new samples in the list (-L) while running ESPRESSO_Q.pl?

I'm not sure what you mean. The Q step reads the -L file once at the beginning, so changing it while the code is running should have no effect. The Q step uses some grouping information from the S step, so it only makes sense to run the Q step on results from the same S step. Also, the Q step defines isoforms by looking at all samples and then quantifies the abundance for each sample, so changing the set of samples in the Q step could change the set of detected isoforms.

ashokpatowary commented 1 year ago

Hi @EricKutschera

Sorry for the confusion, but you have cleared all my doubts.

Regards, Ashok
