PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

determining how many jobs to go #46

Open dgordon562 opened 9 years ago

dgordon562 commented 9 years ago

Hi, Jason,

At the initial daligner stage (where it is working in 0-rawreads), it submits hundreds of jobs with names such as: d_7741e0bb_raw_reads-ddbf9706. Is there any way to determine how many more such jobs there will be? Or, since I know how many have been submitted (completed and still running), is there any way to know the total that must be completed? Is this information available, for example, in raw_reads.db (or can it be calculated from there)? I just want to know whether I am, for example, 30% done or 99% done.

Thanks, David

pb-jchin commented 9 years ago

The job directories are created before the jobs are submitted, so you can get the total from the number of directories. Or you can check runjobs.sh, which lists the individual jobs. I typically do a `find . -name "job*done" | wc` to figure out the progress.
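
A minimal sketch of the same check in Python. It assumes each job gets a directory matching `job_*` (a hypothetical naming pattern; adjust it to whatever appears under 0-rawreads) and that finished jobs leave a file matching `job*done`, as in the `find` command above.

```python
from pathlib import Path


def daligner_progress(root, dir_pattern="job_*", done_pattern="job*done"):
    """Return (done, total, fraction_done) for jobs found under `root`.

    `dir_pattern` is an assumed naming scheme for job directories;
    `done_pattern` mirrors the pattern used with `find` in this thread.
    """
    root = Path(root)
    # Total jobs: one directory per job, created before submission.
    total = sum(1 for p in root.rglob(dir_pattern) if p.is_dir())
    # Finished jobs: files whose names match the "done" pattern.
    done = sum(1 for p in root.rglob(done_pattern) if p.is_file())
    fraction = done / total if total else 0.0
    return done, total, fraction
```

Calling `daligner_progress("0-rawreads")` from the working directory would then give the completed count, the total, and the fraction done.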

dgordon562 commented 9 years ago

Great! Thanks!

Looking at the daligner jobs, the first uses 2 input files, the next uses 3 input files, the next uses 4 input files ...

So is the 200th job going to take twice as long as the 100th job? And do I therefore need a quadratic rather than a linear projection of the time to completion (e.g., the 2nd half of the jobs will take 3 times as long as the 1st half)?

pb-jchin commented 9 years ago

@dgordon562 I think Gene's blog has a description of how he partitions the jobs. For a large assembly, the smaller jobs are faster, and the larger jobs are all roughly the same size, so I typically use linear extrapolation to estimate the finish time. It has worked well so far.
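
As a rough illustration of that linear extrapolation (a sketch only; the function name and the choice of hours as the unit are made up for this example):

```python
def linear_eta_hours(done, total, elapsed_hours):
    """Estimate remaining hours, assuming jobs keep finishing at the same rate."""
    if done <= 0 or elapsed_hours <= 0:
        raise ValueError("need at least one finished job and some elapsed time")
    # Average throughput so far, in jobs per hour.
    rate = done / elapsed_hours
    # Remaining jobs divided by that throughput.
    return (total - done) / rate
```

For example, with 50 of 200 jobs finished after 10 hours, this gives 30 more hours. The estimate is only as good as the assumption that the remaining jobs are about the same size as the finished ones.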

The other thing you can do is get the run time for each individual block comparison. There are N*(N-1)/2 block comparisons, so you can estimate the total CPU time from that. The wall clock time does depend on how even the partitions are.
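
A small sketch of that CPU-time estimate, assuming you have already measured the run time of a few block comparisons yourself (the function and parameter names are hypothetical, and using the mean of the samples is one possible choice):

```python
from statistics import mean


def estimate_total_cpu_hours(n_blocks, sample_hours):
    """Estimate total CPU hours from N*(N-1)/2 block comparisons.

    `sample_hours` holds measured run times (in hours) for a few
    individual block comparisons; their mean is used for all of them.
    """
    if not sample_hours:
        raise ValueError("need at least one measured comparison time")
    # Number of block comparisons, as given in this thread.
    n_comparisons = n_blocks * (n_blocks - 1) // 2
    return n_comparisons * mean(sample_hours)
```

With 10 blocks and sampled times of 1, 2, and 3 hours, this gives 45 comparisons at a mean of 2 hours each, i.e. 90 CPU hours. It says nothing about wall clock time, which depends on how the comparisons are grouped into jobs and how many run at once.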