PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Q: Delete raw_reads.*.raw_reads.*.[CN]?.las files #283

Open bredeson opened 8 years ago

bredeson commented 8 years ago

Hey Falcon Devs,

Thank you for the great software! Is Falcon restartable from a later stage if the raw_reads.*.raw_reads.*.[CN]?.las files in the 0-rawreads/job_* dirs are deleted? These files are redundant once a 0-rawreads/job_*/job_*_done file has been written and the L1.*.*.las files are present, and they roughly double the storage space required for an assembly project. This creates (or will create) difficulties for large genomes.

Thanks for your help,

Jessen

pb-jchin commented 8 years ago

Those are used in the first error correction step. If you only re-run the assembly from the error-corrected reads and the *done files are there, then you don't need them.

pb-cdunn commented 8 years ago

Well, some steps depend on more than just the 'done' files. If a file is listed explicitly as a dependency and you delete it, it will be recreated upon a restart, or upon any re-entry to the refresh-loop.

You're right. We don't currently clean up all intermediate files. And if they are not listed as dependencies anywhere, then you can delete them safely.

bredeson commented 8 years ago

I see in the pypeflow.log that these intermediate .las files are checked for when Falcon restarts. How trivial would it be for future versions of Falcon to clean these files up (and release them from being dependencies) when a job* is complete? I suppose it would require recursively checking that the job*_done file exists and that the corresponding L1.N.M.las file exists and is valid... Just a thought.
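A rough sketch of that check, as a dry run that only lists candidates and deletes nothing. It assumes the L1.*.*.las files sit in the same job_* dir as the done file, and it treats "non-empty" as a weak stand-in for "valid"; both are assumptions, not something Falcon guarantees. The function names are made up for illustration. As noted above, any file still listed as a dependency would be recreated on restart.

```python
import sys
from pathlib import Path

# Glob patterns taken from the file names discussed in this thread.
INTERMEDIATE_PATTERN = "raw_reads.*.raw_reads.*.[CN]?.las"
MERGED_PATTERN = "L1.*.*.las"


def find_deletable_las(rawreads_dir):
    """List intermediate .las files in job dirs that look finished.

    A job dir counts as finished here if it holds a job_*_done file and
    at least one non-empty L1.*.*.las file (assumed to be in the same dir).
    """
    deletable = []
    for job_dir in sorted(Path(rawreads_dir).glob("job_*")):
        if not job_dir.is_dir():
            continue
        has_done = any(job_dir.glob("job_*_done"))
        # Non-zero size is only a crude validity check.
        merged = [p for p in job_dir.glob(MERGED_PATTERN) if p.stat().st_size > 0]
        if has_done and merged:
            deletable.extend(sorted(job_dir.glob(INTERMEDIATE_PATTERN)))
    return deletable


def total_bytes(paths):
    """Sum the on-disk sizes of the given files."""
    return sum(p.stat().st_size for p in paths)


if __name__ == "__main__":
    # Dry run: print candidates and the space they occupy.
    root = sys.argv[1] if len(sys.argv) > 1 else "0-rawreads"
    candidates = find_deletable_las(root)
    for path in candidates:
        print(path)
    print(f"{len(candidates)} files, {total_bytes(candidates)} bytes")
```

Run it from the assembly directory, or pass the path to 0-rawreads as the first argument, and review the list before removing anything by hand.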

Thanks guys, your timely responses are appreciated.

Best, Jessen

pb-cdunn commented 8 years ago

I think we'd be better off now if only the done files were dependencies. Users would have some flexibility. I added them only because we had some cases of uncaught failures, but that is less of a problem today.

You can modify your own code for now. Pretty simple, and completely safe. (Some non-distributed tasks don't have 'done' files or equivalent, so don't change those.) I'll try to update this next week. No time today.