PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

How to get fc_run.py to skip some bad m_ directories #391

Open dgordon562 opened 8 years ago

dgordon562 commented 8 years ago

Hi, Chris,

I would sometimes prefer to have falcon just skip a problem las file rather than have to debug it. Right now, for example, I have 3 m_ directories (out of thousands) that are giving LA4Falcon problems and holding up the whole assembly. In the past I've generally debugged it, and sometimes that involves rerunning daligner and redoing days of work. But this time I just want to move on.

So how can I fake out fc_run.py so it doesn't try to run LA4Falcon on these particular blocks?

Sure, I can set the "done" flag in 0-rawreads/preads. But what should I do about the 0-rawreads/preads/out.fasta file? Ideally, I'd like this sequence to not be put into preads4falcon.fasta. Could I make the out.fasta an empty file? Or have it missing? Or make it a sequence with one base? Or ??? Anything you could suggest that would be easy?

Thanks! David

pb-cdunn commented 8 years ago

Let's see...

  1. One idea is to generate an empty .las file. That can be done with restrictive daligner settings. Maybe you can try that in a pinch.
  2. Another is to touch the .fasta. For now, that would leave an empty file, which some tools will barf on. I'd like to fix such tools, but I might not have the authority at PacBio.
  3. Another is to create a 0-length fasta, with a header.
  4. A 1-length header would definitely work.
  5. "Missing" will not work today. That will be possible someday (after I move every task into a unique directory) but I can't guess when.

The main problem is that you need to create the right done files too. And remember to rm -rf mypwatcher/ before restarting.
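As a rough illustration of the placeholder-plus-done-file approach, here is a minimal Python sketch. Everything named in it is an assumption rather than something fc_run.py is known to require: the function names are hypothetical, the output file name is taken from the out.fasta mentioned in the question, the done-file name is a guess that should be checked against a job directory that finished normally, and the default header is only a stand-in.

```python
import shutil
from pathlib import Path


def stub_out_job(job_dir, fasta_name="out.fasta", done_name="done",
                 header="placeholder/0/0_1"):
    """Write a one-base FASTA record into job_dir and create a done file.

    fasta_name, done_name and header are assumptions; compare them with a
    job directory that completed normally before relying on them.
    """
    job = Path(job_dir)
    job.mkdir(parents=True, exist_ok=True)
    fasta = job / fasta_name
    # One header line and a single base, so the file is not empty.
    fasta.write_text(f">{header}\nA\n")
    # Create the (assumed) done marker so the job is not rerun.
    (job / done_name).touch()
    return fasta


def clear_watcher_state(run_dir, watcher_dir="mypwatcher"):
    """Same effect as `rm -rf mypwatcher/` inside run_dir."""
    shutil.rmtree(Path(run_dir) / watcher_dir, ignore_errors=True)
```

Calling stub_out_job once for each bad m_ directory and then clear_watcher_state on the run directory before restarting would mirror the steps described above.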

dgordon562 commented 8 years ago

Thanks!

I think in #4 you mean a 1-length sequence such as:

>dummy
a

correct?

pb-cdunn commented 8 years ago

Yes, just 1 base. But you need a header that fasta2DB can parse. (In 0.6 we had fasta2fasta.py to be more lenient, but Gene has since improved fasta2DB, so we've removed fasta2fasta.py because it's very slow on large genomes.)

Basically, your ideas are good, but that one is the most likely to succeed.
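For the header requirement, a small sketch of building a one-base record follows. It assumes fasta2DB wants a PacBio-style header of the form movie/well/start_end; that shape, the regular expression, and the names placeholder_record and looks_like_pacbio_header are illustrative assumptions, not taken from this thread, so verify against headers in your own input FASTA files.

```python
import re

# Assumed PacBio-style header: >movie/well/start_end, optionally followed by
# whitespace and extra fields. This pattern is an illustration only.
PACBIO_HEADER = re.compile(r"^>[^/\s]+/\d+/\d+_\d+(\s.*)?$")


def placeholder_record(movie="dummy_movie", well=0, base="A"):
    """Return a FASTA record with a PacBio-style header and a short sequence."""
    header = f">{movie}/{well}/0_{len(base)}"
    return f"{header}\n{base}\n"


def looks_like_pacbio_header(line):
    """Check a header line against the assumed movie/well/start_end shape."""
    return PACBIO_HEADER.match(line) is not None
```

Under this assumption, a bare header such as >dummy would fail the check, while the record produced by placeholder_record would pass.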