cschin / Peregrine

Peregrine: Fast Genome Assembler Using SHIMMER Index

crash in 2-ovlp #16

Open macmanes opened 4 years ago

macmanes commented 4 years ago

Hi:

The run fails in the 2-ovlp stage. 14 of the 24 chunks finish successfully and the other 10 fail; see the attached logs. I'm running the command below on a machine with 768 GB of RAM. I have about 50x PacBio data for a large genome.

pg_run.py asm \
/mnt/lustre/macmaneslab/ams1236/imitator_genome/falcon/PacBioFastaFiles.fofn \
24 24 24 24 24 24 24 24 24 \
--with-consensus --shimmer-r 3 --best_n_ovlp 8 \
--output peregrine_run

run-Pc5724b9d1fd6a8.bash.stdout.txt run-Pc5724b9d1fd6a8.bash.stderr.txt

macmanes commented 4 years ago

anything?

cschin commented 4 years ago

(1) Have you tried re-running it? If it is an OOM issue, re-running may resume the jobs. It seems to me you should have enough memory for that. (2) I just want to confirm that the data are CCS reads. (3) If you can share the data, I can take a look in detail. Thanks.

macmanes commented 4 years ago

  1. Yes, re-running fails the same way. I agree: there is no sign of an OOM from either the software or SLURM.
  2. No, these are CLR reads. Sequel, 8M chips.
  3. I can share the data, but at 300 GB of FASTA it will take a while.

cschin commented 4 years ago

The current overlap-detection algorithm in Peregrine is designed for reads longer than 10 kb with accuracy above 99%. This is not specific to CCS reads, but CLR reads won't satisfy the accuracy requirement. I would like to look into how to speed up CLR assembly, but given limited resources (this is a second-night-time-job project), we will focus on accurate reads first.
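
For anyone who wants to check their own reads against the length criterion mentioned above, here is a minimal sketch in plain Python. It is not part of Peregrine; the function names `fasta_lengths` and `fraction_longer_than` are made up for illustration, and the 10,000-base threshold is simply taken from the comment above. Note that per-read accuracy cannot be derived from a FASTA file, so this only covers the length half of the requirement.

```python
import io


def fasta_lengths(handle):
    """Yield the sequence length of each record in a FASTA stream."""
    length = None  # None until the first header line is seen
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def fraction_longer_than(lengths, min_len=10_000):
    """Return the fraction of reads strictly longer than min_len bases."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > min_len) / len(lengths)


# Tiny in-memory example (hypothetical reads, not real data).
demo = io.StringIO(">r1\n" + "A" * 12000 + "\n>r2\n" + "ACGT" * 1000 + "\n")
lens = list(fasta_lengths(demo))
print(lens, fraction_longer_than(lens))
```

To run it on real data, open each file listed in the .fofn and pass the file handle to `fasta_lengths` in place of the in-memory example.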