HingeAssembler / HINGE

Software accompanying "HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution"
http://genome.cshlp.org/content/27/5/747.full.pdf+html?sid=39918b0d-7a7d-4a12-b720-9238834902fd

HINGE process getting killed with large datasets due to single large las file (?) #136

Open alimayy opened 6 years ago

alimayy commented 6 years ago

Hi guys,

Recently we've realised that the way we run the HINGE pipeline causes a process in the chain to get killed for large PacBio and ONT datasets. I think this is related to issue #130, where a single large .las file is read into memory. The dataset and .las file properties are as follows:

PacBio dataset: 1,282,848 reads, 6.4 Gb yield; hinge.las file size: 138 GB

ONT dataset: 756,656 reads, 6.4 Gb yield; hinge.las (gzipped) file size: 104 GB
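One way to confirm that it really is the kernel OOM killer terminating the process (just a sketch, assuming you can read the kernel log on the node where the job ran; may need root):

dmesg | grep -iE 'out of memory|oom-killer|killed process' | tail -n 5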

I'm pasting the log from our wrapper and the error for the PacBio dataset below (the error for the ONT dataset was the same). What should I add to the pipeline to prevent this issue?

Running HINGE for sample EQ0170-E01-c05-1

*********Executing the command**************
hinge correct-head EQ0170E01c051_27667_subreads.fasta EQ0170E01c051_27667_subreads_f.fasta fasta_map.txt
------------------------------------------------

*********Executing the command**************
fasta2DB hinge EQ0170E01c051_27667_subreads_f.fasta
------------------------------------------------

*********Executing the command**************
DBsplit -x500 -s100 hinge
------------------------------------------------

*********Executing the command**************
HPC.daligner -t5 -T32 hinge| csh -v > /dev/null 2>&1
------------------------------------------------

*********Executing the command**************
LAmerge hinge.las hinge*.las
------------------------------------------------

*********Executing the command**************
DASqv -c100 hinge hinge.las > /dev/null 2>&1
------------------------------------------------

*********Executing the command**************
hinge filter --db hinge --las hinge.las -x hinge --config /HINGE/utils/nominal.ini
------------------------------------------------
[2017-11-29 03:10:07.279] [log] [info] Reads filtering
[2017-11-29 03:10:07.279] [log] [info] name of db: hinge, name of .las file hinge.las
[2017-11-29 03:10:07.280] [log] [info] name of fasta: , name of .paf file 
[2017-11-29 03:10:07.280] [log] [info] Parameters passed in 

[filter]
length_threshold = 1000;
quality_threshold = 0.23;
n_iter = 3; // filter iteration
aln_threshold = 1000;
min_cov = 5;
cut_off = 300;
theta = 300;
use_qv = true;

[running]
n_proc = 12;

[draft]
min_cov = 10;
trim = 200;
edge_safe = 100;
tspace = 900;
step = 50;

[consensus]
min_length = 4000;
trim_end = 200;
best_n = 1;
quality_threshold = 0.23;

[layout]
hinge_slack = 1000
min_connected_component_size = 8

[2017-11-29 03:10:07.808] [log] [info] Las files: hinge.las
[2017-11-29 03:10:07.808] [log] [info] # Reads: 1175869
[2017-11-29 03:10:48.786] [log] [info] No debug restrictions.
[2017-11-29 03:10:49.845] [log] [info] use_qv_mask set to true
[2017-11-29 03:10:49.845] [log] [info] use_qv_mask set to true
[2017-11-29 03:10:49.845] [log] [info] number processes set to 12
[2017-11-29 03:10:49.845] [log] [info] LENGTH_THRESHOLD = 1000
[2017-11-29 03:10:49.845] [log] [info] QUALITY_THRESHOLD = 0.23
[2017-11-29 03:10:49.845] [log] [info] N_ITER = 3
[2017-11-29 03:10:49.845] [log] [info] ALN_THRESHOLD = 1000
[2017-11-29 03:10:49.845] [log] [info] MIN_COV = 5
[2017-11-29 03:10:49.845] [log] [info] CUT_OFF = 300
[2017-11-29 03:10:49.845] [log] [info] THETA = 300
[2017-11-29 03:10:49.845] [log] [info] EST_COV = 0
[2017-11-29 03:10:49.845] [log] [info] reso = 40
[2017-11-29 03:10:49.845] [log] [info] use_coverage_mask = true
[2017-11-29 03:10:49.845] [log] [info] COVERAGE_FRACTION = 3
[2017-11-29 03:10:49.845] [log] [info] MIN_REPEAT_ANNOTATION_THRESHOLD = 10
[2017-11-29 03:10:49.845] [log] [info] MAX_REPEAT_ANNOTATION_THRESHOLD = 20
[2017-11-29 03:10:49.845] [log] [info] REPEAT_ANNOTATION_GAP_THRESHOLD = 300
[2017-11-29 03:10:49.845] [log] [info] NO_HINGE_REGION = 500
[2017-11-29 03:10:49.845] [log] [info] HINGE_MIN_SUPPORT = 7
[2017-11-29 03:10:49.845] [log] [info] HINGE_BIN_PILEUP_THRESHOLD = 7
[2017-11-29 03:10:49.845] [log] [info] HINGE_READ_UNBRIDGED_THRESHOLD = 6
[2017-11-29 03:10:49.845] [log] [info] HINGE_BIN_LENGTH = 200
[2017-11-29 03:10:49.845] [log] [info] HINGE_TOLERANCE_LENGTH = 100
[2017-11-29 03:10:50.025] [log] [info] name of las: hinge.las
[2017-11-29 03:10:50.025] [log] [info] Load alignments from hinge.las
[2017-11-29 03:10:50.025] [log] [info] # Alignments: 1527933550
/HINGE/inst/bin/hinge: line 8: 12721 Killed                  Reads_filter "$@"
hinge filter --db hinge --las hinge.las -x hinge --config /HINGE/utils/nominal.ini did not produce a return code of 0, quiting!
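A rough back-of-envelope check suggests the filter step simply cannot hold all the alignments in RAM at once. The per-record cost below is only an assumption (roughly what each record occupies on disk: 138 GB / 1.53 billion alignments ≈ 90 bytes), not a figure from the HINGE source:

# alignment count taken from the log above; bytes per record is an assumed in-memory cost
alignments=1527933550
bytes_per_record=90
echo "$(( alignments * bytes_per_record / 1000000000 )) GB"   # prints 137 GB, on par with the 138 GB hinge.las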
govinda-kamath commented 6 years ago

Would it be possible to split the las file and run HINGE with --mlas ?
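Something along these lines might work (a sketch only: I'm assuming LAsplit from the DALIGNER suite is available, that it reads the merged file from stdin with a '#' placeholder for the block number, and that with --mlas the --las argument takes the common prefix of the per-block files; please check LAsplit's usage and hinge filter --help for the exact forms):

# split the merged overlaps back into one .las per DB block
LAsplit hinge.#.las hinge < hinge.las

# re-run the filter step on the per-block files instead of the single 138 GB file
hinge filter --db hinge --las hinge --mlas -x hinge --config /HINGE/utils/nominal.ini

Alternatively, the LAmerge step in the wrapper could simply be skipped, so that the per-block hinge.<N>.las files left by HPC.daligner are used directly.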

alimayy commented 6 years ago

Hi Govinda, it seems to have worked; I'll let you know after a more thorough evaluation. By the way, I also realised that a huge number of .las files are produced during

HPC.daligner -t5 -T32 hinge| csh -v > /dev/null 2>&1

(see attached)

I have the feeling that this causes the pipeline to take too long. Would you agree?

[screenshot: hinge_las]
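A quick way to quantify this (sketch): count the leftover .las files and their total size. As far as I know the script generated by HPC.daligner normally removes the intermediate block-pair files after each merge level, so a very large count may mean those cleanup stages did not run:

ls hinge*.las | wc -l
du -ch hinge*.las | tail -n 1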