HingeAssembler / HINGE

Software accompanying "HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution"
http://genome.cshlp.org/content/27/5/747.full.pdf+html?sid=39918b0d-7a7d-4a12-b720-9238834902fd

HINGE with PacBio CCS reads #104

Open alimayy opened 7 years ago

alimayy commented 7 years ago

Hi all,

I've been trying to run HINGE using Circular Consensus Sequencing (CCS) reads obtained from a Sequel run. As expected, the number of reads decreases drastically with CCS, but on the other hand the read quality increases. After CCS I end up with 7,344 reads with a total length of 25.8 Mbp (~8x coverage in this case).
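For reference, the coverage figure is just total read bases divided by genome length. The genome length below is a hypothetical value inferred from the stated ~8x, not something given in the thread:

```python
# Rough coverage estimate: total read bases / genome length.
# genome_length is an assumption (~3.2 Mbp) chosen to match the stated ~8x.
total_bases = 25_800_000    # 25.8 Mbp of CCS reads
genome_length = 3_200_000   # hypothetical genome size
coverage = total_bases / genome_length
print(f"{coverage:.1f}x")   # prints 8.1x
```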

I tried lowering all the coverage-related parameters in nominal.ini and running HINGE on these high-quality sequences, but unfortunately it didn't work out. It appears to be failing at a rather early stage, before the draft assembly. I'm attaching the log file. Could you please have a look and let me know?

Many thanks in advance,

Ali

log_2017-03-21_16-08.txt hingehinge_run_id.G00.zip

ilanshom commented 7 years ago

Hi Ali,

The idea of applying HINGE to CCS reads is quite exciting to us. But it is clear that significant changes would have to be made to the parameters, as the current pipeline assumes a significantly higher coverage. Is there any test data that you can share (or that is already public) that we can use to adjust the pipeline to that scenario? That would be helpful.

Thanks,

ilan

alimayy commented 7 years ago

Hi Ilan,

I'm glad to hear that you see some potential in the CCS assembly idea. Unfortunately I cannot share the data, as it belongs to commercial companies (if I come across a client who might not mind sharing their data, I will ask). That's the downside of working with companies when it comes to data freedom... During my PhD I had some frustrations due to the lack of publicly available datasets, so I assure you I'll do my best to provide you with as much as I can.

The good news is that HINGE did work on two other CCS datasets. The results, however, aren't very good, most probably due to low coverage. I'm attaching the parameter file I used for the assemblies.

nominal.ini.txt

I can share some statistics:

Sample I (reference length: 3.36 Mbp)

Subreads:
- nr of subreads: 160,475
- mean subread length: 3,483 bp
- HINGE assembly length: 3,012,797 bp
- nr of contigs: 57
- N50: 64,719 bp

CCS:
- nr of CCS reads: 10,692
- nr of CCS bases: 41,865,011
- mean CCS read length: 3,915 bp
- HINGE assembly length: 1,282,901 bp
- nr of contigs: 55
- N50: 23,722 bp

Sample II (reference length: 3.83 Mbp)

Subreads:
- nr of subreads: 378,143
- mean subread length: 3,745 bp
- HINGE assembly length: 3,791,504 bp
- nr of contigs: 21
- N50: 234,169 bp

CCS:
- nr of CCS reads: 26,761
- nr of CCS bases: 109,496,357
- mean CCS read length: 4,091 bp
- HINGE assembly length: 2,679,816 bp
- nr of contigs: 97
- N50: 30,789 bp
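For readers unfamiliar with the N50 metric used above: it is the largest length L such that contigs (or reads) of length >= L together cover at least half of the total assembly. A minimal sketch, not HINGE's code:

```python
def n50(lengths):
    """N50: largest length L such that pieces of length >= L
    account for at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: total = 100, half = 50; cumulative sums 40, 70 -> N50 = 30
print(n50([40, 30, 20, 10]))  # prints 30
```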

As I find the time I can experiment with more datasets. Let me know whether it can be helpful in any way.

ilanshom commented 7 years ago

Thanks, Ali. We would suggest some changes to account for the shorter reads:

[filter]
length_threshold = 1000; -> 500
aln_threshold = 1000; -> 300
cut_off = 300; -> 100
theta = 300; -> 100

[consensus]
min_length = 4000; -> 500
trim_end = 200; -> 60

[layout]
hinge_slack = 1000 -> 100
min_connected_component_size = 8 -> 3

These changes try to account for the lower coverage depth and for the shorter reads. Without real data to test on, these are essentially guesses, but I would expect them to improve your results.
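Put together as a nominal.ini fragment, the suggested values would look like this (only the keys mentioned above; everything else in your nominal.ini stays unchanged):

```ini
[filter]
length_threshold = 500;
aln_threshold = 300;
cut_off = 100;
theta = 100;

[consensus]
min_length = 500;
trim_end = 60;

[layout]
hinge_slack = 100
min_connected_component_size = 3
```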

By the way, how are you running DAligner on this? Do you use a specific set of parameters for CCS reads?

alimayy commented 7 years ago

Thanks a lot for the tips Ilan. I've done some tests with the new parameter set using i) Sequel CCS reads and ii) low-coverage Sequel data (subreads). There is significant improvement in terms of assembly length with low-coverage subreads. I also see some improvements with the CCS reads. I'll structure the results and share them as soon as I can.

No, I didn't change the parameters of DAligner for CCS reads. Do you have any suggestions?

ilanshom commented 7 years ago

I think you should change the -l parameter to something like 500 for both the CCS reads and the subreads. Then, if you are aligning CCS reads to each other, you also need to adjust the other parameters for the lower error rates. You may want to try -k25 -w5 -h70 -e.95 -s500, which is what Gene Myers suggests for corrected reads. But this may be too stringent, as the error rates of CCS reads may not be that low. If the number of alignments found with the above parameters is too small, you may want to consider something like -k20 -h55 -e.85, which is used for aligning noisy reads to a clean reference. As a rule of thumb, if your error rate is p, you want to set e = 1 - 2p.
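That rule of thumb can be turned into a quick calculation. The intuition: -e is roughly the expected alignment correlation between two reads, and since an alignment sees errors from both reads, a per-read error rate p gives e ≈ 1 - 2p. The error rates below are hypothetical examples, not measured values:

```python
# Rule of thumb from above: DAligner's -e from a per-read error rate p.
# An alignment between two reads accumulates errors from both, so e ≈ 1 - 2p.
def suggested_e(error_rate):
    return 1.0 - 2.0 * error_rate

# Hypothetical per-read error rates:
print(round(suggested_e(0.025), 2))  # 0.95 -> matches -e.95 for corrected reads
print(round(suggested_e(0.075), 2))  # 0.85 -> matches -e.85 for noisy reads vs. a clean reference
```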