Evaluate other assemblers

We've had reasonably good results with IVA, but it has a few problems:

Some samples are very slow, even taking multiple days. Random primers seem to be slow.
We find contigs with a lot of repeats, possibly caused by primer dimers. That may or may not be a problem with the assembler.
The IVA project is not really maintained anymore, so it may be wiser to find another tool.

A brief search for alternatives turned up SPAdes, as mentioned in an article by Sutton, as well as tadpole, mentioned in a discussion forum.

An interesting sample for experimenting on is D62201-HCV_S3 from the 26 Feb 2016.M04401 run. It looks like a mixed HCV infection with two or possibly three strains to assemble.

Edit: To Do List

[x] Investigate where IVA is struggling.
[x] Try Spades.
[x] Try Abyss.
[x] Use ntJoin to scaffold the contigs to a reference genome.
[x] Create a pessimistic mode for IVA for a quick attempt at assembly.
[ ] Improve IVA's pessimistic mode by making it use the filtered reads in each step.
[ ] Update to latest IVA fixes.
[x] Try Velvet.
[x] Optimise Velvet's input parameters / automate the determination of the parameters for each sample.
[x] Try Haploflow.
[ ] Use contig merging and scaffolding to improve Haploflow's assembly results.
[ ] Improve RP results for Haploflow.
[ ] Compare IVA and Haploflow quantitatively.

Here are some more interesting samples to experiment with:

2929-P13-Y02944-7G-NFLHIVDNA_S53 from the 19-Feb-2021.M05995 run assembled three HIV contigs, one of which completely overlaps the main contig. It also assembled in about an hour.
73095A-HCV_S3 from the 15-Jul-2016.M01841 run looks like a mixed infection of HCV 1b and HCV 2b, with over 1000 bases of HCV 1b deleted. Why didn't the HCV 2b assemble into one contig? Assembled in about half an hour.
73060A-HCV_S46 from the 15-Jul-2016.M01841 run looks very similar, but the HCV 2b assembled into one contig, and the HCV 1a assembled into three contigs. Assembled in about half an hour.
One of the worst cases is 73053B-RP-HCV_S2 from the 28-Jul-2016.M04401 run, that produced a ton of contigs, none of which looked useful. There was one contig that was a partial match to HCV 3k. The worst part is that it took over 11 days to assemble. Random primers seem to make assembly harder.
Another random primer sample from the same run is 94019A-RP-HCV_S5. It took over 9 days to assemble, but at least it assembled a full genome of HCV 3a. Unfortunately, it was in four separate contigs: some with gaps and some with small overlaps.
73090A-HCV_S41 from the 15-Jul-2016.M01841 assembles beautifully, but the sample appears to be a chimera of HCV 1a and another genotype.

At first glance, a good candidate to substitute IVA seemed to be Savage (especially combined with Virus-VG). It was created specifically with the challenges of viral assembly in mind, and takes into account that a virus typically consists of several strains or quasispecies. Savage performs a de novo assembly to create contigs and Virus-VG assembles the contigs into complete genomes, estimating the ratio of the different variants. However, Savage had been tried out before here: #442, and it appears to take a longer time to assemble than IVA for certain samples, uses a lot of memory, and requires careful setting of split counts. For future reference, Savage has been moved to Github. The version that had been tried out earlier was 0.4.0, and the current version on Github is 0.4.2. For now, we will be trying other assemblers instead.

Here's another problem sample: 2929-P14-Y02944-10D-NFLHIVDNA_S8 from the 19-Feb-2021.M05995 run. I ran it twice and got different contigs. Both times assembled an unknown contig with low coverage plus some HIV contigs. The first time produced a single HIV with a huge deletion from the start of gag to partway through GP41, and the second time produced two contigs with an overlapping region that combine to match the first time.

Unrepeatable results could be a big problem, so we should try to avoid them. I think we had noticed problems when we first started using IVA, but I added an option to set the random number seed, and I reduced the number of threads to two. It's possible we might have to reduce the threads to one. Please try rerunning the analysis on this sample a few times to see if you can reproduce the problem.

@CBeelen reran sample 2929-P14-Y02944-10D-NFLHIVDNA_S8 541 times and it assembled two contigs every time. I'm going to review the results in Kive to make sure they really came from the exact same version of the pipeline. If so, I'll rerun the singularity container a bunch of times to see if I can reproduce the problem.

We have decided to give SPAdes a try as de novo assembler, as it performs well in benchmark tests and is commonly used. The Github branch of MiCall with SPAdes can be found here.

I experimented with a few different options for SPAdes (--isolate --rnaviral, --careful, --sc, different values for kmer lengths), but the best results turned out to come from running SPAdes in standard mode. Adding the option --cov-cutoff 20 to ignore edges with low coverage seemed to help a lot to ignore low-coverage results, but not in all cases. It might be worth exploring increasing the cutoff even more.

SPAdes has two output files: contigs.fasta and scaffolds.fasta. They recommend to use the scaffolds as output, so that is what I did (they are occasionally longer and fewer than the contigs).

Fragmented assemblies turned out to be a large problem. The output frequently consisted of several contigs that overlapped with identical sequences, but were not merged by SPAdes. Thus, I decided to add a few parts of the IVA pipeline to attempt to merge the contigs that SPAdes outputs. Specifically, I used Assembly._trim_strand_biased_ends() to trim the contigs' ends, Assembly._remove_contained_contigs() to remove contigs that are completely contained in another contig, and Assembly._merge_overlapping_contigs() to merge overlapping contigs. I had to tweak IVA's code a little here to change the requirements for merging: in mummer.run_nucmer(), which finds overlapping contigs, I kept the requirement of 95% sequence identity and lowered the minimum sequence length to 80. I called NucmerHit.to_graph_edge(), which adds edges (overlapping contigs) to the assembly graph, with the same parameters min_overlap_length=80, min_identity=95 instead of the default min_overlap_length=200, min_identity=99. This led to much improved assemblies compared to the results before merging. We now achieve results that are as good as IVA's results in some cases, but not all.

Interestingly, many of the contigs that SPAdes does not merge but that do get merged by the post-processing have an overlap of exactly 127 bases, the largest kmer size that SPAdes works with.

Specifically, the "nicely assembling" samples I investigated were:

2929-P7-Y02944-6B-NFLHIVDNA_S82 from 19-Feb-2021.M05995: IVA assembled 1 contig. SPAdes plus merge assembled 3 contigs (but the genome was completely covered by one).
2929-P9-Y02944-3C-NFLHIVDNA_S49 from 19-Feb-2021.M05995: IVA assembled 1 contig, and so did SPAdes plus merge.
2929-P14-Y02944-4E-NFLHIVDNA_S66 from 19-Feb-2021.M05995: IVA assembled 2 contigs. SPAdes plus merge assembled 7 overlapping contigs (interesting to look at further!).
73090A-HCV_S41 from 15-Jul-2016.M01841: IVA assembled 1 contig. SPAdes plus merge assembled 6 contigs, but the genome was in one contig. Without merging, SPAdes assembled 67 contigs.
1693NFLSGAP12-K20-NFLHIVDNA_S3 from 19-Feb-2021.M01841: IVA assembled one contig, and so did SPAdes with merge (and even without merge).

And the "difficult assembly" samples were:

D62201-HCV_S3 from 26 Feb 2016.M04401: IVA assembled 30 contigs. SPAdes plus merge assembled 510 contigs (very fragmented assembly).
2929-P13-Y02944-7G-NFLHIVDNA_S53 from 19-Feb-2021.M05995: IVA assembled 4 contigs. SPAdes assembled 348 contigs without merging and 24 contigs with merging.
73095A-HCV_S3 from 15-Jul-2016.M01841: IVA assembled 5 contigs. SPAdes plus merge assembled 234 contigs. There was a lot of low coverage stuff.
73060A-HCV_S46 from 15-Jul-2016.M01841: IVA assembled 8 contigs. SPAdes assembled 134 contigs without merging and 11 contigs with merging.
73049AWG-HCV_S7 from 15-Jul-2016.M01841: IVA assembled 4 contigs, three of which cover the HCV 3a genome. SPAdes plus merge assembled 114 contigs, but one of them spanned almost the whole HCV 3a genome. The others are all very small or partial matches to other HCV types.
73076BWG-HCV_S71 from 15-Jul-2016.M01841: IVA assembled 4 contigs. SPAdes plus merge assembled 44 contigs, but many of them were partial matches and the main genome contigs looked similar to the IVA result.
73049BWG-HCV_S45 from 15-Jul-2016.M01841: IVA assembled 6 contigs, 3 of which overlap. SPAdes plus merge assembled 183 contigs, but merged the contigs that overlap in IVA. It looks like a misassembly though, interesting to look into.
Random primer samples 73053B-RP-HCV_S2 and 94019A-RP-HCV_S5 from 28-Jul-2016.M04401: they caused even more problems in SPAdes than in IVA. A ton of contigs got assembled (1523 for 73053B-RP-HCV_S2 and 2204 for 94019A-RP-HCV_S5). Merging the contigs only led to a slight reduction in their number. This causes sam2aln to crash because it opens too many files at once.

In conclusion, it looks like IVA is still performing better than SPAdes in the current configuration. There are two options worth exploring next, apart from SPAdes:

Tweaking IVA: by decreasing the required sequence identity and length to merge contigs, more overlapping contigs could get merged in IVA as well. Furthermore, I found two main reasons that IVA struggles in some cases: first, it tries to extend all old contigs in every seed extension iteration. This is very costly, especially for RP samples that produce a ton of contigs. Second, it gets stuck when there are two variants with similar proportions and cannot extend the seed past that mutation (but keeps trying). If we want to tweak IVA, those are two mechanisms that might be worth trying to modify.
Trying another assembler: Tadpole doesn't seem worth checking out at this point because it performs worse than SPAdes in tests, but ABySS might be worth giving a try. Also, the metagenomic pipeline of SPAdes (metaSPAdes) might perform well for random primers.

The next assembler I have given a try is Abyss.

Abyss works very well on the HIV samples I have tested (see above), with comparable results to IVA: long contigs, mostly one or two, spanning the whole genome.

However, for the HCV samples, it had more trouble. It assembled a large amount of contigs (>10,000) in its standard mode, but fewer in bloom filter mode. The higher k (kmer size) is set, the fewer contigs were assembled, and the longer they got. I got down to a few hundreds to tens of contigs with k as high as 235. However, this also led to parts of the genome getting lost, especially if they have lower coverage. When trimming and joining the contigs using parts of the IVA pipeline as described above, the number of contigs and their length improve, but I have not been able to achieve better results than IVA. The best parameters so far have been k between 180 and 200, and kc at 3 or 4, but these may have to be adjusted downwards to prevent the loss of parts of the genome. When k is set too low, the contig joining fails due to an assertion failing. For the random primer samples, a few hundred contigs got assembled with parameters as specified above, but the joining failed.

Next, I will pursue two different strategies in parallel, see To Do list in the first comment: using ntJoin and creating a pessimistic mode for IVA.

Using ntJoin: the aim is to scaffold the resulting contigs from Abyss (or IVA/Spades) to a reference genome, hopefully resulting in a more contiguous assembly. We need to be careful about chimeric and mixed infections in this case.
Creating a pessimistic mode for IVA, in particular for random primer samples: This mode should perform a quick attempt at an assembly, so we can then discard anything that does not map to the reference genomes. Afterwards, we could run regular IVA to create a full assembly of the remaining reads. This may avoid the problem of contigs that don't map to any reference genome taking a lot of time to assemble, but at the same time it may avoid throwing away parts of the genome to be investigated that do not match the reference (which would happen when mapping the raw reads to the reference genome).

While creating a pessimistic mode for IVA, I happened to create a microtest example that showcases or triggers IVA's non-deterministic behaviour (this is 2240-HIVHCV-mixtures_S32 on the Abyss branch of Micall).

I wanted to create a sample that would cause IVA to struggle - specifically, the sample consists a mixture of reads from two HIV sequences and one HCV sequence (the HIV sequences don't overlap). Half of the reads for each sequence have a mutation at a specific site. IVA then struggles to assemble the contigs beyond this mutation, because the two variants are present on equal levels.

When testing the IVA output for this sample, I noticed that it is non-deterministic: when re-running the sample, different numbers of contigs get assembled. The contigs may break at the mutation site, or there can be a misassembly, or a contig may be (partly) missing. This behaviour only shows up when running IVA with more than one thread - I ran 10 repeats of the analysis and got the same result each time with one thread, while 10 repeats with 2 threads gave me 4 different results. Interestingly, IVA behaves more deterministically when assembling from the interleaved files rather than the non-interleaved files (the trimmed reads, or the reads directly from the sample): when running 10 repeats with the interleaved file, I got basically the same result even with two threads, but with slightly varying read depths for each result.

In a strategy discussion today, @CBeelen and I agreed on trying out a technique that will assemble one contig at a time, use BLAST search to decide if the contig matches any of the reference sequences, and discard any unrecognized contigs, along with any reads that map to them. The hope is that really noisy samples like random primers will stop spending so much effort trying to extend the contigs we don't care about. It's possible we could then put all the contigs back into the mix when we finally map all the reads onto the contigs.

As a separate idea, I looked for other de novo assemblers and found Velvet. It sounds like some of the other tools we tried, based on de Bruijn graphs. I added an item for trying Velvet to this issue's task list.

I just discovered a fix in IVA that isn't in our fork. It now supports samtools versions 1.10 and above. The fix that we originally forked for has been merged upstream, so we can probably switch back to the upstream repository. Wait until you've finished experimenting with changes to IVA, and then either merge the upstream fixes into our fork, or switch back to the upstream repository.

Since the last update, I have tried out a few different things:

First, I tried to scaffold the assembly results of Abyss using ntJoin. However, this was not very successful: ntJoin was not able to improve the assembly by much. Abyss often outputs multiple contigs covering the same genome region, probably because there was no clear resolution for mixtures / variants. ntJoin was unable to resolve most of these issues, probably because it relies on the same method of assembly (de Bruijn graphs) as Abyss itself.

Next, I created a pessimistic mode for IVA which only tries to extend one contain at a time. The idea was that this would prevent IVA from trying to extend the old contigs again and again, often without success in case it hit a mixture of approximately 50:50 ratio. This does work, but it does not save a lot of time. I found out (using a runtime analysis) that IVA spends most of its time mapping all the reads to the existing contig(s) - it uses a filtered set of reads that don't map to any of the old contigs most of the time, but maps all of the reads to the contig(s) every once in a while. Especially for the random primer samples, this mapping step takes a long time. Thus, I decided to call IVA iteratively and have it assemble one contain each time it is called, using a subset of the reads - only those that do not map to any of the previously assembled contigs. First of all, I separated the reads within Micall using the function map_to_contigs from remap.py. It is also possible to keep all the reads that do not map to a contig yet, and those that map to the useful contigs (those that map to the reference genomes). This helps to take out all of the uninteresting reads for the random primer samples. However, IVA already filters the reads for us, so I next used IVA's filtered reads instead, by preventing them from being deleted after IVA is finished with the assembly. This works decently well. The only problem is that IVA - being iteratively called to assemble one contain at a time - does not remember which seeds it has tried unsuccessfully so far, and tries unsuccessful seeds again and again. At the current state, the iterative pessimistic version of Micall does not assemble any quicker than the traditional IVA. The next step would be to make IVA itself iterative and pessimistic by making it look at only one contig at a time (like before) and having it only use the filtered reads in each step. Like this, we can use the existing functionality of IVA where it remembers which seeds have been tried before.

In the mean time, however, I also tried out Velvet, which happily surprised us by being very successful at assembling our samples. The main advantages are that it assembles very quickly (on the order of a few seconds to minutes) and is well-documented, with good explanations for the parameters and an extensive manual. I have performed a few parameter sweeps to determine the best parameters, which are currently: insert length 70, hash length 31, expected coverage 7000, coverage cutoff 70, and minimum contig length 200. The most sensitive parameters are the coverage cutoff, the hash length, and the estimated insert length. For many of the samples named above, Velvet performs as well as or better than IVA in terms of contig length and genome coverage. Some samples that show nice results with the current parameters are 2929-P14, 73049AWG, 73049BWG, 73060A, 73076BWG, and 73053-RP (full sample names and run info, see above). Decent results are achieved for 1693NFLSGAP12, 2929-P7, 73095A, and 94019A-RP. Unfortunately, there are also some samples where Velvet struggles - these are 2929-P9, 2929-P13, D62201, 94047A-RP, and 94075A-RP (from the same random primer run as the original RP samples). We were able to identify two main problems: first, Velvet can lose parts of the genome if the contigs do not meet the cutoff length (e.g. for 2929-P13). Second, it sometimes creates small deletions that are not reflected in the reads (e.g. for 94019A-RP, as can be verified by looking at the NS5a.details plot). So, as a next step, I will be investigating how exactly Velvet works and analysing the output a little more closely, to hopefully identify why it is struggling with certain samples and how we can improve the assembly results. In particular, I will be looking into the coverage cutoff and the insert length to see whether I can determine them a little more rigorously based on the samples. I also want to find out whether these parameters could be automatically determined for each sample, allowing for better overall results than a "one size fits all" approach of using the same parameters for each sample. For now, the pessimistic IVA implementation will be on hold while I am attempting to make Velvet work better, which seems to be the most promising assembler at this point.

I saw mention of another assembler that sounds interesting: Haploflow.

I am currently experimenting with Haploflow, which is similarly fast to Velvet and produces promising results. Here are some interesting resources for further reading on Haploflow: the publication and the Github repo containing the scripts for the supplementary material. The latter also contains a script for scaffolding the assembly given a reference sequence (which could also be the consensus sequence of the assembly), which I will try with our data as well.

Trying to automate Velvet's input parameters was not very successful: the idea was to attempt an initial assembly and determine the coverage cutoff as well as the expected coverage based on the coverage histogram of the resulting contigs, weighted by their length. However, the coverage histograms did not really show a clear dip in the coverage between short, low-coverage and long, high-coverage contigs (which is where the cutoff should be placed). Therefore, it was impossible to determine the cutoff automatically from coverage histograms. Velvet did produce quite nice assembly results most of the time and is definitely worth keeping in mind, especially because of its speed. For now though, I decided to focus on Haploflow. It is quite promising, because it is geared towards samples that contain multiple strains, and uses a combination of a de Bruin graph and a flow algorithm to resolve the ambiguities that result from the presence of multiple strains. This is interesting for our samples, because we know that other de Bruin graph-based assemblers have struggled with the presence of multiple strains and IVA can struggle if strains are present at roughly equal proportions. The initial results with default parameters look quite promising. Haploflow usually assembles a couple of contigs, many of which cover the same part of the genome, but represent different strains. When analysing the contigs in details, these strains sometimes show larger differences (such as insertions or deletions), and sometimes they only differ in one base. The current best parameters for the assembly are: --long 1 ( so Haploflow will try to assemble contigs that are as long as possible) and --strict 3 (regulating the strictness of error corrections). Setting the error rate --error-rate differently leads to better results for some of the samples, but worse results for others - so for now I am leaving it at its default value of 2%. Changing the kmer size --k also did not lead to a clear improvement, so it is staying at its default value of 41 for now. Using the option --debug 1 produces more output files, most notably graphs in the .dot format. Those can be analysed, leading to interesting insights into how the algorithm removed paths (= contigs) from the assembly. In some cases the graph is way too large to be instructive, though. There is also an option --log, to write the log to a file, and an option --dump-file to dump out the entire de Bruin graph (I have not tried the latter, and it can produce huge files so proceed with caution). Currently, I am proceeding with Haploflow and using IVA's functions for merging contigs that overlap or are contained within another. Furthermore, we want to try scaffolding the assembly, possibly filling gaps in the contigs with the reference genome. For that, I want to give ntJoin another try - I had previously tried it on Abyss's output, which was not very promising because the scaffolding was not able to resolve multiple contigs that covered the same part of the genome. However, a combination of merging and scaffolding contigs might work well. Previously, I had (unsuccessfully so far) tried to use Haploflow's own scaffolding script from the supplementary material. While trying that, I used Quast, a tool to asses the quality of genome assemblies, optionally based on a known reference. This tool, created by the team behind SPAdes, is quite useful and might be interesting for further (more quantitative) investigations of other assemblers, or to create a quality score for Micall's assembly result.

Also, for future reference: another potentially interesting assembler is Megahit, which appears to be very fast and does not require much memory. It might also struggle with multiple strains though, as its strategy is to terminate a contig at an ambiguous point. It might also be worth taking a step back to Spades at some point: when looking back through my notes, I noticed that I had noted down that the best results were achieved with a coverage cutoff of 20, and that investigating lower coverage cutoffs might be worthwhile. With my experience from Velvet and a better understanding of what the coverage cutoff actually means, this does indeed seem promising: for Velvet, the best results were achieved with a coverage cutoff of 70, and higher coverage cutoffs generally helped to prevent fragmented assemblies.

Using Haploflow in combination with IVA's functions for merging contigs post-assembly yields some pretty good results. For most non-RP samples, the results look similarly good as IVA's results, except for being sometimes distributed in multiple contigs.

I tried scaffolding the assembly and picked RagTag. It has different modes for scaffolding (i.e. ordering the contigs and filling gaps with "N") and patching (i.e. ordering the contigs and filling gaps with reference genome). However, it was not straightforward to get it to actually improve the assemblies, and it seems like it would require some more careful setting of the different parameters to get it to do something reliably. I am leaving it on the To Do list for now, because I still think that it would be interesting to invest some more time in.

For the random primer samples, I implemented a mode which runs a first attempt at assembly and then separates the reads according to their usefulness. Reads that mapped to contigs that are a blast match to the references are considered useful, while reads that map to contigs that were no blast matches to the references are discarded. The reads that did not map to anything are kept as useful as well, because they might be necessary to fill gaps between contigs. Then, a second attempt at assembly is performed with only the useful reads. This works pretty well in terms of reducing the number of assembled contigs, and helps Haploflow to at least assemble contigs that map to the reference - before using this method, it had sometimes missed large bits of the genome altogether. However, the assembly can still turn out pretty fragmented and sometimes has small gaps between contigs. Also, there are still contigs being assembled that don't match the reference - so there must be some useless reads left over, probably among the unmapped reads. To solve this, I will try to iterate through the read-filtering and assembling procedure at least once more. If that does not help, it might be nice to scaffold the contigs instead, so in that case I will turn my attention back to RagTag.

Other than that, we now want to do a more quantitative comparison of IVA and Haploflow. For that, I will do a more quantitative runtime analysis, and a quantitative comparison of the assembly results. I will look into using Quast to compare the quality of the assemblies.

cfe-lab / MiCall

Evaluate other assemblers #665