Closed — sjin09 closed this issue 7 years ago
Dear Jason,
In addition, I wanted to ask whether, during pread formation, it is possible to retain as much information as possible from long, contiguous subreads. If we assume that the BAC clones are on average 100,000 bp, then a subread greater than 40,000 bp represents almost half of a BAC clone in a single read. However, there is not enough coverage for these long subreads, and they are fragmented during pread generation. Would reducing --min-cov to 1 be a sufficient solution?
Best, Jin
@sjin09
While FALCON follows general assembly principles and can be tuned to assemble BACs, this is not a common use case for it.
Why?
For assembling large genomes, we typically use coverage measurements to avoid excessive computation on repeat elements. Such a strategy only works if the assembly coverage is uniform. In our experience, it is hard to obtain a perfect mixture of different BACs, so for BAC assembly you have to turn off most coverage filters to avoid filtering out useful data.
The second issue is the one you mention in your second post: the BAC-size-to-pread-length ratio is high. Typically, you should be able to assemble a BAC from 3 to 10 reads, which is also not the typical case for large genome assemblies. For example, FALCON's standard contig output leaves out the first read, since that read may be used by other contigs. This is not ideal for BAC assembly, where the first read can be half of the BAC, though it can be fixed by simple post-processing. (And yes, if you only have the longest reads at low coverage on a BAC, --min-cov 1 is good.)
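For reference, the pread-generation coverage threshold lives in the `falcon_sense_option` line of FALCON's `fc_run.cfg`; a minimal sketch of the relevant fragment (the values shown are illustrative, not a tested recommendation):

```ini
; fc_run.cfg (fragment) -- illustrative values only
[General]
; pread (error-corrected read) generation: accept positions covered by as
; little as a single raw read so long subreads are not fragmented
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 1 --max_n_read 200

; overlap filtering for the assembly graph; coverage filters are relaxed
; because coverage across a BAC pool is far from uniform
overlap_filtering_setting = --max_diff 100 --max_cov 300 --min_cov 1 --bestn 10
```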
@rhallPB has some experience with BAC assembly. (I have some too, but I had to do special tuning.) I think he can give you more insight and some useful ideas.
In my experience, while it is possible to parameterize Falcon to assemble BAC pools, it is much easier to use HGAP.3, whose parameters are much more robust for this kind of assembly. What level of pooling are you doing, and what is the library procedure? If the BAC vector is going to be easily spanned by the reads, it is worth leaving it in the data and validating assemblies by looking for circular structures in the graph.
Dear @pb-jchin and @rhallPB
I would first like to thank you for the swift response.
I wanted to use FALCON because I found it to be faster than canu, although canu consistently gave better results in its automated mode. I also liked that I could visualize FALCON's output, whereas canu's GFA output cannot be visualized effectively with Bandage. The BAC clones are mainly derived from chromosomes X and Y, which I expect to be highly repetitive and to require multiple attempts to obtain the best results. This is where I thought FALCON's speed would be immensely useful.
@rhallPB: in a single pool there are approximately 48 BAC clones, and the clones have been sequenced to variable coverage (from 70X to 200X) after E. coli and vector sequence removal in each pool. The BAC clones were sequenced with P6-C4 chemistry on the PacBio RS II. I thought HGAP would be a good choice, as Eichler's group has consistently used it for BAC assembly, but I was concerned that BLASR's read-to-read alignment would be too slow for multiple attempts.
We observed that a single subread sometimes does contain vector sequence that is fully spanned by the subread and flanked by BAC insert sequence; we thought we would use this for BAC assembly assessment. We additionally have BAC end sequences to align to the contigs for assembly assessment. We have found that, since the coverage of the vector sequence is higher than that of the BAC inserts, the vector sequence acts as a repeat and creates problems for the assembly, as noted here: https://github.com/KiddLab/pacbio-vector-trim.
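A subread that fully spans the vector can be recognized from its BLASR hit coordinates: the vector match must sit strictly inside the subread, flanked by presumed insert sequence on both sides. A minimal sketch of that check (the flank threshold and coordinate layout are assumptions, not BLASR output conventions):

```python
def spans_vector(hit_start, hit_end, subread_len, min_flank=500):
    """True if the vector hit is internal to the subread, i.e. flanked
    by at least min_flank bp of (presumed) BAC insert on both sides."""
    return hit_start >= min_flank and subread_len - hit_end >= min_flank

# Example: a 30 kb subread with an internal vector hit at 12,000-19,500,
# versus one whose vector hit starts right at the read end
print(spans_vector(12000, 19500, 30000))  # internal hit -> spans the vector
print(spans_vector(100, 7600, 30000))     # hit at the read start -> does not
```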
I will try the assembly again with HGAP.3 and compare the results with those of canu. I eagerly await further advice.
Best, Jin
DBstats results from the subreads.
Statistics for all reads of length 500 bases or more
110,046 reads out of 160,758 ( 68.5%)
945,177,660 base pairs out of 1,220,841,447 ( 77.4%)
8,588 average read length
5,129 standard deviation
Base composition: 0.296(A) 0.212(C) 0.212(G) 0.281(T)
Distribution of Read Lengths (Bin size = 1,000)
Bin: Count % Reads % Bases Average
48,000: 1 0.0 0.0 48637
47,000: 1 0.0 0.0 47820
46,000: 0 0.0 0.0 47820
45,000: 3 0.0 0.0 46339
44,000: 1 0.0 0.0 45980
43,000: 5 0.0 0.1 44816
42,000: 1 0.0 0.1 44639
41,000: 4 0.0 0.1 43857
40,000: 10 0.0 0.1 42594
39,000: 7 0.0 0.1 41924
38,000: 7 0.0 0.2 41327
37,000: 8 0.0 0.2 40697
36,000: 16 0.1 0.3 39660
35,000: 13 0.1 0.3 38932
34,000: 14 0.1 0.4 38249
33,000: 23 0.1 0.4 37292
32,000: 30 0.1 0.6 36290
31,000: 35 0.2 0.7 35339
30,000: 37 0.2 0.8 34507
29,000: 46 0.2 0.9 33623
28,000: 57 0.3 1.1 32699
27,000: 67 0.4 1.3 31795
26,000: 94 0.4 1.6 30751
25,000: 119 0.5 1.9 29698
24,000: 143 0.7 2.3 28687
23,000: 185 0.8 2.7 27648
22,000: 258 1.1 3.3 26518
21,000: 345 1.4 4.1 25376
20,000: 505 1.8 5.2 24161
19,000: 695 2.5 6.6 22966
18,000: 970 3.4 8.5 21786
17,000: 1,474 4.7 11.3 20555
16,000: 2,205 6.7 15.1 19333
15,000: 3,223 9.6 20.4 18159
14,000: 4,656 13.9 27.5 17032
13,000: 6,283 19.6 36.5 15993
12,000: 7,847 26.7 46.8 15059
11,000: 7,819 33.8 56.3 14312
10,000: 6,940 40.1 64.1 13714
9,000: 6,405 45.9 70.5 13181
8,000: 6,312 51.7 76.2 12661
7,000: 6,433 57.5 81.3 12136
6,000: 6,943 63.8 86.1 11579
5,000: 7,519 70.7 90.4 10991
4,000: 7,534 77.5 94.0 10417
3,000: 7,306 84.1 96.7 9872
2,000: 7,308 90.8 98.7 9332
1,000: 6,891 97.0 99.7 8827
0: 3,248 100.0 100.0 8588
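Summary statistics like the ones above can be cross-checked from the raw read lengths; a small self-contained sketch computing read count, total bases, mean length, and N50:

```python
def length_stats(lengths):
    """Return (count, total_bases, mean, N50) for a list of read lengths.
    N50 is the length L such that reads of length >= L hold half the bases."""
    total = sum(lengths)
    n50 = 0
    running = 0
    for read_len in sorted(lengths, reverse=True):
        running += read_len
        if running * 2 >= total:
            n50 = read_len
            break
    return len(lengths), total, total / len(lengths), n50

print(length_stats([10000, 8000, 6000, 4000, 2000]))  # (5, 30000, 6000.0, 8000)
```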
The HGAP.3 job won't take too long on a reasonable computer. To convert the output of HGAP.3 into a graph file format that can be visualized in Gephi, see https://gist.github.com/rhallPB/2d962e700d83270b0109.
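An overlap edge list can also be turned into a Gephi-readable file with the standard library alone; a minimal GEXF-writing sketch (the simple `(read_a, read_b)` edge-list input is an assumption for illustration, not HGAP's actual output format):

```python
import xml.etree.ElementTree as ET

def overlaps_to_gexf(edges, path):
    """Write read-overlap edges [(read_a, read_b), ...] as a GEXF file
    that Gephi can open directly."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")
    # one node per distinct read name, one edge per overlap
    for name in sorted({n for e in edges for n in e}):
        ET.SubElement(nodes, "node", id=name, label=name)
    for i, (a, b) in enumerate(edges):
        ET.SubElement(edges_el, "edge", id=str(i), source=a, target=b)
    ET.ElementTree(gexf).write(path, xml_declaration=True, encoding="utf-8")

overlaps_to_gexf([("read1", "read2"), ("read2", "read3")], "overlaps.gexf")
```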
Dear @rhallPB
I have successfully run the HGAP3 de novo assembly without the Sun Grid Engine. However, when the Sun Grid Engine was used, the assembly failed with memory issues in runCaToUnitig.log.
If possible, I wanted to test whether HGAP4, with its improvements, can be run on the Sun Grid Engine. However, the whitelist tutorial does not seem directly applicable to the HGAP4 pipeline, where the preset.xml has to be changed instead of the settings.xml. The preset.xml also seems to be very different from the settings.xml that I modified to run HGAP3.
I would appreciate any directions for producing a whitelist for BAC de novo assembly using HGAP4.
p.s.: it seems that XML files cannot be posted directly.
At this time we do not have a workflow for whitelisting in HGAP.4; preset.xml and settings.xml are completely different parameter file formats. HGAP.4 is an implementation of Falcon using the pbsmrtpipe workflow engine, so you can get the same results by using Falcon (which supports SGE) and simply filtering the reads in the input FASTA files. But as stated previously, Falcon likely won't give results as good as HGAP.3's without a lot of testing and parameter optimization.
Dear @rhallPB
Thank you for the swift response.
I have been tuning HGAP3 under smrtanalysis_2.3.0.1409 for BAC de novo assembly, adjusting computeLengthCutoff (true/false), minLongReadLength (4,500-10,000 bp), and minSubReadLength (3,000-4,500 bp). However, the results were not as satisfactory as the BAC de novo assembly with CANU: the CANU assembly had an average contig length of 90 kb, whereas the HGAP assembly produced contigs with an average length of 50 kb.
What other parameter optimizations do you suggest to achieve better BAC de novo assemblies? The coverage for the BACs is ~50X on average.
Best, Jin
Dear @rhallPB
In addition, I wanted to ask whether whitelisting is possible for QUIVER. If HGAP does not produce the desired results, I would like to run QUIVER on the contigs from CANU without the subreads derived from E. coli genomic DNA.
Regards, Jin
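Whitelisting ahead of polishing can also be approximated outside the pipeline by filtering the input reads before mapping; a minimal FASTA-filtering sketch (file names and the whitelist source are placeholders):

```python
def filter_fasta(in_path, out_path, whitelist):
    """Copy only the FASTA records whose ID (first token after '>') is in
    the whitelist, e.g. reads not flagged as E. coli-derived."""
    keep = False
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                keep = line[1:].split()[0] in whitelist
            if keep:
                fout.write(line)
```

Given a whitelist of non-contaminant read names, the filtered FASTA can then be fed to the alignment step as usual.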
To optimize this kind of assembly, where the coverage of individual BACs is variable and the repeat content complex, I would visualize the overlap graph in Gephi (see the post above) to see whether the issue is fragmentation in the graph or complex overlaps (knots, due to repeats not being spanned). Fragmentation in the graph would mean you need parameters that allow more coverage into the overlap graph layout (lower seed read cutoff, lower subread cutoff, lower coverage required for preassembled read generation); complex overlaps would suggest you need to increase the seed read length, subread cutoff, etc. Given an unevenly pooled sample, these parameters may differ between BACs, so it may be necessary to use different parameters for different BACs in the pool. Whitelisting for quiver is possible; it is done in exactly the same way as whitelisting for HGAP.3. But unless the BACs contain sequence with significant homology to E. coli, I wouldn't worry about it: no E. coli data is going to map to your BACs if you use reasonable alignment parameters.
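Trying several parameter combinations per pool is easy to script; a sketch enumerating candidate settings (the parameter names mirror the ones discussed in this thread, and the values are illustrative):

```python
from itertools import product

def parameter_grid(seed_cutoffs, subread_cutoffs, min_cor_covs):
    """Enumerate one candidate settings dict per parameter combination."""
    return [
        {"minLongReadLength": seed, "minSubReadLength": sub, "minCorCov": cov}
        for seed, sub, cov in product(seed_cutoffs, subread_cutoffs, min_cor_covs)
    ]

grid = parameter_grid([4500, 6000, 10000], [3000, 4500], [4, 6])
print(len(grid))  # 3 * 2 * 2 = 12 combinations to try
```

Each dict could then be substituted into a settings.xml template, one assembly run per combination.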
Cheers @rhallPB.
I will try the parameter optimization as you have mentioned.
Dear @rhallPB
Would it be correct to assume that "lower coverage required for preassembled read generation" is equivalent to lowering the value of param name="xCoverage" label="Target Coverage" in the settings.xml? Or were you referring to raising the value of param name="minCorCov" label="Minimum Coverage for Correction"?
In addition, I am trying to preserve the ends of the BAC insert as much as possible, since I would like to align BAC end sequences back to the BAC contigs to identify each BAC's origin. The ends of the BAC will probably have lower coverage than other parts of the contigs. Is there a parameter I should be particularly careful with for this?
minCorCov
Are you trimming out the vector before assembling? If not, there is no reason the ends would have less coverage. If you are, then simply don't trim out the full vector sequence until after assembly, i.e. leave a few hundred base pairs of vector on each side, so that the BAC end is never actually the end of the sequence.
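The "leave a few hundred bp of vector" advice amounts to shrinking the vector interval before masking it; a self-contained sketch (coordinates are 0-based half-open, and the flank size is illustrative):

```python
def mask_vector(seq, vec_start, vec_end, flank=300):
    """Replace the vector interior with 'N', keeping `flank` bp of vector
    on each side so the BAC ends are never the ends of the sequence."""
    inner_start = vec_start + flank
    inner_end = vec_end - flank
    if inner_end <= inner_start:  # hit too short to shrink: leave it intact
        return seq
    return seq[:inner_start] + "N" * (inner_end - inner_start) + seq[inner_end:]

masked = mask_vector("A" * 50, 10, 40, flank=5)
print(masked.count("N"))  # (40 - 10) - 2*5 = 20 masked bases
```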
Dear @rhallPB
I have been able to obtain a better assembly by reducing the seed read length and minimum subread length cutoffs, as you recommended. I am now optimizing the assembly to better represent the ends of the BAC inserts.
I aligned the .bax.h5 files to the vector sequence using BLASR, and then masked the vector-matching portions of the subreads by modifying the high-quality sections in the region table of the .bax.h5. I will try unmasking some of the sections that matched the vector sequence.
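The region-table edit described above boils down to interval subtraction: splitting each high-quality region around the vector hits. A self-contained sketch of just that arithmetic (independent of the actual .bax.h5 region-table layout, which is not shown here):

```python
def subtract_intervals(region, hits):
    """Split a (start, end) HQ region into sub-regions that exclude the
    given vector-hit intervals. All coordinates are half-open."""
    start, end = region
    pieces = []
    for h_start, h_end in sorted(hits):
        if h_start > start:
            pieces.append((start, min(h_start, end)))
        start = max(start, h_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces

# A 1 kb HQ region with two internal vector hits splits into three pieces
print(subtract_intervals((0, 1000), [(200, 300), (700, 800)]))
# [(0, 200), (300, 700), (800, 1000)]
```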
Best, Jin
Thank you for all the help. I think I have the BAC de novo assembly working reasonably well with the parameter optimizations you suggested. I am closing the issue.
Dear Jason,
I am using your FALCON de novo assembly algorithm for BAC clone de novo assembly. The BAC clones have been pooled and sequenced with 2 SMRT cells to obtain an effective coverage of ~70X after removal of E. coli and vector DNA sequences. There are many BAC clone pools that we wish to assemble and scaffold using BioNano genome maps. In addition, each pool of BAC clones will have different repetitive content and will therefore, we think, require slight adjustments for each assembly attempt.
I am wondering whether you have any recommended parameter adjustments to optimize the assembly of BAC clones. I am trying the assembly with various parameters and visualizing the results with Bandage to observe the effect of each parameter.
Any advice would be highly appreciated.
Best, Jin