PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries
https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries
Other
205 stars 103 forks source link

BAC assembly with FALCON #446

Closed sjin09 closed 7 years ago

sjin09 commented 7 years ago

Dear Jason,

I am using your FALCON de novo assembly algorithm towards BAC clone de novo assembly. The BAC clones have been pooled and have been sequenced with 2 SMRT cells to obtain an effective coverage of ~70X after removal of E.coli and vector DNA sequences. There are many BAC clone pool that we wish to assemble and scaffold using BioNano Genome Map. In addition, each pool of BAC clones will have different repetitive content and therefore, we think would require slight adjustments for each assembly attempt.

I am wondering if you have any recommended parameter adjustments to optimize the assembly of BAC clones. I am trying the assembly with various parameters and visualizing the assembly with BANDAGE to observe the effect of each parameter.

Any advice would be highly appreciated.

Best, Jin

sjin09 commented 7 years ago

Dear Jason,

In addition, I also wanted to ask whether during the pread formation, it is possible to retain the information from contiguous and long subreads as much as possible. If we assume that the BAC clones are on average 100,000 bp and if there is a subread that is greater than 40,000 bp, then almost half of the BAC clone is represented in a single subread. However, there are not enough coverage for these long subreads and they are fragmented during the pread generation. Would reducing the --min-cov 1 to 1 be a sufficient solution?

Best, Jin

pb-jchin commented 7 years ago

@sjin09

while FALCON follows the general principle for assembly and can be tuned to assembly BACs, it is not the common use for it.

Why?

For assembling large genome, we typically use some coverage measurement to avoid excessive computation for repeat elements in a genome. Such strategy will only work if the genome assembly coverage is uniform. In our experience, to have perfect mixtures of different BAC is hard. Namely, you have to tune off most coverage filters to avoid filter out useful stuff for BAC assembly.

2nd thing is the one you mention in the 2nd post. The BAC size/pread length ratio is high. Namely, typically, you should be able to assemble the BAC with 3 to 10 reads. This is also not the "typical" case for large genome assemblies. For example, the standard output of contigs in FALCON left out the first reads as it is possible the first read is used by other contigs. This will not be ideal for BAC assembly as the first read can be half of the BAC. This can be fixed by simple post-processing though. (And yes, if you have only the longest reads with low coverages on a BAC, --min-cov 1 is good.)

@rhallPB has some experience on BAC assembly. (I have some too but I have to do special tuning.) I think he might be able to give you more insight and some useful ideas.

rhallPB commented 7 years ago

From my experience, while it is possible to parameterize Falcon to assemble BAC pools, it is much easier to use HGAP.3. In HGAP.3 the parameters are much more robust to this kind of assembly. What level of pooling are you doing, and what is the library procedure? If the BAC vector is going to be easily spanned by the reads, it is worth leaving it in the data, and validating assemblies by looking for circular structures in the graph.

sjin09 commented 7 years ago

Dear @pb-jchin and @rhallPB

I would first like to thank you for the swift response.

I wanted to use FALCON as I found it to be faster than canu, but canu consistently gave better results in its automated mode. I also liked the fact that I could visualize falcon's output while the canu's output in gfa cannot be visualized with bandage effectively. The BAC clones are mainly derived from chromosome X and chromosome Y, which I think will be highly repetitive and may require multiple attempts to obtain the best results. This is where I thought FALCON's speed would be immensely useful.

@rhallPB: in a single pool, there are approximately 48 BAC clones, and the BAC clones have been sequenced with a variable coverage (from 70X to 200X) after E.coli and Vector sequence removal in each pool. The BAC clones have been sequenced with P6C4 on PacBio RSII. I thought that HGAP would be a good choice as Eichler's group have been consistently using it for their BAC assembly, but I thought the BLASR alignment for read to read comparison would be too slow for multiple attempts.

We observed that a single subhead sometimes indeed have vector sequences that is spanned by the subhead and flanked by the BAC insert sequences. We thought we will use this for BAC assembly assessment. We additionally have BAC end sequences to align to the contig for assembly assessemnt. We have found that since the coverage of vector sequences are higher than the BAC insert coverage, the vector sequences acts as repeats, and creates problems for the assembly as noted here (https://github.com/KiddLab/pacbio-vector-trim).

I will try the assembly again with HGAP.3 and compare the results with that of canu. I eagerly wait for further advices.

Best, Jin

sjin09 commented 7 years ago

DBstats results from the subreads.

Statistics for all reads of length 500 bases or more

    110,046 reads        out of         160,758  ( 68.5%)
945,177,660 base pairs   out of   1,220,841,447  ( 77.4%)

      8,588 average read length
      5,129 standard deviation

Base composition: 0.296(A) 0.212(C) 0.212(G) 0.281(T)

Distribution of Read Lengths (Bin size = 1,000)

    Bin:      Count  % Reads  % Bases     Average
 48,000:          1      0.0      0.0       48637
 47,000:          1      0.0      0.0       47820
 46,000:          0      0.0      0.0       47820
 45,000:          3      0.0      0.0       46339
 44,000:          1      0.0      0.0       45980
 43,000:          5      0.0      0.1       44816
 42,000:          1      0.0      0.1       44639
 41,000:          4      0.0      0.1       43857
 40,000:         10      0.0      0.1       42594
 39,000:          7      0.0      0.1       41924
 38,000:          7      0.0      0.2       41327
 37,000:          8      0.0      0.2       40697
 36,000:         16      0.1      0.3       39660
 35,000:         13      0.1      0.3       38932
 34,000:         14      0.1      0.4       38249
 33,000:         23      0.1      0.4       37292
 32,000:         30      0.1      0.6       36290
 31,000:         35      0.2      0.7       35339
 30,000:         37      0.2      0.8       34507
 29,000:         46      0.2      0.9       33623
 28,000:         57      0.3      1.1       32699
 27,000:         67      0.4      1.3       31795
 26,000:         94      0.4      1.6       30751
 25,000:        119      0.5      1.9       29698
 24,000:        143      0.7      2.3       28687
 23,000:        185      0.8      2.7       27648
 22,000:        258      1.1      3.3       26518
 21,000:        345      1.4      4.1       25376
 20,000:        505      1.8      5.2       24161
 19,000:        695      2.5      6.6       22966
 18,000:        970      3.4      8.5       21786
 17,000:      1,474      4.7     11.3       20555
 16,000:      2,205      6.7     15.1       19333
 15,000:      3,223      9.6     20.4       18159
 14,000:      4,656     13.9     27.5       17032
 13,000:      6,283     19.6     36.5       15993
 12,000:      7,847     26.7     46.8       15059
 11,000:      7,819     33.8     56.3       14312
 10,000:      6,940     40.1     64.1       13714
  9,000:      6,405     45.9     70.5       13181
  8,000:      6,312     51.7     76.2       12661
  7,000:      6,433     57.5     81.3       12136
  6,000:      6,943     63.8     86.1       11579
  5,000:      7,519     70.7     90.4       10991
  4,000:      7,534     77.5     94.0       10417
  3,000:      7,306     84.1     96.7        9872
  2,000:      7,308     90.8     98.7        9332
  1,000:      6,891     97.0     99.7        8827
      0:      3,248    100.0    100.0        8588
rhallPB commented 7 years ago

The HGAP.3 job wont take too long on a reasonable computer. To convert the output of HGAP.3 into a graph file format that can be visualized in gephi https://gist.github.com/rhallPB/2d962e700d83270b0109

sjin09 commented 7 years ago

Dear @rhallPB

I have been successful with the HGAP3 de novo assembly in a de novo assembly without sungrid engine. The de novo assembly, however, failed in the runCaToUnitig.log when the sungrid engine was used with memory issues.

If possible I wanted to test whether HGAP4 with its improvements can be performed on the sun grid engine. However, the whitelist tutorial seems to be not directly applicable with the HGAP4 pipeline where the preset.xml has to be changed instead of settings.xml. The preset.xml also seems to very different from the settings.xml that I modified to run HGAP3.

I would appreciate any directions for producing a whitelist for BAC de novo assembly using HGAP4.

p.s: it seems that XML file cannot be posted directly.

rhallPB commented 7 years ago

At this time we do not have a workflow for white listing on HGAP.4. preset.xml and settings.xml are completely different parameter file formats. HGAP.4 is an implementation of Falcon using the pbsmrtpipe workflow engine, you can get the same results using Falcon (which supports sge), and simply filtering the reads in the input fasta files, but as stated previously Falcon likely won't give as good results as HGAP.3, without a lot of testing and optimizing of parameters.

sjin09 commented 7 years ago

Dear @rhallPB

Thank you for the swift response.

I have been tuning HGAP3 under smrtanalysis_2.3.0.1409 for BAC de novo assembly. I have been tuning the computeLengthCutoff (true/false), minLongReadLength (4500-10000bp) and minSubReadLength (3000-4500bp) to optimize the BAC de novo assembly. However, the results were not as satisfactory compared to the BAC de novo assembly with CANU. The CANU assembly had an average length of 90 kb whereas the HGAP assembly produced contigs with an average length of 50 kb.

What kind of other parameter optimization do you suggest to achieve better BAC de novo assemblies? The coverage for the BACs are on average ~50X.

Best, Jin

sjin09 commented 7 years ago

Dear @rhallPB

In addition, I wanted to inquire whether whitelisting for QUIVER is possible. If the HGAP results does not produce the desired results, I wanted to perform QUIVER on contigs from CANU without the subreads derived from E.coli genomic DNA.

Regards, Jin

rhallPB commented 7 years ago

To optimize this kind of assembly, where the coverage for individual BACs is variable, and the repeat content complex, I would visualize the overlap graph in gephi (see post above) to see if the issue is fragmentation in the graph, vs. complex overlaps (knots, due to repeats not being spanned). Fragmention in the graph would mean that you need to change parameters that allow more coverage in the overlap graph layout (lower seed read, lower subread cutoff, lower coverage required for preassembled read generation), complex overlaps would suggest you need to increase the seed read length, subread cut etc. Obviously given a not very equal pooled sample, these parameters may be different for different BACs in the pool, therefore it may be necessary to use different parameters for different BACs. Whitelisting for quiver is possible, it's done in exactly the same way as whitelisting for HGAP.3, but unless the BACs contain sequence with significant homology to ecoli I wouldn't worry about it. No ecoli data is going to map to your BACs if you use reasonable parameters for the alignment.

sjin09 commented 7 years ago

Cheers @rhallPB.

I will try the parameter optimization as you have mentioned.

sjin09 commented 7 years ago

Dear @rhallPB

Would it be correct to assume that "lower coverage required for preassembled read generation" would be equivalent to lowering the value for param name="xCoverage" label="Target Coverage"in the settings.xml.

Or were you referring to raising the values for param name="minCorCov" label="Minimum Coverage for Correction"?

In addition, I am trying to maintain the ends of the BAC insert as much as possible as I would like to align BAC end sequences back to the BAC contigs for identification of the BAC origin. The ends of the BAC would probably have lower coverage than other parts of the BAC contigs. Is there a parameter that I should be particularly careful for this setting?

rhallPB commented 7 years ago

minCorCov

Are you trimming out the vector before assembling? If not, then there is no reason that the ends would have less coverage. If you are trimming out the vector, then simply don't trim out the full vector sequence until after assembly, i.e. leave a few hundred base pairs of the vector on each side, so that the BAC end is never actually the end of the sequence.

sjin09 commented 7 years ago

Dear @rhallPB

I have been able to obtain a better assembly by reducing the lower seed read and minimum subread length as you have recommended. I am now optimizing the assembly to better represent the ends of the BAC inserts.

I have aligned .bax.h5 files to the vector sequence using BLASR. Thereafter, I have masked the portions of the subreads in the .bax.h5 with vector sequence by modifying the high quality sections in the region table. I will try unmasking some of the sections that has matched to the vector sequence.

Best, Jin

sjin09 commented 7 years ago

Thank you for all the help. I think I got the BAC de novo assembly working reasonably well with the parameter optimizations that you have suggested. I have closed the issue.