FGS++ compatibility with BBTools (BBMerge, Tadpole)

LeeBergstrand commented 3 years ago

Background

I am currently using the pre-assembly processing step of the ATLAS pipeline to generate enhanced input sequencing reads for FGS++. The pre-assembly processing steps of ATLAS that consists of k-mer-based error correction and read extension using BBTools Tadpole and kmer-based left-right read merging using BBMerge. The pre-assembly processing spits out the following files:

QC.errorcorr.merged_me.fastq.gz --> errorcorrected and merged reads QC.errorcorr.merged_R1.fastq.gz --> remaining errorcorrected unmerged right reads QC.errorcorr.merged_R2.fastq.gz --> remaining errorcorrected unmerged left reads QC.errorcorr.merged_se.fastq.gz --> errorcorrected singletons reads whose left or right and pair failed prior QC steps and was removed.

My plan is to use FGS++ on all four of these file to predict the proteins found within.

Problem Description

FGS++ has a -w flag. From the FGS++ -h command:

-w [1 or 0]:  1 if the sequence file has complete genomic sequences
              0 if the sequence file has short sequence reads

From the FragGeneScan paper about the -w flag (via @jmtsuji):

"Considering that complete genomic sequences are unlikely to contain indel errors, the transition probabilities to insertion and deletion states are set to 0 when applying FragGeneScan to gene prediction in complete genomic sequences."

Because ATLAS is performing error correction using Tadpole should the 1 flag or 0 flag be used for -w?

Problem Solution

[x] Review Tadpole paper to determine how error correction works.
[ ] Review FGS++ code to determine in greater detail what happening with the -w flag under the hood.

LeeBergstrand commented 3 years ago

@jmtsuji Thoughts?

jmtsuji commented 3 years ago

@LeeBergstrand If errors are already corrected, then you probably do not want to use the standard Illumina training models included with FGS++, because they incorporate correcting errors expected in raw Illumina reads. Just my initial thoughts. But I agree, it would be helpful to see what the -w flag does mechanistically.

LeeBergstrand commented 3 years ago

@jmtsuji

I read the FGS paper and the BBMerge paper that discusses Tadpole. Tadpole uses kmer frequencies to error correct reads, and bbmerge uses Tadpole code as a subcomponent. BBMerge has some additional optimizations the use paired-end information to improve error correction. The bbmerge paper notes:

Indel rates are noted because they can induce frameshifts, which disrupt gene annotation. BBMerge variants and USEARCH clustered together closely, with rates ranging from 0.81 (BBMerge-REM) to 0.88 (USEARCH) indels per 100 kbp (Table 3). The other tools yielded rates ranging from 1.08 (XORRO) to 1.52 (COPE), except for Stitch (47.78 per 100 kbp).

I assume that this means the bbmerge has low indels within the merged section. I assume its error correction step would remove indels from other parts of the reads before merging.

The FGS paper notes:

Considering that complete genomic sequences are unlikely to contain indel errors, the transition probabilities to insertion and deletion states are set to 0 when applying FragGeneScan to gene prediction in complete genomic sequences.

Because the FGS++ -w sets the tool to complete genomic sequence mode, this would set the gene prediction HMM to zero out the insertion and deletion states. The zeroing out would omit the frameshift detection and correction mechanism of fraggenescan, leaving behind only the ORF prediction parts of the HMM. If the zeroing out of the insertion and deletion is the only thing that the -w would should be able to combine FGS++ and BBMerge.

I still think there is a benefit to using FGS++ in the complete genomics sequence mode on the merged reads as the tool is optimized for finding gene fragments that are missing start/stop codons etc. Other gene prediction tools (e.g., Prodigal) may have issues with this as they are expecting complete sequences. Fraggenescan is designed to work with reads from 100-700 bp (illumina to 454/Sanger).

I'm going to proceed with using the -w flag.

Any thoughts?

LeeBergstrand commented 3 years ago

@jmtsuji I was going over the command line parameters again. I noticed that you can set the -w to 0 and -t to complete at the same time. In terms of command line parameters this makes the most sense to me as one running FGS++ on error corrected short reads.

       -w [1 or 0]:           1 if the sequence file has complete genomic sequences
                                0 if the sequence file has short sequence reads
       -t [train_file_name]:  file name that contains model parameters; this file should be in the -r directory
                           Note that four files containing model parameters already exist in the "train" directory
                           [complete] for complete genomic sequences or short sequence reads without sequencing error
                           [sanger_5] for Sanger sequencing reads with about 0.5% error rate
                           [sanger_10] for Sanger sequencing reads with about 1% error rate
                           [454_5] for 454 pyrosequencing reads with about 0.5% error rate
                           [454_10] for 454 pyrosequencing reads with about 1% error rate
                           [454_30] for 454 pyrosequencing reads with about 3% error rate
                           [illumina_1] for Illumina sequencing reads with about 0.1% error rate
                           [illumina_5] for Illumina sequencing reads with about 0.5% error rate
                           [illumina_10] for Illumina sequencing reads with about 1% error rate

e.g.,

FGS++ -s S32.fasta -o S32.faa -w 0 -t complete -p 24 -m 308000

LeeBergstrand / FragGeneScanPlusPlus