Kuanhao-Chao / LiftOn

🚀 LiftOn: Accurate annotation mapping for GFF/GTF across assemblies
http://ccb.jhu.edu/lifton
GNU General Public License v3.0

LiftOff generates better BUSCO scores than LiftOn #24

Open 14zac2 opened 3 weeks ago

14zac2 commented 3 weeks ago

Hi there,

I was testing LiftOff and LiftOn to see which one is "best" for genome annotation. Using RefSeq genomes, I lifted the brown-headed cowbird and red-winged blackbird annotations onto the bronzed cowbird assembly. In both cases, LiftOff produced better BUSCO scores, and LiftOn also generated some odd features whose end coordinate was smaller than the start coordinate. GFFCompare, on the other hand, reported more matching transcripts for the LiftOn annotation, although its precision was a tad lower. As a result, I currently trust LiftOff more as an annotation tool and wanted to bring these results to your attention.
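
For anyone who wants to look for the inverted features mentioned above, here is a minimal sketch that lists every feature line whose end coordinate (column 5) is smaller than its start coordinate (column 4). The function name `find_inverted_features` and the example file name are placeholders for illustration; they are not part of LiftOn or LiftOff, and the check assumes a standard tab-separated, nine-column GFF/GTF file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LiftOn or LiftOff):
# print GFF/GTF feature lines where end (column 5) < start (column 4).
find_inverted_features() {
    awk -F'\t' '
        /^#/ { next }                      # skip comment and directive lines
        NF >= 9 && ($5 + 0) < ($4 + 0) {   # compare coordinates numerically
            print FILENAME ":" FNR ": " $0
        }
    ' "$1"
}

# Example usage (placeholder file name):
# find_inverted_features lifton_output.gff3
```

Each reported line is prefixed with the file name and line number, which should make it easy to pull out a few affected features when reporting the problem.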

Here is my BUSCO score for LiftOff, brown-headed cowbird on bronzed cowbird:

    ----------------------------------------------------
    |Results from dataset passeriformes_odb10           |
    ----------------------------------------------------
    |C:93.6%[S:60.2%,D:33.4%],F:1.2%,M:5.2%,n:10844     |
    |10148    Complete BUSCOs (C)                       |
    |6526    Complete and single-copy BUSCOs (S)        |
    |3622    Complete and duplicated BUSCOs (D)         |
    |127    Fragmented BUSCOs (F)                       |
    |569    Missing BUSCOs (M)                          |
    |10844    Total BUSCO groups searched               |
    ----------------------------------------------------

And here is the GFFCompare output comparing this brown-headed cowbird LiftOff annotation to the RefSeq annotation:

#-----------------| Sensitivity | Precision  |
        Base level:    89.2     |    84.4    |
        Exon level:    85.6     |    87.7    |
      Intron level:    90.0     |    92.2    |
Intron chain level:    59.9     |    58.1    |
  Transcript level:    60.4     |    58.7    |
       Locus level:    71.2     |    74.4    |

     Matching intron chains:   17652
       Matching transcripts:   18650
              Matching loci:   13081

          Missed exons:   15337/199727  (  7.7%)
           Novel exons:   10077/195294  (  5.2%)
        Missed introns:   11294/180584  (  6.3%)
         Novel introns:    5639/176399  (  3.2%)
           Missed loci:    1914/18371   ( 10.4%)
            Novel loci:    1166/17582   (  6.6%)

Here is BUSCO for LiftOn, brown-headed cowbird on bronzed cowbird:

    -----------------------------------------------------
    |Results from dataset passeriformes_odb10            |
    -----------------------------------------------------
    |C:71.1%[S:47.5%,D:23.6%],F:0.4%,M:28.5%,n:10844     |
    |7708    Complete BUSCOs (C)                         |
    |5150    Complete and single-copy BUSCOs (S)         |
    |2558    Complete and duplicated BUSCOs (D)          |
    |46    Fragmented BUSCOs (F)                         |
    |3090    Missing BUSCOs (M)                          |
    |10844    Total BUSCO groups searched                |
    -----------------------------------------------------
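
To put the two brown-headed cowbird BUSCO summaries side by side, the one-line summary strings can be parsed and subtracted. This is only a rough sketch: `busco_pct` is a made-up helper name, and the two strings are copied from the LiftOff and LiftOn BUSCO reports in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract one percentage (C, S, D, F or M)
# from a BUSCO one-line summary string.
busco_pct() {
    # $1 = summary string, $2 = category letter
    printf '%s\n' "$1" | sed -E "s/.*$2:([0-9.]+)%.*/\1/"
}

# Summary lines copied from the brown-headed cowbird runs in this post.
liftoff='C:93.6%[S:60.2%,D:33.4%],F:1.2%,M:5.2%,n:10844'
lifton='C:71.1%[S:47.5%,D:23.6%],F:0.4%,M:28.5%,n:10844'

# Print both values and the LiftOn minus LiftOff difference per category.
for k in C S D F M; do
    a=$(busco_pct "$liftoff" "$k")
    b=$(busco_pct "$lifton" "$k")
    awk -v k="$k" -v a="$a" -v b="$b" \
        'BEGIN { printf "%s\tLiftOff=%.1f\tLiftOn=%.1f\tdiff=%+.1f\n", k, a, b, b - a }'
done
```

Most of the drop in complete BUSCOs shows up as missing rather than fragmented BUSCOs, which may be a useful hint when tracking down the cause.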

GFFCompare for LiftOn brown-headed cowbird compared to RefSeq:

#-----------------| Sensitivity | Precision  |
        Base level:    89.5     |    84.2    |
        Exon level:    86.0     |    87.2    |
      Intron level:    90.4     |    91.7    |
Intron chain level:    60.2     |    57.9    |
  Transcript level:    60.8     |    58.4    |
       Locus level:    71.8     |    73.5    |

     Matching intron chains:   17721
       Matching transcripts:   18765
              Matching loci:   13198

          Missed exons:   14512/199727  (  7.3%)
           Novel exons:   11255/197273  (  5.7%)
        Missed introns:   10598/180584  (  5.9%)
         Novel introns:    6515/177980  (  3.7%)
           Missed loci:    1760/18371   (  9.6%)
            Novel loci:    1388/17956   (  7.7%)
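
As a sanity check on how these numbers relate, the locus-level sensitivity and precision can be recomputed from the counts in the two brown-headed cowbird reports. The sketch below assumes that the denominator of "Missed loci" (18371) is the number of reference loci and the denominator of "Novel loci" (17582 for LiftOff, 17956 for LiftOn) is the number of predicted loci; `pct` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pct MATCHING TOTAL -> percentage with one decimal.
pct() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * m / t }'
}

# Sensitivity = matching loci / reference loci
# Precision   = matching loci / predicted loci
# Counts copied from the brown-headed cowbird GFFCompare reports in this post.
echo "LiftOff sensitivity: $(pct 13081 18371)  precision: $(pct 13081 17582)"
echo "LiftOn  sensitivity: $(pct 13198 18371)  precision: $(pct 13198 17956)"
# LiftOff sensitivity: 71.2  precision: 74.4
# LiftOn  sensitivity: 71.8  precision: 73.5
```

The recomputed values agree with the locus-level rows of both tables. LiftOn has more matching loci (13198 vs 13081) but also a larger predicted set (17956 vs 17582), which is consistent with its slightly higher sensitivity and slightly lower precision.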

BUSCO for LiftOff of red-winged blackbird on bronzed cowbird:

    ----------------------------------------------------
    |Results from dataset passeriformes_odb10           |
    ----------------------------------------------------
    |C:97.3%[S:69.3%,D:28.0%],F:0.4%,M:2.3%,n:10844     |
    |10551    Complete BUSCOs (C)                       |
    |7510    Complete and single-copy BUSCOs (S)        |
    |3041    Complete and duplicated BUSCOs (D)         |
    |39    Fragmented BUSCOs (F)                        |
    |254    Missing BUSCOs (M)                          |
    |10844    Total BUSCO groups searched               |
    ----------------------------------------------------

GFFCompare of LiftOff red-winged blackbird compared to RefSeq annotation:

#-----------------| Sensitivity | Precision  |
        Base level:    76.9     |    90.0    |
        Exon level:    82.0     |    85.5    |
      Intron level:    87.0     |    90.2    |
Intron chain level:    44.3     |    46.0    |
  Transcript level:    45.3     |    46.8    |
       Locus level:    62.1     |    66.1    |

     Matching intron chains:   13052
       Matching transcripts:   13987
              Matching loci:   11409

          Missed exons:   19028/199727  (  9.5%)
           Novel exons:   10245/191797  (  5.3%)
        Missed introns:   12796/180584  (  7.1%)
         Novel introns:    5096/174094  (  2.9%)
           Missed loci:    2101/18371   ( 11.4%)
            Novel loci:     923/17215   (  5.4%)

BUSCO of LiftOn red-winged blackbird onto bronzed cowbird:

    -----------------------------------------------------
    |Results from dataset passeriformes_odb10            |
    -----------------------------------------------------
    |C:61.3%[S:44.0%,D:17.3%],F:0.4%,M:38.3%,n:10844     |
    |6649    Complete BUSCOs (C)                         |
    |4768    Complete and single-copy BUSCOs (S)         |
    |1881    Complete and duplicated BUSCOs (D)          |
    |44    Fragmented BUSCOs (F)                         |
    |4151    Missing BUSCOs (M)                          |
    |10844    Total BUSCO groups searched                |
    -----------------------------------------------------

GFFCompare of LiftOn red-winged blackbird compared to RefSeq annotation:

#-----------------| Sensitivity | Precision  |
        Base level:    77.2     |    89.9    |
        Exon level:    82.4     |    85.3    |
      Intron level:    87.4     |    90.1    |
Intron chain level:    44.9     |    46.2    |
  Transcript level:    45.9     |    47.0    |
       Locus level:    63.0     |    66.1    |

     Matching intron chains:   13209
       Matching transcripts:   14176
              Matching loci:   11571

          Missed exons:   18326/199727  (  9.2%)
           Novel exons:   10918/193146  (  5.7%)
        Missed introns:   12242/180584  (  6.8%)
         Novel introns:    5639/175141  (  3.2%)
           Missed loci:    1977/18371   ( 10.8%)
            Novel loci:    1050/17477   (  6.0%)

Kuanhao-Chao commented 3 weeks ago

Thanks @14zac2 for sharing the results with us! I’ll definitely be looking into them closely. If possible, could you please share the genome and annotation files with me? That would be incredibly helpful.

There’s still some work to be done on a few edge cases in LiftOn. I’m confident that after resolving these issues, LiftOn will perform as well as, if not better than, current methods on those more divergent genes.

I’m currently on an internship until the end of August, so I’ll revisit this in September. It was great meeting you at the conference, and thanks again for testing LiftOn!

14zac2 commented 3 weeks ago

Sure thing! All of these were RefSeq genomes and annotations, so I'll link to the FTPs of the species below. In each case, I used the *.fna FASTA files and the GFF annotations.

Bronzed cowbird: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/037/042/795/GCF_037042795.1_BPBGC_Maene_1.0/

Brown-headed cowbird: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/012/460/135/GCF_012460135.2_BPBGC_Mater_1.1/

Red-winged blackbird: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/745/825/GCF_020745825.1_Agelaius_phoeniceus_1.1/

For LiftOff, I used a Docker container for version 1.6.3 and my script was as follows:

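    # Positional arguments: target genome, reference genome, reference GFF, output GFF
    echo "Target genome: $1"
    echo "Reference genome: $2"
    echo "Reference GFF: $3"
    echo "Output GFF: $4"
    #echo "Feature list: $5"

    # -p avoids an error if the directory already exists
    mkdir -p liftoff

    # The working directory is mounted at /tmp inside the container
    docker run -v "$(pwd)":/tmp staphb/liftoff liftoff \
     "/tmp/$1" "/tmp/$2" -g "/tmp/$3" -o "/tmp/liftoff/$4" \
     -u "/tmp/liftoff/unmapped_features.txt" \
     -dir "/tmp/liftoff/intermediate_files" \
     -copies -p 20 -polish -flank 0.5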

I tried to replicate the same parameters with LiftOn:

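    # Positional arguments: target genome, reference genome, reference GFF, output GFF
    echo "Target genome: $1"
    echo "Reference genome: $2"
    echo "Reference GFF: $3"
    echo "Output GFF: $4"

    source /home/zclarke/anaconda2/etc/profile.d/conda.sh
    source ~/bin/lifton_env/bin/activate
    conda activate lifton

    # Run from inside the output directory, so input paths are relative to ../
    mkdir -p lifton
    cd lifton

    lifton "../$1" "../$2" -g "../$3" -o "$4" \
     -copies -t 20 -polish -flank 0.5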

It was great meeting you as well, and best of luck with your tool! I'd be very curious to understand what's going on here.

Cheers, Zoe