Kuanhao-Chao / LiftOn

🚀 LiftOn: Accurate annotation mapping for GFF/GTF across assemblies
http://ccb.jhu.edu/lifton
GNU General Public License v3.0

FeatureNotFoundError with gffutils #5

Closed by inesliroulet 4 months ago

inesliroulet commented 4 months ago

Hello,

I am trying to use this tool to annotate a genome assembly of a plant species, Centaurea corymbosa, using data from a closely related species, Centaurea solstitialis. I launched the tool as follows: lifton -g annotation.gff -P proteins.fa -t 30 centaurea_genome.fasta genome.fa, where annotation.gff, proteins.fa and genome.fa are the Centaurea solstitialis data downloaded directly from NCBI.

During the miniprot annotation step, I get this error:

>> Creating miniprot annotation database : miniprot.gff3
2024-05-22 17:24:32,880 - INFO - Populating features
2024-05-22 17:25:24,586 - INFO - Populating features table and first-order relations: 350342 features
2024-05-22 17:25:24,586 - INFO - Updating relations
2024-05-22 17:25:30,261 - INFO - Creating relations(parent) index
2024-05-22 17:25:30,442 - INFO - Creating relations(child) index
2024-05-22 17:25:30,662 - INFO - Creating features(featuretype) index
2024-05-22 17:25:31,093 - INFO - Creating features (seqid, start, end) index
2024-05-22 17:25:31,809 - INFO - Creating features (seqid, start, end, strand) index
2024-05-22 17:25:32,583 - INFO - Running ANALYZE features
/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/Bio/Seq.py:2880: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  warnings.warn(
aligning features
lifting features
>> LiftOn processed: 20 features.
Traceback (most recent call last):
  File "/opt/biotools/conda/envs/LiftOn_env/bin/lifton", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/lifton.py", line 352, in main
    run_all_lifton_steps(args)
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/lifton.py", line 300, in run_all_lifton_steps
    lifton_gene = run_liftoff.process_liftoff(None, locus, ref_db.db_connection, l_feature_db, ref_id_2_m_id_trans_dict, m_feature_db, tree_dict, tgt_fai, ref_proteins, ref_trans, ref_features_dict, fw_score, fw_chain, args, ENTRY_FEATURE=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/run_liftoff.py", line 162, in process_liftoff
    lifton_gene = process_liftoff(parent_feature, feature, ref_db, l_feature_db, ref_id_2_m_id_trans_dict, m_feature_db, tree_dict, tgt_fai, ref_proteins, ref_trans, ref_features_dict, fw_score, fw_chain, args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/run_liftoff.py", line 170, in process_liftoff
    lifton_trans, cds_num = lifton_add_trans_exon_cds(lifton_gene, locus, ref_db, l_feature_db, ref_trans_id)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/lifton/run_liftoff.py", line 69, in lifton_add_trans_exon_cds
    lifton_trans = lifton_gene.add_transcript(ref_trans_id, copy.deepcopy(locus), copy.deepcopy(ref_db[ref_trans_id].attributes))
                                                                                                ~~~~~~^^^^^^^^^^^^^^
  File "/opt/biotools/conda/envs/LiftOn_env/lib/python3.12/site-packages/gffutils/interface.py", line 296, in __getitem__
    raise FeatureNotFoundError(key)
gffutils.exceptions.FeatureNotFoundError: rna-OSB04

I have checked that the 'rna-OSB04' feature is present in the GFF file, but I can't tell what is wrong with it. I thought the file format might be the cause, so I tried launching the tool with the GTF file instead, but it finds 0 features in that file and treats it as empty.
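One way to double-check whether an ID such as 'rna-OSB04' exists as an exact ID attribute, rather than only as the beginning of longer IDs, is a small scan of the ninth GFF column. The sketch below is only an illustration using the Python standard library; the helper names (gff_ids, summarize_id) and the example records are made up for this post, and it does not reproduce how LiftOn or gffutils index features.

```python
from collections import Counter
from urllib.parse import unquote


def gff_ids(lines):
    """Yield (feature_type, ID) for each 9-column GFF3 line that has an ID attribute."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key.strip() == "ID":
                # GFF3 attribute values may be percent-encoded
                yield cols[2], unquote(value.strip())
                break


def summarize_id(lines, query):
    """Return (number of exact ID matches, sorted IDs that only start with query)."""
    counts = Counter(fid for _, fid in gff_ids(lines))
    exact = counts.get(query, 0)
    prefixed = sorted(f for f in counts if f != query and f.startswith(query))
    return exact, prefixed


# Hypothetical records for illustration only; for a real file pass
# open("annotation.gff") instead of this list.
example = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene-OSB04_000001\n",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=rna-OSB04_000001;Parent=gene-OSB04_000001\n",
]
exact, prefixed = summarize_id(example, "rna-OSB04")
print(exact, prefixed)
```

If the exact count is 0 while the prefix list is not empty, the string shows up in a text search of the file but is not itself a feature ID, which would be worth mentioning when reporting the problem.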

I have also tried the tool with another, less-closely related species (Cynara cardunculus) and it worked just fine with the gff file (the command was lifton -g annotation.gff -P proteins.fa -T rna.fa -t 30 centaurea_genome.fasta genome.fa).

Do you have any idea what might be wrong in the case of C. solstitialis?

Thank you for your help

Kuanhao-Chao commented 4 months ago

Hi @inesliroulet,

Thanks for your feedback! To answer briefly, please just remove the protein argument, -P proteins.fa, and it should work correctly.

Here is a more detailed explanation: When you run LiftOn, it extracts proteins from the annotation file and stores them in a FASTA file. If you've already run LiftOn and want to run it again, you can pass that file with the -P protein argument to save some time. The reason you encountered the error is that the protein IDs in the file you provided do not match the IDs stored in the GFF file; the IDs in NCBI's protein file differ slightly from those in the GFF.
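To see whether such an ID mismatch applies to a given pair of files, one could compare the FASTA record IDs against the attribute values in the GFF. The following is a rough sketch using only the Python standard library; the function names and the accessions in the example are hypothetical, and which attribute LiftOn actually keys proteins on is not spelled out here, so the attribute to compare against is left as a parameter.

```python
def fasta_ids(lines):
    """Return the set of record IDs (first token after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def gff_attribute_values(lines, keys=("ID",)):
    """Collect the values of the given attribute keys from 9-column GFF3 lines."""
    values = set()
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key.strip() in keys:
                values.add(value.strip())
    return values


def unmatched_fasta_ids(fasta_lines, gff_lines, keys=("ID",)):
    """Return FASTA IDs that do not appear among the chosen GFF attribute values."""
    return sorted(fasta_ids(fasta_lines) - gff_attribute_values(gff_lines, keys))


# Made-up accessions for illustration; use open("proteins.fa") and
# open("annotation.gff") for real files.
fasta = [">XP_000001.1 hypothetical protein\n", "MKT\n", ">rna-XM_000001.1\n", "MKT\n"]
gff = [
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=rna-XM_000001.1;Parent=gene-1\n",
    "chr1\tsrc\tCDS\t1\t100\t.\t+\t0\tID=cds-XP_000001.1;Parent=rna-XM_000001.1;protein_id=XP_000001.1\n",
]
missing_by_id = unmatched_fasta_ids(fasta, gff)
missing_by_id_or_protein = unmatched_fasta_ids(fasta, gff, keys=("ID", "protein_id"))
print(missing_by_id, missing_by_id_or_protein)
```

A non-empty result for the ID comparison means some FASTA headers have no feature with the same ID in the GFF, which is the situation described above.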

This is my assumption based on some common mistakes that might happen. If it does not solve your problem, feel free to send me your data or files: kuanhao.chao@gmail.com.

inesliroulet commented 4 months ago

Hello again @Kuanhao-Chao,

Thank you for your answer. Sadly, I have tried launching the tool again without the protein argument (lifton -g annotation.gff -t 30 centaurea_genome.fasta genome.fa), and I get the exact same error again, so it seems like it's not the protein IDs.

Here are the links to the NCBI data I use:

I will also send them to you via email along with the genome I am trying to annotate.

Thank you very much for your help

Kuanhao-Chao commented 4 months ago

Hi @inesliroulet,

Thanks for sharing the data with me. I ran LiftOn on your dataset and did not encounter the error. Here is the command that I ran:

lifton -g /data/reference/annotation.gff -o lifton.gff3 -polish -copies -sc 0.95 /data/target/centaurea_genome.fasta /data/reference/genome.fa

I have also sent an email with the results to you. Please let me know if you have any more questions!

Kuan-Hao