althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0
138 stars 5 forks source link

`sequence_id` used differently in translation and GFF outputs? #53

Closed schorlton-bugseq closed 4 months ago

schorlton-bugseq commented 4 months ago

Thanks for the great tool! I was trying to use the outputs of pyrodial, and I think I ran into an issue where the ID between the translated seq and GFF do not match.

I wrote my FASTA and GFF with:

    with open(pyrodigal_gff, "w") as dst:
        for i, (seq_id, genes) in enumerate(listed_predictions):
            genes.write_gff(dst, sequence_id=seq_id, header=(i == 0))

    with open(pyrodigal_proteins, "w") as dst:
        for seq_id, genes in listed_predictions:
            genes.write_translations(dst, sequence_id=seq_id, translation_table=genetic_code)

listed_predictions is a list of [(seq_id, pyrodigal.Genes), (seq_id2, pyrodigal.Genes2)]

In the FASTA I see:

>contig_1_1 # 2009 # 2440 # -1 # ID=3_1;partial=00;start_type=TTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.488

However in the GFF I see:

contig_1    pyrodigal_v3.3.0    CDS 2009    2440    4.5 -   0   ID=contig_1_1;partial=00;start_type=TTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.488;conf=73.59;score=4.46;cscore=7.02;sscore=-2.57;rscore=5.61;uscore=-2.14;tscore=-5.38;

Note the different IDs.

When running prodigal, I see: FASTA:

>contig_1_1 # 2009 # 2440 # -1 # ID=1_1;partial=00;start_type=TTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.488

GFF:

contig_1    Prodigal_v2.6.3 CDS 2009    2440    4.5 -   0   ID=1_1;partial=00;start_type=TTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.488;conf=73.59;score=4.46;cscore=7.02;sscore=-2.57;rscore=5.61;uscore=-2.14;tscore=-5.38;

I suspect this is because of the use of sequence_id here: https://github.com/althonos/pyrodigal/blob/9887bb29ff39428b205ed63205e755f5209b16e8/pyrodigal/lib.pyx#L3633

but num_seq here: https://github.com/althonos/pyrodigal/blob/9887bb29ff39428b205ed63205e755f5209b16e8/pyrodigal/lib.pyx#L3746

Should these be the same? Let me know if I could do anything differently and thanks for your help!!

althonos commented 4 months ago

This was added in #18 because Prodigal not reporting the gene names in the GFF output was more annoying than anything. You can just ignore whatever ID is in the FASTA output and just use the name of the FASTA record :+1:

schorlton commented 4 months ago

Thanks for the quick response. There are downstream tools which rely on prodigal outputs, and specifically leveraging information from both fasta and gff using the ID field to match records between files. Therefore I don't think this suggestion is optimal. Wouldn't a better solution be to also include sequence_id in the ID field of the translated seqs? If you're going to make that design decision for the gff, you could imagine the pyrodigal fasta output from above looks like:

>contig_1_1 # 2009 # 2440 # -1 # ID=contig_1_1;partial=00;start_type=TTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.488

(Note change in ID)

Thanks again for consideration!

althonos commented 4 months ago

I added support for this in v3.4.0.

Pass full_id=True as an argument to write_genes, write_translations and write_gff to write either the full sequence name, or full_id=False to write the sequence index. The default are True for write_gff and False for write_translations and write_genes for backwards compatibility.