almeidasilvaf / syntenet

An R package to infer and analyze synteny networks from protein sequences
https://almeidasilvaf.github.io/syntenet/

check_input error #12

Closed jhcuarta closed 1 year ago

jhcuarta commented 1 year ago

Hi, I was wondering if you could help me out: my data didn't pass check_input(). I'm confused, since both files were produced by Prokka 1.14.6, so the protein names and headers should match. Here are the first lines of the .gff file and the .fasta file:

##gff-version 3
##sequence-region Vibrio_cholerae_strain_1Mo 1 4024350
Vibrio_cholerae_strain_1Mo prokka gene 123 1163 . - . ID=KLDFOOAE_00001_gene;locus_tag=KLDFOOAE_00001
Vibrio_cholerae_strain_1Mo prokka mRNA 123 1163 . - . ID=KLDFOOAE_00001_mRNA;locus_tag=KLDFOOAE_00001
Vibrio_cholerae_strain_1Mo Prodigal:002006 CDS 123 1163 . - 0 ID=KLDFOOAE_00001;Parent=KLDFOOAE_00001_gene,KLDFOOAE_00001_mRNA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:RefSeq:WP_001909624.1;locus_tag=KLDFOOAE_00001;product=tape measure protein [Vibrio cholerae]
Vibrio_cholerae_strain_1Mo prokka gene 1298 1657 . - . ID=KLDFOOAE_00002_gene;locus_tag=KLDFOOAE_00002
Vibrio_cholerae_strain_1Mo prokka mRNA 1298 1657 . - . ID=KLDFOOAE_00002_mRNA;locus_tag=KLDFOOAE_00002
Vibrio_cholerae_strain_1Mo Prodigal:002006 CDS 1298 1657 . - 0 ID=KLDFOOAE_00002;Parent=KLDFOOAE_00002_gene,KLDFOOAE_00002_mRNA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:RefSeq:WP_000290080.1;locus_tag=KLDFOOAE_00002;product=MULTISPECIES: phage tail assembly protein [Vibrio]

>KLDFOOAE_00001 tape measure protein [Vibrio cholerae]
MANNLKTDIVLNLQGDLAQKARSYSKEMTTLATRSKAAFSMISSSAIAASRGIDTFGNRL
LFITGAAAVGFERTFVKTAAEFERYQTMLNKLQGSPEAGAKAMAWIEEFTQNTPYAIDEV
TQSFVRLKAFGIDPMDGTMQSIADQAAMIGGTAETVEGIATALGQAWTKGKLQSEEALQL
LERGVPVWDYLIKSSKELGMNNGRGFTKEELDDMSSKGKLGRDAIRALIKQMGKESAGAA
KEQMNTWNGMISNMGDHWKLFQKDVMGSGAFTVLKDQLGEFLGMLDEMKKTGEYDEFVDK
VGRDLVEAFKSAAAAAREIKEVGEELWPVIREIGSMACSGIVNLVT
>KLDFOOAE_00002 MULTISPECIES: phage tail assembly protein [Vibrio]
MAVMTFNLEDGFKVGDAQCHEVGLKELTPKDVFDAQLASEKIGILNGRPHAYTSDVQMGM
ELLCRQVEFIGNVQGPFSVKEILKLSSRDFATLQQKARELDDIMFSDDALEGLEARGRD

I'm still confused about which headers should match. Both files came from the same application, Prokka 1.14.6, so the IDs should be preserved across its output files. Nevertheless, I edited the .fasta file so that each header carries only the ID, but without success. This is how the FASTA file looks after editing:

>KLDFOOAE_00001
MANNLKTDIVLNLQGDLAQKARSYSKEMTTLATRSKAAFSMISSSAIAASRGIDTFGNRL
LFITGAAAVGFERTFVKTAAEFERYQTMLNKLQGSPEAGAKAMAWIEEFTQNTPYAIDEV
TQSFVRLKAFGIDPMDGTMQSIADQAAMIGGTAETVEGIATALGQAWTKGKLQSEEALQL
LERGVPVWDYLIKSSKELGMNNGRGFTKEELDDMSSKGKLGRDAIRALIKQMGKESAGAA
KEQMNTWNGMISNMGDHWKLFQKDVMGSGAFTVLKDQLGEFLGMLDEMKKTGEYDEFVDK
VGRDLVEAFKSAAAAAREIKEVGEELWPVIREIGSMACSGIVNLVT
>KLDFOOAE_00002
MAVMTFNLEDGFKVGDAQCHEVGLKELTPKDVFDAQLASEKIGILNGRPHAYTSDVQMGM
ELLCRQVEFIGNVQGPFSVKEILKLSSRDFATLQQKARELDDIMFSDDALEGLEARGRD

How should I edit my files so that they match? Thanks in advance.

almeidasilvaf commented 1 year ago

Hi, @jhcuarta

Thanks for opening the issue and sharing what the data looks like.

You forgot to include the error message. However, looking at the first few lines of your GFF3 file, I noticed that it doesn't include an attribute named "gene_id" or "Name", which are typically found in standard GFF3 files (maybe Prokka returns a non-standard GFF3 file).

To extract gene IDs, syntenet looks for a column named gene_id in the GRanges object; if it doesn't find it, it looks for a column named Name; if it finds neither column, it returns an error, since that probably indicates a malformed GFF3 file.
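As a rough illustration of that lookup order, here is a minimal sketch in base R. The helper `find_id_column()` is hypothetical and is not syntenet's actual source code, and a plain data frame stands in for the metadata columns of a GRanges object.

```r
# Hypothetical helper (NOT syntenet's real code): choose the ID column
# by trying `gene_id` first and `Name` second, and fail if both are absent.
find_id_column <- function(meta) {
    candidates <- c("gene_id", "Name")
    hit <- candidates[candidates %in% names(meta)]
    if (length(hit) == 0) {
        stop("No 'gene_id' or 'Name' column found.")
    }
    hit[1]
}

# Toy metadata mimicking the Prokka records above: only `ID` and `locus_tag`
prokka_meta <- data.frame(
    ID = c("KLDFOOAE_00001_gene", "KLDFOOAE_00002_gene"),
    locus_tag = c("KLDFOOAE_00001", "KLDFOOAE_00002")
)

# Fails here, because neither expected column exists
result <- tryCatch(find_id_column(prokka_meta), error = function(e) "error")

# Succeeds once a `gene_id` column is present
prokka_meta$gene_id <- prokka_meta$locus_tag
fixed <- find_id_column(prokka_meta)
```

In this toy case the first call hits the error branch, and the second call picks `gene_id`.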

What you did to the FASTA headers (keeping only the gene IDs and removing additional info) is right; this is what the FASTA must look like. I can see that the same IDs are present in the locus_tag field of the GFF3, which will become a locus_tag column in the GRanges object. Thus, you can solve the issue by creating a column named gene_id in your GRanges objects with the contents of the column locus_tag.
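If you would rather trim the headers in R than edit the FASTA file by hand, something along these lines might work. This is only a sketch on a plain character vector; it assumes the ID is always the first whitespace-delimited token of the header, as in the Prokka output shown above.

```r
# Headers as they appear in the original Prokka FASTA file
headers <- c(
    "KLDFOOAE_00001 tape measure protein [Vibrio cholerae]",
    "KLDFOOAE_00002 MULTISPECIES: phage tail assembly protein [Vibrio]"
)

# Drop the first whitespace character and everything after it,
# keeping only the leading ID
clean_ids <- sub("\\s.*$", "", headers)
clean_ids
```

With the sequences loaded as an AAStringSet, the same `sub()` call could presumably be applied to `names()` of each object and assigned back.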

Supposing you have your GFF3 files as a list of GRanges objects (created with gff2GRangesList()) in an object named annotation, you would do:

# Loop through GRanges objects in the list adding a column `gene_id` containing the IDs in `locus_tag`
new_annotation <- lapply(annotation, function(x) {
    x$gene_id <- x$locus_tag
    return(x)
})

This way, syntenet will be able to extract gene IDs from the column gene_id and match them with the IDs from the FASTA headers (names of the AAStringSet objects).
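If the check still fails, comparing the two sets of IDs directly can help locate the mismatch. The sketch below uses plain character vectors as stand-ins; with real data, the vectors would presumably come from `names()` of an AAStringSet and from the `gene_id` column of the matching GRanges object in `new_annotation`.

```r
# Stand-ins for the real objects (assumed values taken from the example above)
fasta_ids <- c("KLDFOOAE_00001", "KLDFOOAE_00002")
# gene, mRNA, and CDS rows share a locus_tag, so IDs repeat in the annotation
gff_ids <- c(
    "KLDFOOAE_00001", "KLDFOOAE_00001", "KLDFOOAE_00001",
    "KLDFOOAE_00002", "KLDFOOAE_00002", "KLDFOOAE_00002"
)

# IDs present in one source but not the other; setdiff() ignores duplicates
missing_in_gff <- setdiff(fasta_ids, gff_ids)
missing_in_fasta <- setdiff(gff_ids, fasta_ids)

length(missing_in_gff)
length(missing_in_fasta)
```

Two empty results mean every sequence name has a matching annotation ID and vice versa.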

Could you try that and let me know if it works?

Best, Fabricio

jhcuarta commented 1 year ago

Hi, it worked perfectly:

> check_input(proteomes, new_annotation)
[1] TRUE

Thanks. Best regards

almeidasilvaf commented 1 year ago

Glad to know it worked! I'll close the issue, then.