MPUSP / snakemake-crispr-guides

A Snakemake workflow for the design of small guide RNAs (sgRNAs) for CRISPR applications.
MIT License

Additional types of CDS annotations #21

Closed Michael-Astbury closed 8 months ago

Michael-Astbury commented 8 months ago

When running the pipeline for Synechococcus PCC 11901, the number of CDS considered by the pipeline was lower than the number of CDS annotated in the genome.

This was because some CDS annotations list GeneMarkS-2+ as their source rather than RefSeq or Protein Homology. By including GeneMarkS-2+ in the gff_source_type dictionary in the get_genome script, I was able to include the previously missing CDS.
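
For illustration, here is a minimal sketch of how a source-based filter of this kind drops or keeps CDS records. The dictionary layout and the helper `keep_feature` are assumptions made for this example; the actual `gff_source_type` in the get_genome script may be structured differently.

```python
# Hypothetical layout: GFF source (column 2) -> feature types (column 3) to keep.
# This is NOT copied from the workflow; it only mirrors the idea described above.
gff_source_type = {
    "RefSeq": {"CDS"},
    "Protein Homology": {"CDS"},
    "GeneMarkS-2+": {"CDS"},  # the added entry that recovers the missing CDS
}


def keep_feature(source, feature_type, allowed=gff_source_type):
    """Return True if this (source, type) pair is listed in the mapping."""
    return feature_type in allowed.get(source, set())


# (source, type) pairs as they appear in the annotation discussed here
records = [
    ("RefSeq", "gene"),
    ("GeneMarkS-2+", "CDS"),
    ("Protein Homology", "CDS"),
]

kept = [r for r in records if keep_feature(*r)]
print(kept)
```

Without the `GeneMarkS-2+` entry, the second record would be filtered out, which matches the lower CDS count described above.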

Are there any other methods of annotating CDS when assembling genomes that should be included here?

m-jahn commented 8 months ago

I'm not aware of many other annotation types. But you're right, in principle it should be possible to get features from a variety of sources. The default options are really just the ones I have encountered personally.

Can you paste the head of your annotation here? It would be useful to see what it looks like.

I could simply add a list of valid options/tags to the config.yml file. Then users can add their own when needed.
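
A rough sketch of that idea, assuming the parsed config.yml is available as a plain dictionary. The key name `extra_cds_sources`, the default list, and the function `allowed_sources` are all hypothetical and not part of the current workflow:

```python
# Defaults mentioned in this thread; the real defaults may differ.
DEFAULT_CDS_SOURCES = ["RefSeq", "Protein Homology"]


def allowed_sources(config):
    """Merge default sources with user-supplied ones, keeping order, no duplicates."""
    extra = config.get("extra_cds_sources") or []
    # dict.fromkeys preserves insertion order and drops repeated entries
    return list(dict.fromkeys(DEFAULT_CDS_SOURCES + list(extra)))


# Stand-in for the dictionary obtained from parsing config.yml
config = {"extra_cds_sources": ["GeneMarkS-2+", "RefSeq"]}

sources = allowed_sources(config)
print(sources)
```

Users whose annotation uses another source tag would only need to add it to the list in their config rather than edit the script.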

Michael-Astbury commented 8 months ago

Here's the head of the .gff:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM557713v1
#!genome-build-accession NCBI_Assembly:GCF_005577135.1
#!annotation-date 10/08/2023 12:45:54
#!annotation-source NCBI RefSeq 
##sequence-region NZ_CP040360.1 1 3081514
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2579791
NZ_CP040360.1   RefSeq  region  1   3081514 .   +   .   ID=NZ_CP040360.1:1..3081514;Dbxref=taxon:2579791;Is_circular=true;Name=ANONYMOUS;collection-date=2017-04-17;country=Singapore: near Pulau Ubin;culture-collection=PCC:11901;gbkey=Src;genome=chromosome;isolation-source=estuarine water [ENVO:01000301];lat-lon=1.421583 N 103.955722 E;mol_type=genomic DNA;old-name=Synechococcus sp. NTU 1704;strain=PCC 11901
NZ_CP040360.1   RefSeq  gene    1   378 .   +   .   ID=gene-FEK30_RS01170;Name=FEK30_RS01170;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FEK30_RS01170;old_locus_tag=FEK30_01170
NZ_CP040360.1   GeneMarkS-2+    CDS 1   378 .   +   0   ID=cds-WP_138071278.1;Parent=gene-FEK30_RS01170;Dbxref=GenBank:WP_138071278.1;Name=WP_138071278.1;gbkey=CDS;inference=COORDINATES: ab initio prediction:GeneMarkS-2+;locus_tag=FEK30_RS01170;product=hypothetical protein;protein_id=WP_138071278.1;transl_table=11
NZ_CP040360.1   RefSeq  gene    375 1010    .   -   .   ID=gene-FEK30_RS01175;Name=FEK30_RS01175;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FEK30_RS01175;old_locus_tag=FEK30_01175
NZ_CP040360.1   GeneMarkS-2+    CDS 375 1010    .   -   0   ID=cds-WP_138071280.1;Parent=gene-FEK30_RS01175;Dbxref=GenBank:WP_138071280.1;Name=WP_138071280.1;gbkey=CDS;inference=COORDINATES: ab initio prediction:GeneMarkS-2+;locus_tag=FEK30_RS01175;product=hypothetical protein;protein_id=WP_138071280.1;transl_table=11
NZ_CP040360.1   RefSeq  gene    1086    2045    .   -   .   ID=gene-FEK30_RS01180;Name=FEK30_RS01180;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FEK30_RS01180;old_locus_tag=FEK30_01180
NZ_CP040360.1   Protein Homology    CDS 1086    2045    .   -   0   ID=cds-WP_138071282.1;Parent=gene-FEK30_RS01180;Dbxref=GenBank:WP_138071282.1;Name=WP_138071282.1;Ontology_term=GO:0016491,GO:0016651;gbkey=CDS;go_function=oxidoreductase activity|0016491||IEA,oxidoreductase activity%2C acting on NAD(P)H|0016651||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_013320252.1;locus_tag=FEK30_RS01180;product=Gfo/Idh/MocA family oxidoreductase;protein_id=WP_138071282.1;transl_table=11
NZ_CP040360.1   RefSeq  gene    2162    2455    .   +   .   ID=gene-FEK30_RS01185;Name=FEK30_RS01185;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FEK30_RS01185;old_locus_tag=FEK30_01185
NZ_CP040360.1   Protein Homology    CDS 2162    2455    .   +   0   ID=cds-WP_138071284.1;Parent=gene-FEK30_RS01185;Dbxref=GenBank:WP_138071284.1;Name=WP_138071284.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_017320150.1;locus_tag=FEK30_RS01185;product=ferredoxin;protein_id=WP_138071284.1;transl_table=11
NZ_CP040360.1   RefSeq  gene    2819    3112    .   +   .   ID=gene-FEK30_RS01190;Name=petF1;gbkey=Gene;gene=petF1;gene_biotype=protein_coding;locus_tag=FEK30_RS01190;old_locus_tag=FEK30_01190
NZ_CP040360.1   Protein Homology    CDS 2819    3112    .   +   0   ID=cds-WP_012307922.1;Parent=gene-FEK30_RS01190;Dbxref=GenBank:WP_012307922.1;Name=WP_012307922.1;gbkey=CDS;gene=petF1;inference=COORDINATES: similar to AA sequence:RefSeq:WP_017320150.1;locus_tag=FEK30_RS01190;product=ferredoxin PetF1;protein_id=WP_012307922.1;transl_table=11
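
To find out which source tags annotate CDS in a given file, the sources in column 2 can be tallied per feature type. The following is a small sketch, not part of the workflow; the function name `count_sources` is made up for this example, and the sample lines are shortened versions of the records above with tab separators, as GFF3 requires.

```python
from collections import Counter


def count_sources(gff_lines, feature_type="CDS"):
    """Count how often each source (column 2) occurs for one feature type (column 3)."""
    counts = Counter()
    for line in gff_lines:
        # skip blank lines, directives (##) and comments (#)
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GFF3 record
        source, ftype = fields[1], fields[2]
        if ftype == feature_type:
            counts[source] += 1
    return counts


sample = [
    "##gff-version 3",
    "NZ_CP040360.1\tRefSeq\tgene\t1\t378\t.\t+\t.\tID=gene-FEK30_RS01170",
    "NZ_CP040360.1\tGeneMarkS-2+\tCDS\t1\t378\t.\t+\t0\tID=cds-WP_138071278.1",
    "NZ_CP040360.1\tProtein Homology\tCDS\t1086\t2045\t.\t-\t0\tID=cds-WP_138071282.1",
]

source_counts = count_sources(sample)
print(dict(source_counts))
```

For a real file, passing an open file handle, e.g. `count_sources(open("genome.gff"))`, works the same way, since the function only iterates over lines. Any source that shows up here but is missing from the list of accepted sources points to CDS the pipeline would skip.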