Thank you for the tool, but I'm having issues getting it to work with any other file than the test files.
Is this start of the output even correct for the E_coli test data?
Warning: The following lines have no ID and will be excluded from the analysis:
NC_012967.1 minced:0.1.6 repeat_region 2772561 2772833 5 . . rpt_family=CRISPR
NC_012967.1 minced:0.1.6 repeat_region 2792513 2793335 14 . . rpt_family=CRISPR
index file ./REL606.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000211 minced:0.1.6 repeat_region 3 275 5 . . rpt_family=CRISPR
index file ./SRR030253.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000047 minced:0.1.6 repeat_region 54 265 4 . . rpt_family=CRISPR
gnl|X|Y_contig000082 minced:0.1.6 repeat_region 1 482 8 . . rpt_family=CRISPR
index file ./SRR030254.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000107 minced:0.1.6 repeat_region 9 586 10 . . rpt_family=CRISPR
index file ./SRR030255.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000033 minced:0.1.6 repeat_region 15679 16315 11 . . rpt_family=CRISPR
index file ./SRR030256.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000029 minced:0.1.6 repeat_region 1 208 4 . . rpt_family=CRISPR
gnl|X|Y_contig000085 minced:0.1.6 repeat_region 30 303 5 . . rpt_family=CRISPR
index file ./SRR030257.fasta.fai not found, generating...
-----CLUSTERING WITH CD-HIT-EST-----
I assume because it's not a CDS, but I'm not sure what the expected behaviour is. Anyhow...
I downloaded my .gff files from RefSeq (e.g. GCF_009674825.1):
To make .gff inputs for GenAPI I easily appended the genome sequence (genomic.fna) to the end of the .gff file. Because the FASTA header contains more than just the location ID: >NZ_WKJH01000027.1 Maribacter luteus strain RZ05 NODE_103_length_14485_cov_31.436865, whole genome shotgun sequence I removed everything after the ID, leaving only >NZ_WKJH01000027.1 for every FASTA entry.
The .gff needed to be modified even more, because I was running into the error with sed (similar to #16):
Warning: The following lines have no ID and will be excluded from the analysis:
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM967482v1
#!genome-build-accession NCBI_Assembly:GCF_009674825.1
#!annotation-date 04/19/2022 11:09:29
sed: -e expression #1, char 23: unknown command: `1'
and it would not run. So I removed all these additional lines because it wouldn't run otherwise.
Now that those lines are out of the way, there seems to be another issue. It seems it doesn't parse the genome sequence properly. I get:
Warning: The following lines have no ID and will be excluded from the analysis:
>NZ_WKJH01000027.1
GTCAGTTGCAAGAAGACCACTAGCACTTACGCTAAGAGCATTGTTCGTTGCCGGGTCAACTACCGCCGTGAACACATTGC
TCCCATTTAAATTCAATCCGTTCCCTGCCGTATATTGAGTATCGGTATCAACACCTGCAGTGATATCCGACAATGGAACA
and it prints out the entire FASTA section (i.e. genome). I am not sure what the issue is because the identifier corresponds to the individual CDS entries at the beginning of the .gff file:
##gff-version 3
##sequence-region NZ_WKJH01000027.1 1 14581
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2594478
NZ_WKJH01000027.1 RefSeq region 1 14581 . + . ID=NZ_WKJH01000027.1:1..14581;Dbxref=taxon:2594478;collection-date=2017-06;country=China: the Yellow Sea;gbkey=Src;isolation-source=ENVO00002007;lat-lon=35.47 N 119.6 E;mol_type=genomic DNA;old-name=Maribacter sp. RZ05;strain=RZ05;type-material=type material of
NZ_WKJH01000027.1 RefSeq gene 1 5640 . - . ID=gene-GJ691_RS16290;Name=GJ691_RS16290;gbkey=Gene;gene_biotype=protein_coding;locus_tag=GJ691_RS16290;old_locus_tag=GJ691_16285;partial=true;start_range=.,1
NZ_WKJH01000027.1 GeneMarkS-2+ CDS 1 5640 . - 0 ID=cds-WP_154368826.1;Parent=gene-GJ691_RS16290;Dbxref=Genbank:WP_154368826.1;Name=WP_154368826.1;gbkey=CDS;inference=COORDINATES: ab initio prediction:GeneMarkS-2+;locus_tag=GJ691_RS16290;partial=true;product=hypothetical protein;protein_id=WP_154368826.1;start_range=.,1;transl_table=11
NZ_WKJH01000027.1 RefSeq gene 6468 10106 . - . ID=gene-GJ691_RS16295;Name=GJ691_RS16295;gbkey=Gene;gene_biotype=protein_coding;locus_tag=GJ691_RS16295;old_locus_tag=GJ691_16290
What is causing the issue?
How are you parsing the FASTA sequence, is line length of the multi-line FASTA entries maybe set or expected explicitly? I sometimes found this to be an issue with input files that accept multi-line FASTA files.
On that note, what is your suggestion for preparing RefSeq data as input for your tool, without having to reannotate the genomic data with other additional tools (Prokka and co.)? Seems a bit redundant to me as it's already annotated with their PGAP pipeline. Thanks.
Thank you for the tool, but I'm having issues getting it to work with any other file than the test files.
Is this start of the output even correct for the E_coli test data?
I assume because it's not a CDS, but I'm not sure what the expected behaviour is. Anyhow...
I downloaded my .gff files from RefSeq (e.g. GCF_009674825.1):
To make .gff inputs for GenAPI I easily appended the genome sequence (genomic.fna) to the end of the .gff file. Because the FASTA header contains more than just the location ID:
>NZ_WKJH01000027.1 Maribacter luteus strain RZ05 NODE_103_length_14485_cov_31.436865, whole genome shotgun sequence
I removed everything after the ID, leaving only>NZ_WKJH01000027.1
for every FASTA entry.The .gff needed to be modified even more, because I was running into the error with sed (similar to #16):
and it would not run. So I removed all these additional lines because it wouldn't run otherwise.
Now that those lines are out of the way, there seems to be another issue. It seems it doesn't parse the genome sequence properly. I get:
and it prints out the entire FASTA section (i.e. genome). I am not sure what the issue is because the identifier corresponds to the individual CDS entries at the beginning of the .gff file:
What is causing the issue?
How are you parsing the FASTA sequence, is line length of the multi-line FASTA entries maybe set or expected explicitly? I sometimes found this to be an issue with input files that accept multi-line FASTA files.
On that note, what is your suggestion for preparing RefSeq data as input for your tool, without having to reannotate the genomic data with other additional tools (Prokka and co.)? Seems a bit redundant to me as it's already annotated with their PGAP pipeline. Thanks.