MigleSur / GenAPI

Gene Absence Presence Identification tool.
GNU General Public License v3.0
26 stars 4 forks source link

Warning: The following lines have no ID and will be excluded from the analysis: #17

Open lxsteiner opened 1 year ago

lxsteiner commented 1 year ago

Thank you for the tool, but I'm having issues getting it to work with any other file than the test files.

Is this start of the output even correct for the E_coli test data?

Warning: The following lines have no ID and will be excluded from the analysis:
NC_012967.1 minced:0.1.6 repeat_region 2772561 2772833 5 . . rpt_family=CRISPR
NC_012967.1 minced:0.1.6 repeat_region 2792513 2793335 14 . . rpt_family=CRISPR
index file ./REL606.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000211 minced:0.1.6 repeat_region 3 275 5 . . rpt_family=CRISPR
index file ./SRR030253.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000047 minced:0.1.6 repeat_region 54 265 4 . . rpt_family=CRISPR
gnl|X|Y_contig000082 minced:0.1.6 repeat_region 1 482 8 . . rpt_family=CRISPR
index file ./SRR030254.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000107 minced:0.1.6 repeat_region 9 586 10 . . rpt_family=CRISPR
index file ./SRR030255.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000033 minced:0.1.6 repeat_region 15679 16315 11 . . rpt_family=CRISPR
index file ./SRR030256.fasta.fai not found, generating...
Warning: The following lines have no ID and will be excluded from the analysis:
gnl|X|Y_contig000029 minced:0.1.6 repeat_region 1 208 4 . . rpt_family=CRISPR
gnl|X|Y_contig000085 minced:0.1.6 repeat_region 30 303 5 . . rpt_family=CRISPR
index file ./SRR030257.fasta.fai not found, generating...
-----CLUSTERING WITH CD-HIT-EST-----

I assume because it's not a CDS, but I'm not sure what the expected behaviour is. Anyhow...

I downloaded my .gff files from RefSeq (e.g. GCF_009674825.1):

annotation_hashes.txt
assembly_status.txt
GCF_009674825.1_ASM967482v1_assembly_report.txt
GCF_009674825.1_ASM967482v1_assembly_stats.txt
GCF_009674825.1_ASM967482v1_cds_from_genomic.fna
GCF_009674825.1_ASM967482v1_feature_count.txt
GCF_009674825.1_ASM967482v1_feature_table.txt
GCF_009674825.1_ASM967482v1_genomic_fasta.gff
GCF_009674825.1_ASM967482v1_genomic.fna
GCF_009674825.1_ASM967482v1_genomic_gaps.txt
GCF_009674825.1_ASM967482v1_genomic.gbff
GCF_009674825.1_ASM967482v1_genomic.gff
GCF_009674825.1_ASM967482v1_genomic.gtf
GCF_009674825.1_ASM967482v1_protein.faa
GCF_009674825.1_ASM967482v1_protein.gpff
GCF_009674825.1_ASM967482v1_rna_from_genomic.fna
GCF_009674825.1_ASM967482v1_translated_cds.faa
GCF_009674825.1_ASM967482v1_wgsmaster.gbff
md5checksums.txt

To make .gff inputs for GenAPI I easily appended the genome sequence (genomic.fna) to the end of the .gff file. Because the FASTA header contains more than just the location ID: >NZ_WKJH01000027.1 Maribacter luteus strain RZ05 NODE_103_length_14485_cov_31.436865, whole genome shotgun sequence I removed everything after the ID, leaving only >NZ_WKJH01000027.1 for every FASTA entry.

The .gff needed to be modified even more, because I was running into the error with sed (similar to #16):

Warning: The following lines have no ID and will be excluded from the analysis:
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM967482v1
#!genome-build-accession NCBI_Assembly:GCF_009674825.1
#!annotation-date 04/19/2022 11:09:29
sed: -e expression #1, char 23: unknown command: `1'

and it would not run. So I removed all these additional lines because it wouldn't run otherwise.

Now that those lines are out of the way, there seems to be another issue. It seems it doesn't parse the genome sequence properly. I get:

Warning: The following lines have no ID and will be excluded from the analysis:
>NZ_WKJH01000027.1
GTCAGTTGCAAGAAGACCACTAGCACTTACGCTAAGAGCATTGTTCGTTGCCGGGTCAACTACCGCCGTGAACACATTGC
TCCCATTTAAATTCAATCCGTTCCCTGCCGTATATTGAGTATCGGTATCAACACCTGCAGTGATATCCGACAATGGAACA

and it prints out the entire FASTA section (i.e. genome). I am not sure what the issue is because the identifier corresponds to the individual CDS entries at the beginning of the .gff file:

##gff-version 3
##sequence-region NZ_WKJH01000027.1 1 14581
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2594478
NZ_WKJH01000027.1       RefSeq  region  1       14581   .       +       .       ID=NZ_WKJH01000027.1:1..14581;Dbxref=taxon:2594478;collection-date=2017-06;country=China: the Yellow Sea;gbkey=Src;isolation-source=ENVO00002007;lat-lon=35.47 N 119.6 E;mol_type=genomic DNA;old-name=Maribacter sp. RZ05;strain=RZ05;type-material=type material of
NZ_WKJH01000027.1       RefSeq  gene    1       5640    .       -       .       ID=gene-GJ691_RS16290;Name=GJ691_RS16290;gbkey=Gene;gene_biotype=protein_coding;locus_tag=GJ691_RS16290;old_locus_tag=GJ691_16285;partial=true;start_range=.,1
NZ_WKJH01000027.1       GeneMarkS-2+    CDS     1       5640    .       -       0       ID=cds-WP_154368826.1;Parent=gene-GJ691_RS16290;Dbxref=Genbank:WP_154368826.1;Name=WP_154368826.1;gbkey=CDS;inference=COORDINATES: ab initio prediction:GeneMarkS-2+;locus_tag=GJ691_RS16290;partial=true;product=hypothetical protein;protein_id=WP_154368826.1;start_range=.,1;transl_table=11
NZ_WKJH01000027.1       RefSeq  gene    6468    10106   .       -       .       ID=gene-GJ691_RS16295;Name=GJ691_RS16295;gbkey=Gene;gene_biotype=protein_coding;locus_tag=GJ691_RS16295;old_locus_tag=GJ691_16290

What is causing the issue?
How are you parsing the FASTA sequence, is line length of the multi-line FASTA entries maybe set or expected explicitly? I sometimes found this to be an issue with input files that accept multi-line FASTA files.

On that note, what is your suggestion for preparing RefSeq data as input for your tool, without having to reannotate the genomic data with other additional tools (Prokka and co.)? Seems a bit redundant to me as it's already annotated with their PGAP pipeline. Thanks.