MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

Start position in FastA description lines off by one from start position in GFF3 #153

Closed kubu4 closed 3 months ago

kubu4 commented 3 months ago

The Results.gff3 starting coordinates are 1 greater than those listed in the FastA description lines.

Here's an example that I'm seeing (this holds true for all the sequnces in both files, though):

grep "Cluster_316.mature" Results.gff3 mir.fasta

Results.gff3:NC_058066.1    ShortStack  mature_miRNA    12757147    12757168    1413    -   .   ID=Cluster_316.mature;Parent=Cluster_316
mir.fasta:>Cluster_316.mature::NC_058066.1:12757146-12757168(-)

The Results.gff3 lists the starting position as: 12757147

The mir.fasta lists the starting position as: 12757146

If using the gff3 start/stop positions, then the expected length of the sequence would be 22 bases. Examine the length of the sequence in the corresponding FastA entry:

grep "Cluster_316.mature" mir.fasta.fai

Cluster_316.mature::NC_058066.1:12757146-12757168(-)    22  196 22  23

We see that 22 bases is indeed the length, so it appears that the FastA description line is incorrect for some reason.

MikeAxtell commented 3 months ago

Oh my, you are correct. Looks like the coordinates in the headers of the mir.fasta file are incorrect. I bet it is because, internally, I use bed coordinates which are zero-based half open, but forgot to convert to one-based fully closed which is what I mean to put in the output. Should be a simple patch...

MikeAxtell commented 3 months ago

Yes, internally ShortStack using bedtools getfasta with a temporary BED file to write the FASTA of microRNAs, microRNA-stars, and precursors. The headers are auto-generated by bedtools getfasta, which used BED-style (zero-based, half-open) genomic coordinates instead of the one-based, fully closed coordinates that one expects for tools like samtools. So the coordinates are correct, but the coordinate system diverges from what I would consider the more "normal" one-based fully closed coordinate system.

MikeAxtell commented 3 months ago

As of commit bab44af the coordinates in the headers of file mir.fasta are now in one-based, fully-closed coordinate system.

This will be included in the next release (not sure when that will be).

Thanks for pointing this out!