applied-bioinformatics / An-Introduction-To-Applied-Bioinformatics

Interactive lessons in bioinformatics.
http://readIAB.org
Other
792 stars 315 forks source link

position is weakly defined #253

Closed corburn closed 6 years ago

corburn commented 7 years ago

Would it make sense to use the term 'index' instead of 'position' in the pairwise alignment exercise when referring to location in a sequence?

Although not true of all programming languages, most designed after 1970 use 0-based numbering for arrays and refer to it as an 'index'.

In the few bioinformatics formats I am familiar 'position' can refer to either 0-based or 1-based numbering -- sometimes within the same specification (e.g. samtools).

Perhaps it would make sense if 'index' was reserved for 0-based numbering and 'position' was reserved for 1-based numbering?

IAB Pairwise Alignment Exercise

This is the pairwise alignment exercise I am referencing:

What is the seqeunce in s2 from position 500 to 505? https://github.com/caporaso-lab/An-Introduction-To-Applied-Bioinformatics/blame/master/book/exercises/pairwise-alignment.md#L198

This is the closest to a definition for 'position' as used in IAB, though 'position' is not mentioned.

Hint: The numbers here may look a little bit funny. That's because the first item in a list in python is 0, not 1. So, if you want the first entry from the list sequences you would type sequences[0], and if you want the second sequence you would type sequences[1]. https://github.com/caporaso-lab/An-Introduction-To-Applied-Bioinformatics/blame/master/book/exercises/pairwise-alignment.md#L111

Example of position inconsistently defined

https://samtools.github.io/hts-specs/SAMv1.pdf

1-based coordinate system A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3, 7]. The SAM, VCF, GFF and Wiggle formats are using the 1-based coordinate system. 0-based coordinate system A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BCFv2, BED, and PSL formats are using the 0-based coordinate system.

https://samtools.github.io/hts-specs/VCFv4.3.pdf

POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)

shiffer1 commented 7 years ago

Hey Jay, I think you are bringing up something that is important here and the exercise is created to address the confusion you have brought up. You are right to criticize that it is inconsistent, but the point is that it is inconsistent by nature. We as humans count by a one base system. So if I ask the normal person on the street what is the sixth character in "AGTCTAG" they would respond, "A". For a computer, it would say whoa what base system are we using here? So the point is to illustrate that a zero base system is different than what we normally think of. That being said there is an error in the description that states the correct values to use when looking for a sequence cut that needs to be remedied.