Author Name: Jason Stajich (@hyphaltip)
Original Redmine Issue: 3328, https://redmine.open-bio.org/issues/3328
Original Date: 2012-02-17
Original Assignee: Bioperl Guts
I am Cheng-Ruei Lee, a graduate student in Duke Biology. I’m analyzing many DNA alignments of a plant species.
I first used Bio::PopGen::Utilities->aln_to_population() to read in the FASTA-format alignment, and then used Bio::PopGen::Statistics to calculate some statistics without an outgroup. Most genes work fine, but I think a bug occurs when it encounters alignments like this:
I got this data set from other people. I guess that, because of the annotation program they used, the annotated coding sequence is much longer in genotype 1 than in the other genotypes. This creates a long stretch of gaps at the very beginning of the alignment. Whenever Bio::PopGen encounters this kind of gene, the singleton count increases sharply; it seems that the long stretch of gapped sites is also counted as singletons. Some Fu & Li statistics are inflated as well. The number of segregating sites does not seem to be affected. (As a result, there are genes with hundreds of singleton sites but only a few segregating sites in total.)
Could this be a bug in Bio::PopGen::Utilities when reading in the data, or in the singleton calculation?
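To make the suspected effect concrete, here is a minimal sketch in plain Python. It is not BioPerl code and not a claim about how Bio::PopGen counts singletons internally. The toy alignment, the `singleton_sites` function, and the `gap_is_state` flag are all hypothetical, made up for illustration. It only shows that if the gap character is treated as just another state, every column where one genotype has a base and all the others have a gap looks like a singleton site.

```python
from collections import Counter

GAP = "-"

# Hypothetical toy alignment: the first sequence (think "genotype 1") has
# four extra leading bases, so the other three start with a run of gaps.
ALIGNMENT = [
    "ATGCCAACGT",
    "----CAACGT",
    "----CAATGT",
    "----CAACGA",
]

def singleton_sites(seqs, gap_is_state):
    """Count columns that are polymorphic and contain a state seen exactly once.

    gap_is_state=True  : '-' is counted like any other character state.
    gap_is_state=False : '-' is dropped from the column before counting.
    """
    total = 0
    for column in zip(*seqs):
        states = Counter(column)
        if not gap_is_state:
            # Ignore gap characters so they cannot create extra states.
            states.pop(GAP, None)
        # A singleton site needs at least two states, one of them seen once.
        if len(states) > 1 and 1 in states.values():
            total += 1
    return total

if __name__ == "__main__":
    print("gap counted as a state:", singleton_sites(ALIGNMENT, True))
    print("gap ignored:           ", singleton_sites(ALIGNMENT, False))
```

In this toy case only two columns are real singleton sites (the lone T and the lone A near the end), but counting the gap as a state adds the four leading columns as well. That matches the pattern I see, where the singleton count grows with the length of the leading gap while the real polymorphisms stay the same.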
Sincerely, Cheng-Ruei Lee cl134@duke.edu