eeg-ebe / HaplowebMaker

Automatic implementation of haplowebs and conspecificity matrices.
Apache License 2.0

Allow users to choose between presence/absence or inferred genotypes #8

Closed (jflot closed 4 years ago)

jflot commented 6 years ago

Add a setting allowing users to choose whether each circle's diameter should be based on the number of individuals possessing the corresponding haplotype (presence/absence) or on the inferred frequency of that haplotype in the dataset (i.e., counting homozygous individuals twice).

quant42 commented 6 years ago

Run this script first: https://github.com/quant42/Scripts/blob/master/bioinf/assumeDiploid.py
Usage: python assumeDiploid.py in.fa out.fa

jflot commented 6 years ago

This is nice, but many potential HaplowebMaker users do not know how to use the command line, so the script should be incorporated into the pipeline as a simple tick box in the "Advanced Options" menu.

jflot commented 6 years ago

When an individual has two identical sequences in the input FASTA file, under the default option (which does not assume diploidy) this sequence should be counted only once for this individual, not twice, since the default is to count presence/absence per individual. This is important because some phasing pipelines output two sequences per individual by default, even if the individual is homozygous. It can also happen that the two sequences of one individual differ only at positions that become masked because another individual has missing data in these columns: in that case, the two sequences collapse into a single haplotype, and this haplotype should be counted only once in the default behaviour (but twice if the box "Count homozygous haplotypes twice" is ticked).
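To make the counting rule above concrete, here is a minimal Python sketch. It is not HaplowebMaker's actual code: the function names, the `(individual, sequence)` input format, and the choice of `N`, `?` and `-` as missing-data characters are all assumptions made for illustration. Masked columns are simply dropped, which has the same effect on haplotype identity as masking them.

```python
from collections import Counter

def mask_columns(records, missing="N?-"):
    """Drop every alignment column in which at least one sequence has missing data.

    records: list of (individual, aligned_sequence) pairs (hypothetical format).
    The set of missing-data characters is an assumption.
    """
    seqs = [seq.upper() for _, seq in records]
    bad = {i for seq in seqs for i, char in enumerate(seq) if char in missing}
    masked = ["".join(c for i, c in enumerate(seq) if i not in bad) for seq in seqs]
    return [(ind, m) for (ind, _), m in zip(records, masked)]

def count_haplotypes(records, count_homozygous_twice=False):
    """Count haplotypes per individual.

    Default: presence/absence, so each distinct haplotype counts once per individual.
    With count_homozygous_twice=True, an individual carrying a single distinct
    haplotype is assumed to be a diploid homozygote and counts twice.
    """
    per_individual = {}
    for ind, seq in records:
        per_individual.setdefault(ind, set()).add(seq)
    counts = Counter()
    for distinct in per_individual.values():
        weight = 2 if (count_homozygous_twice and len(distinct) == 1) else 1
        for hap in distinct:
            counts[hap] += weight
    return counts

# Individual A's two sequences differ only in the last column,
# which is masked because individual B has missing data there.
records = [("A", "ACGT"), ("A", "ACGA"), ("B", "ACGN"), ("B", "TCGN")]
masked = mask_columns(records)
default_counts = count_haplotypes(masked)
ticked_counts = count_haplotypes(masked, count_homozygous_twice=True)
```

In this example, individual A ends up with a single haplotype after masking, so it contributes one count by default and two counts when the box is ticked.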

jflot commented 6 years ago

So we need four options here:
1) all circles shown with the same size;
2) circle area represents the number of individuals harboring a given haplotype;
3) circle area represents the number of times a given sequence is found in the alignments;
4) circle area represents the inferred frequency of the haplotype in the population (i.e., counting each homozygous individual twice).
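The four options can be sketched as four ways of weighting a haplotype, with the circle diameter derived from the area. This is only an illustration under stated assumptions: the mode names, the `(individual, haplotype)` input format and the `unit_area` parameter are hypothetical, and option 4 assumes diploid individuals, so that an individual with a single distinct haplotype is treated as homozygous.

```python
import math
from collections import Counter, defaultdict

def haplotype_weights(records, mode):
    """Return a weight per haplotype; the circle area is proportional to it.

    records: list of (individual, haplotype) pairs, one per input sequence.
    mode (hypothetical names): "equal", "individuals", "sequences", "frequency".
    """
    by_individual = defaultdict(list)
    for individual, haplotype in records:
        by_individual[individual].append(haplotype)

    weights = Counter()
    for haplotypes in by_individual.values():
        distinct = set(haplotypes)
        for hap in distinct:
            if mode == "equal":            # option 1: same size everywhere
                weights[hap] = 1
            elif mode == "individuals":    # option 2: presence/absence per individual
                weights[hap] += 1
            elif mode == "sequences":      # option 3: raw sequence occurrences
                weights[hap] += haplotypes.count(hap)
            elif mode == "frequency":      # option 4: homozygotes count twice (diploid assumption)
                weights[hap] += 2 if len(distinct) == 1 else 1
            else:
                raise ValueError(f"unknown mode: {mode}")
    return weights

def diameter(weight, unit_area=1.0):
    """Diameter of a circle whose area is weight * unit_area."""
    return 2.0 * math.sqrt(weight * unit_area / math.pi)

# ind1 is homozygous and listed twice, ind2 is heterozygous,
# ind3 is homozygous but listed only once.
records = [("ind1", "H1"), ("ind1", "H1"), ("ind2", "H1"), ("ind2", "H2"), ("ind3", "H2")]
by_mode = {m: haplotype_weights(records, m)
           for m in ("equal", "individuals", "sequences", "frequency")}
```

Here options 3 and 4 differ for H2 because ind3 appears only once in the input: raw sequence counting gives H2 a weight of 2, whereas the inferred frequency gives 3.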

jflot commented 6 years ago

Only option 4 remains to be done.

jflot commented 6 years ago

When choosing option 1, all the connecting curves should also be drawn with the same thickness (with the total thickness split between the different colors in the case of multi-colored connections). Also, in that case, the slices of the pie charts should have equal sizes.
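A possible way to express this equal split is sketched below. The helper name and the numeric values are hypothetical; the same function is used for both the thickness of a multi-colored connection and the angles of a pie chart.

```python
def split_equally(total, colors):
    """Divide a total (curve thickness or 360 degrees of a pie) equally among colors."""
    if not colors:
        raise ValueError("at least one color is required")
    share = total / len(colors)
    return {color: share for color in colors}

# A connection of total thickness 6.0 shared by three colors,
# and a pie chart whose two slices get equal angles.
curve = split_equally(6.0, ["red", "green", "blue"])
pie = split_equally(360.0, ["red", "blue"])
```

With this approach the total thickness of every connection is the same regardless of how many colors it carries.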