airr-community / ogrdb

Website and associated database for managing submissions of inferred alleles

Prototype genome statistics #47

Closed williamdlees closed 5 years ago

williamdlees commented 5 years ago

Produce a prototype genome statistics page using Andrew's suggested rules, with the thresholds adjustable by the user:

The sequence must be present as at least 0.10% of all (unmutated?) sequences, unless it belongs to one of two exception lists: sequences that are not usually present on both chromosomes, or sequences known to be commonly present in the expressed (naïve) repertoire at very low frequency. The cut-off is exactly 0.10%: a frequency of 0.06%, for example, must not be rounded up to 0.1% and accepted. Sequences that are usually present on just one chromosome must be present as at least 0.05% of all sequences. These are: IGHV1-69-2, IGHV2-70D, IGHV3-43D, IGHV7-4-1. A number of other sequences are consistently seen at low frequency, and might be recorded as present if the frequency of unmutated sequences is at least 0.02%: IGHV1-45, IGHV4-28. (Any others?)
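As a rough illustration of how the frequency rule could be applied with user-adjustable thresholds, here is a minimal Python sketch. The function and parameter names are hypothetical and not taken from the OGRDB code. It assumes sequence names carry an allele suffix after `*`, and it leaves the open question of the denominator (all sequences or unmutated only) to the caller.

```python
# Gene lists copied from the rule above; the cut-offs (in percent) are defaults
# that a user could override. All names here are hypothetical.
SINGLE_CHROMOSOME_GENES = {"IGHV1-69-2", "IGHV2-70D", "IGHV3-43D", "IGHV7-4-1"}
LOW_FREQUENCY_GENES = {"IGHV1-45", "IGHV4-28"}


def min_frequency_cutoff(sequence_name, default=0.10,
                         single_chromosome=0.05, low_frequency=0.02):
    """Return the minimum frequency (percent) required for this sequence."""
    # Assumption: names look like "IGHV1-45*01"; the gene is the part before "*".
    gene = sequence_name.split("*")[0]
    if gene in SINGLE_CHROMOSOME_GENES:
        return single_chromosome
    if gene in LOW_FREQUENCY_GENES:
        return low_frequency
    return default


def passes_frequency_rule(sequence_name, sequence_count, total_count, **cutoffs):
    """True if the sequence meets its cut-off. No rounding is applied."""
    if total_count <= 0:
        return False
    percent = 100.0 * sequence_count / total_count
    return percent >= min_frequency_cutoff(sequence_name, **cutoffs)
```

For example, `passes_frequency_rule("IGHV1-2*02", 6, 10000)` is rejected at 0.06%, while `passes_frequency_rule("IGHV2-70D*04", 5, 10000)` is accepted at 0.05% because the gene is on the single-chromosome list.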

If there are two or more alleles of a gene in a genotype, the allele in question must be present as at least 20% of the (unmutated?) alignments to the gene. (If this is conservative, that is probably appropriate.)
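The allele-share rule could be sketched along these lines. The function name and the dictionary input are assumptions made for illustration; whether the counts cover all alignments or only unmutated ones is left to the caller, as in the rule itself.

```python
def alleles_passing_share_rule(allele_counts, min_percent=20.0):
    """Return the alleles of ONE gene that meet the minimum share.

    allele_counts maps allele name -> number of alignments to that allele
    (hypothetical input format). min_percent is the adjustable threshold.
    """
    # The rule only applies when the gene has two or more alleles.
    if len(allele_counts) < 2:
        return set(allele_counts)
    total = sum(allele_counts.values())
    if total == 0:
        return set()
    return {
        allele
        for allele, count in allele_counts.items()
        if 100.0 * count / total >= min_percent
    }
```

With counts of 85 and 15 for two alleles of a gene, only the first allele passes at the default 20% threshold; lowering `min_percent` to 15 would keep both.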

The percentage of unmutated sequences aligned to a particular sequence (number of unmutated alignments to the sequence / total alignments to the sequence × 100) should not be significantly different (p < 0.05) from the overall percentage of unmutated sequences in the dataset.
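The rule does not say which significance test to use. The sketch below assumes a two-sided one-sample z-test for a proportion (normal approximation), purely as one possible choice; an exact binomial or chi-squared test might suit small counts better. All names are hypothetical.

```python
import math


def unmutated_fraction_consistent(seq_unmutated, seq_total,
                                  dataset_unmutated, dataset_total, alpha=0.05):
    """True if the sequence's unmutated fraction is NOT significantly
    different from the dataset-wide fraction at level alpha.

    Assumption: two-sided one-sample z-test with normal approximation.
    """
    if seq_total <= 0 or dataset_total <= 0:
        raise ValueError("totals must be positive")
    p_seq = seq_unmutated / seq_total
    p_all = dataset_unmutated / dataset_total
    std_err = math.sqrt(p_all * (1.0 - p_all) / seq_total)
    if std_err == 0.0:
        # Dataset is entirely mutated or entirely unmutated: any deviation fails.
        return p_seq == p_all
    z = (p_seq - p_all) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_value >= alpha
```

A sequence with 50 unmutated alignments out of 100 is consistent with a dataset that is 50% unmutated, whereas 10 out of 100 against the same dataset is flagged as significantly different.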