airr-community / gold-standard-datasets

Reference AIRR-Seq datasets for benchmarking tools

Rarely used genes #2

Open scharch opened 4 years ago

scharch commented 4 years ago

We'd like to test the "dynamic range" of annotation tools: how well are low-usage genes captured and what is the impact of large usage disparities between genes?

matsohlin commented 3 years ago

IGHV1-2_05, IGHV1-3_02, IGHV4-4_01 and IGHV7-4-1_01 are alleles that are transcribed at very low levels compared with other alleles of the same genes (e.g. IGHV1-2_02, _04, _06, _07; IGHV1-3_01; IGHV4-4_02, _07; IGHV7-4-1_02). IgM data sets are available for analysis of such alleles. For IGHV1-2_05 and IGHV4-4_01, these data sets are all heterozygous, pairing the low-expressed allele with a different, highly expressed allele. For IGHV1-3_02 and IGHV7-4-1_01 there are heterozygous, homozygous and (possibly) hemizygous data sets. Further information about these data sets is available here: https://doi.org/10.3389/fimmu.2020.603980. That study focuses on data sets that can be haplotyped based on heterozygosity of IGHJ6, but additional data sets likely also carry these low-expressed alleles.

IGHV1-2_05 and IGHV4-4_01 are expressed in 6 data sets (ERR2567187, ERR2567206, ERR2567231, ERR2567243, ERR2567249, ERR2567259). ERR2567231 and ERR2567259 also encode IGHV1-3_02, and ERR2567187, ERR2567243 and ERR2567249 also encode IGHV7-4-1_01. This facilitates analysis, since a smaller number of data sets can be processed to yield information on more of these alleles.
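To make the overlap between data sets easier to work with, here is a minimal Python sketch that encodes the run-to-allele assignments listed in this comment and picks a small subset of runs covering a chosen set of alleles. The names `CARRIED`, `runs_carrying` and `greedy_cover` are made up for illustration, and the greedy selection is only a heuristic, not a guaranteed minimum. Only the four low-expressed alleles are recorded; the other alleles in each run are not.

```python
# Low-expressed alleles per data set, transcribed from the comment above.
# All six runs carry IGHV1-2_05 and IGHV4-4_01; some carry one more.
COMMON = {"IGHV1-2_05", "IGHV4-4_01"}
EXTRA = {
    "ERR2567187": {"IGHV7-4-1_01"},
    "ERR2567206": set(),
    "ERR2567231": {"IGHV1-3_02"},
    "ERR2567243": {"IGHV7-4-1_01"},
    "ERR2567249": {"IGHV7-4-1_01"},
    "ERR2567259": {"IGHV1-3_02"},
}
CARRIED = {run: COMMON | extra for run, extra in EXTRA.items()}


def runs_carrying(allele):
    """Return the sorted run accessions listed as carrying the allele."""
    return sorted(run for run, alleles in CARRIED.items() if allele in alleles)


def greedy_cover(targets):
    """Greedily pick runs until every target allele is covered.

    Heuristic only: at each step take the run that covers the most
    still-uncovered alleles (ties broken by accession order).
    """
    remaining = set(targets)
    chosen = []
    while remaining:
        best = max(sorted(CARRIED), key=lambda run: len(CARRIED[run] & remaining))
        gained = CARRIED[best] & remaining
        if not gained:
            raise ValueError(f"no listed run carries: {sorted(remaining)}")
        chosen.append(best)
        remaining -= gained
    return chosen


if __name__ == "__main__":
    all_low = COMMON | {"IGHV1-3_02", "IGHV7-4-1_01"}
    print(runs_carrying("IGHV1-3_02"))
    print(greedy_cover(all_low))
```

With the assignments above, two runs are enough to see all four low-expressed alleles, which is the practical point of the overlap.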