Open slowkow opened 3 years ago
Thanks for your interest in our tool and for your constructive feedback today. Regarding your specific questions here:
How different are the genotypes called with arcasHLA between your runs? Is there a big discrepancy in every sample or in some samples only? Is there a big discrepancy in the first two fields or only in the third (or fourth) field? Are you genotyping class I, class II or both?
Now that I've read some of the code in scripts/extract.py, I can see why I am getting more genotypes called when I discard the alt sequences.
Here are all of the sequence names from the GENCODE file:
$ grep '^>' GRCh38.p13.genome.fa
>chr1 1
>chr2 2
>chr3 3
>chr4 4
>chr5 5
>chr6 6
>chr7 7
>chr8 8
>chr9 9
>chr10 10
>chr11 11
>chr12 12
>chr13 13
>chr14 14
>chr15 15
>chr16 16
>chr17 17
>chr18 18
>chr19 19
>chr20 20
>chr21 21
>chr22 22
>chrX X
>chrY Y
>chrM MT
>GL000008.2 GL000008.2
>GL000009.2 GL000009.2
>GL000194.1 GL000194.1
>GL000195.1 GL000195.1
>GL000205.2 GL000205.2
>GL000208.1 GL000208.1
>GL000213.1 GL000213.1
>GL000214.1 GL000214.1
>GL000216.2 GL000216.2
>GL000218.1 GL000218.1
>GL000219.1 GL000219.1
>GL000220.1 GL000220.1
>GL000221.1 GL000221.1
>GL000224.1 GL000224.1
>GL000225.1 GL000225.1
>GL000226.1 GL000226.1
>KQ759759.1 HG107_PATCH
>ML143376.1 HG109_PATCH
>KN538364.1 HG126_PATCH
>ML143355.1 HG1277_PATCH
>ML143348.1 HG1296_PATCH
>ML143347.1 HG1298_PATCH
>ML143346.1 HG1299_PATCH
>ML143352.1 HG1309_PATCH
>KQ759762.1 HG1311_PATCH
>ML143375.1 HG1320_PATCH
>KQ031383.1 HG1342_HG2282_PATCH
>KN538369.1 HG1362_PATCH
>ML143342.1 HG1384_PATCH
>ML143350.1 HG1395_PATCH
>ML143362.1 HG1398_PATCH
>JH159136.1 HG142_HG150_NOVEL_TEST
>ML143357.1 HG1445_PATCH
>ML143385.1 HG1466_PATCH
>ML143378.1 HG1485_PATCH
>ML143382.1 HG1506_PATCH
>ML143383.1 HG1507_PATCH
>ML143384.1 HG1509_PATCH
>JH159137.1 HG151_NOVEL_TEST
>ML143356.1 HG1521_PATCH
>ML143364.1 HG1523_PATCH
>ML143365.1 HG1524_PATCH
>KZ208923.1 HG1531_PATCH
>KZ208924.1 HG1535_PATCH
>KQ031387.1 HG1651_PATCH
>KV766195.1 HG1708_PATCH
>KZ208916.1 HG1815_PATCH
>ML143363.1 HG1817_1_PATCH
>KN538360.1 HG1832_PATCH
>KZ208920.1 HG1_PATCH
>KZ208906.1 HG2002_PATCH
>KN196484.1 HG2021_PATCH
>KN196476.1 HG2022_PATCH
>KQ983257.1 HG2023_PATCH
>KN196479.1 HG2030_PATCH
>KV575245.1 HG2046_PATCH
>KZ208917.1 HG2047_PATCH
>KZ208911.1 HG2057_PATCH
>KN196473.1 HG2058_PATCH
>KZ559108.1 HG2060_PATCH
>KN196487.1 HG2062_PATCH
>KQ759760.1 HG2063_PATCH
>KN196475.1 HG2066_PATCH
>KV880766.1 HG2067_PATCH
>KV880767.1 HG2068_PATCH
>KQ090016.1 HG2072_PATCH
>ML143374.1 HG2087_PATCH
>KV880764.1 HG2088_PATCH
>KN538361.1 HG2095_PATCH
>KN196474.1 HG2104_PATCH
>ML143360.1 HG2111_PATCH
>KZ559109.1 HG2114_PATCH
>ML143359.1 HG2115_PATCH
>KQ090022.1 HG2116_PATCH
>KV766194.1 HG2121_PATCH
>KN196478.1 HG2128_PATCH
>KZ559104.1 HG2133_PATCH
>KN196480.1 HG2191_PATCH
>ML143370.1 HG2198_PATCH
>KQ090028.1 HG2213_PATCH
>KN196483.1 HG2216_PATCH
>KN196481.1 HG2217_PATCH
>KN538363.1 HG2232_PATCH
>KN538362.1 HG2233_PATCH
>KQ031385.1 HG2235_PATCH
>KV766192.1 HG2236_PATCH
>KQ031386.1 HG2237_PATCH
>KQ031388.1 HG2239_PATCH
>KN538365.1 HG2241_PATCH
>KN538366.1 HG2242_HG2243_PATCH
>KN538367.1 HG2244_HG2245_PATCH
>ML143361.1 HG2246_HG2248_HG2276_PATCH
>KN538370.1 HG2247_PATCH
>KN538373.1 HG2249_PATCH
>KZ559113.1 HG2263_PATCH
>KV880765.1 HG2266_PATCH
>KV766196.1 HG2285_HG106_HG2252_PATCH
>KN538371.1 HG2288_HG2289_PATCH
>KQ031384.1 HG2290_PATCH
>KN538372.1 HG2291_PATCH
>KQ090021.1 HG2334_PATCH
>ML143371.1 HG2365_PATCH
>KN196482.1 HG23_PATCH
>KZ559115.1 HG2412_PATCH
>KZ208914.1 HG2419_PATCH
>KZ208922.1 HG2442_PATCH
>ML143373.1 HG2471_PATCH
>ML143369.1 HG2499_PATCH
>ML143366.1 HG2509_PATCH
>ML143367.1 HG2510_PATCH
>ML143372.1 HG2511_PATCH
>ML143380.1 HG2512_PATCH
>ML143377.1 HG2513_PATCH
>ML143345.1 HG2525_PATCH
>KQ458386.1 HG26_PATCH
>ML143358.1 HG28_PATCH
>KV575244.1 HG30_PATCH
>ML143381.1 HG439_PATCH
>KZ559100.1 HG460_PATCH
>ML143379.1 HG494_PATCH
>ML143354.1 HG545_PATCH
>ML143351.1 HG563_PATCH
>ML143353.1 HG613_PATCH
>ML143344.1 HG699_PATCH
>ML143349.1 HG705_PATCH
>KZ208912.1 HG708_PATCH
>ML143341.1 HG721_PATCH
>KZ208915.1 HG76_PATCH
>KV880768.1 HG926_PATCH
>KN196472.1 HG986_PATCH
>GL383545.1 HSCHR10_1_CTG1
>GL383546.1 HSCHR10_1_CTG2
>KI270824.1 HSCHR10_1_CTG3
>KI270825.1 HSCHR10_1_CTG4
>KQ090020.1 HSCHR10_1_CTG6
>GL383547.1 HSCHR11_1_CTG1_1
>KN538368.1 HSCHR11_1_CTG1_2
>KI270826.1 HSCHR11_1_CTG2
>KI270827.1 HSCHR11_1_CTG3
>KZ559111.1 HSCHR11_1_CTG3_1
>KI270829.1 HSCHR11_1_CTG5
>KI270830.1 HSCHR11_1_CTG6
>KI270831.1 HSCHR11_1_CTG7
>KI270832.1 HSCHR11_1_CTG8
>KI270902.1 HSCHR11_2_CTG1
>KI270903.1 HSCHR11_2_CTG1_1
>KZ559110.1 HSCHR11_2_CTG8
>KI270927.1 HSCHR11_3_CTG1
>GL877875.1 HSCHR12_1_CTG1
>GL383549.1 HSCHR12_1_CTG2
>GL383550.2 HSCHR12_1_CTG2_1
>KQ090023.1 HSCHR12_2_CTG1
>GL877876.1 HSCHR12_2_CTG2
>GL383552.1 HSCHR12_2_CTG2_1
>KI270904.1 HSCHR12_3_CTG2
>GL383553.2 HSCHR12_3_CTG2_1
>KI270835.1 HSCHR12_4_CTG2
>GL383551.1 HSCHR12_4_CTG2_1
>KI270837.1 HSCHR12_5_CTG2
>KI270833.1 HSCHR12_5_CTG2_1
>KI270834.1 HSCHR12_6_CTG2_1
>KI270836.1 HSCHR12_7_CTG2_1
>KZ208918.1 HSCHR12_8_CTG2_1
>KZ559112.1 HSCHR12_9_CTG2_1
>KI270838.1 HSCHR13_1_CTG1
>KI270839.1 HSCHR13_1_CTG2
>KI270840.1 HSCHR13_1_CTG3
>KI270841.1 HSCHR13_1_CTG4
>KI270842.1 HSCHR13_1_CTG5
>KI270843.1 HSCHR13_1_CTG6
>KQ090024.1 HSCHR13_1_CTG7
>KQ090025.1 HSCHR13_1_CTG8
>KI270844.1 HSCHR14_1_CTG1
>KI270845.1 HSCHR14_2_CTG1
>KI270846.1 HSCHR14_3_CTG1
>KI270847.1 HSCHR14_7_CTG1
>KZ208919.1 HSCHR14_8_CTG1
>ML143368.1 HSCHR14_9_CTG1
>KI270852.1 HSCHR15_1_CTG1
>KI270848.1 HSCHR15_1_CTG3
>GL383554.1 HSCHR15_1_CTG8
>KI270906.1 HSCHR15_2_CTG3
>GL383555.2 HSCHR15_2_CTG8
>KI270851.1 HSCHR15_3_CTG3
>KI270849.1 HSCHR15_3_CTG8
>KI270905.1 HSCHR15_4_CTG8
>KI270850.1 HSCHR15_5_CTG8
>KQ031389.1 HSCHR15_6_CTG8
>KI270853.1 HSCHR16_1_CTG1
>GL383556.1 HSCHR16_1_CTG3_1
>GL383557.1 HSCHR16_2_CTG3_1
>KI270855.1 HSCHR16_3_CTG1
>KQ031390.1 HSCHR16_3_CTG3_1
>KI270856.1 HSCHR16_4_CTG1
>KQ090027.1 HSCHR16_4_CTG3_1
>KQ090026.1 HSCHR16_5_CTG1
>KZ208921.1 HSCHR16_5_CTG3_1
>KI270854.1 HSCHR16_CTG2
>KI270909.1 HSCHR17_10_CTG4
>KV766197.1 HSCHR17_11_CTG4
>KZ559114.1 HSCHR17_12_CTG4
>GL383563.3 HSCHR17_1_CTG1
>KI270861.1 HSCHR17_1_CTG2
>GL383564.2 HSCHR17_1_CTG4
>GL000258.2 HSCHR17_1_CTG5
>KI270860.1 HSCHR17_1_CTG9
>KI270907.1 HSCHR17_2_CTG1
>KI270862.1 HSCHR17_2_CTG2
>GL383565.1 HSCHR17_2_CTG4
>KI270908.1 HSCHR17_2_CTG5
>KV766198.1 HSCHR17_3_CTG1
>KI270910.1 HSCHR17_3_CTG2
>GL383566.1 HSCHR17_3_CTG4
>JH159146.1 HSCHR17_4_CTG4
>JH159147.1 HSCHR17_5_CTG4
>JH159148.1 HSCHR17_6_CTG4
>KI270857.1 HSCHR17_7_CTG4
>KI270858.1 HSCHR17_8_CTG4
>KI270859.1 HSCHR17_9_CTG4
>KZ559116.1 HSCHR18_1_CTG1
>GL383567.1 HSCHR18_1_CTG1_1
>GL383568.1 HSCHR18_1_CTG2
>GL383569.1 HSCHR18_1_CTG2_1
>GL383570.1 HSCHR18_2_CTG1_1
>GL383571.1 HSCHR18_2_CTG2
>GL383572.1 HSCHR18_2_CTG2_1
>KI270863.1 HSCHR18_3_CTG2_1
>KI270864.1 HSCHR18_4_CTG1_1
>KQ458385.1 HSCHR18_5_CTG1_1
>KI270912.1 HSCHR18_ALT21_CTG2_1
>KI270911.1 HSCHR18_ALT2_CTG2_1
>KV575254.1 HSCHR19KIR_0010-5217-AB_CTG3_1
>KV575246.1 HSCHR19KIR_0019-4656-A_CTG3_1
>KV575256.1 HSCHR19KIR_0019-4656-B_CTG3_1
>KV575253.1 HSCHR19KIR_502960008-1_CTG3_1
>KV575252.1 HSCHR19KIR_502960008-2_CTG3_1
>KV575255.1 HSCHR19KIR_7191059-1_CTG3_1
>KV575259.1 HSCHR19KIR_7191059-2_CTG3_1
>KI270917.1 HSCHR19KIR_ABC08_A1_HAP_CTG3_1
>KI270918.1 HSCHR19KIR_ABC08_AB_HAP_C_P_CTG3_1
>KI270919.1 HSCHR19KIR_ABC08_AB_HAP_T_P_CTG3_1
>KV575247.1 HSCHR19KIR_CA01-TA01_1_CTG3_1
>KV575248.1 HSCHR19KIR_CA01-TA01_2_CTG3_1
>KV575250.1 HSCHR19KIR_CA01-TB01_CTG3_1
>KV575249.1 HSCHR19KIR_CA01-TB04_CTG3_1
>KV575257.1 HSCHR19KIR_CA04_CTG3_1
>KI270920.1 HSCHR19KIR_FH05_A_HAP_CTG3_1
>KI270921.1 HSCHR19KIR_FH05_B_HAP_CTG3_1
>KI270922.1 HSCHR19KIR_FH06_A_HAP_CTG3_1
>KI270923.1 HSCHR19KIR_FH06_BA1_HAP_CTG3_1
>KI270929.1 HSCHR19KIR_FH08_A_HAP_CTG3_1
>KI270930.1 HSCHR19KIR_FH08_BAX_HAP_CTG3_1
>KI270931.1 HSCHR19KIR_FH13_A_HAP_CTG3_1
>KI270932.1 HSCHR19KIR_FH13_BA2_HAP_CTG3_1
>KI270933.1 HSCHR19KIR_FH15_A_HAP_CTG3_1
>KI270882.1 HSCHR19KIR_FH15_B_HAP_CTG3_1
>KI270883.1 HSCHR19KIR_G085_A_HAP_CTG3_1
>KI270884.1 HSCHR19KIR_G085_BA1_HAP_CTG3_1
>KI270885.1 HSCHR19KIR_G248_A_HAP_CTG3_1
>KI270886.1 HSCHR19KIR_G248_BA2_HAP_CTG3_1
>KI270887.1 HSCHR19KIR_GRC212_AB_HAP_CTG3_1
>KI270888.1 HSCHR19KIR_GRC212_BA1_HAP_CTG3_1
>KV575258.1 HSCHR19KIR_HG2393_CTG3_1
>KV575251.1 HSCHR19KIR_HG2394_CTG3_1
>KV575260.1 HSCHR19KIR_HG2396_CTG3_1
>KI270889.1 HSCHR19KIR_LUCE_A_HAP_CTG3_1
>KI270890.1 HSCHR19KIR_LUCE_BDEL_HAP_CTG3_1
>GL000209.2 HSCHR19KIR_RP5_B_HAP_CTG3_1
>KI270891.1 HSCHR19KIR_RSH_A_HAP_CTG3_1
>KI270914.1 HSCHR19KIR_RSH_BA2_HAP_CTG3_1
>KI270915.1 HSCHR19KIR_T7526_A_HAP_CTG3_1
>KI270916.1 HSCHR19KIR_T7526_BDEL_HAP_CTG3_1
>GL949746.1 HSCHR19LRC_COX1_CTG3_1
>GL949747.2 HSCHR19LRC_COX2_CTG3_1
>GL949748.2 HSCHR19LRC_LRC_I_CTG3_1
>GL949749.2 HSCHR19LRC_LRC_J_CTG3_1
>GL949750.2 HSCHR19LRC_LRC_S_CTG3_1
>GL949751.2 HSCHR19LRC_LRC_T_CTG3_1
>GL949752.1 HSCHR19LRC_PGF1_CTG3_1
>GL949753.2 HSCHR19LRC_PGF2_CTG3_1
>GL383573.1 HSCHR19_1_CTG2
>GL383574.1 HSCHR19_1_CTG3_1
>GL383575.2 HSCHR19_2_CTG2
>KI270866.1 HSCHR19_2_CTG3_1
>GL383576.1 HSCHR19_3_CTG2
>KI270867.1 HSCHR19_3_CTG3_1
>KI270865.1 HSCHR19_4_CTG2
>KI270938.1 HSCHR19_4_CTG3_1
>KI270868.1 HSCHR19_5_CTG2
>KI270760.1 HSCHR1_1_CTG11
>KI270762.1 HSCHR1_1_CTG3
>GL383518.1 HSCHR1_1_CTG31
>KI270759.1 HSCHR1_1_CTG32_1
>KI270766.1 HSCHR1_2_CTG3
>GL383519.1 HSCHR1_2_CTG31
>KI270761.1 HSCHR1_2_CTG32_1
>KQ458382.1 HSCHR1_3_CTG3
>GL383520.2 HSCHR1_3_CTG31
>KI270763.1 HSCHR1_3_CTG32_1
>KQ458383.1 HSCHR1_4_CTG3
>KI270765.1 HSCHR1_4_CTG31
>KI270764.1 HSCHR1_4_CTG32_1
>KQ983255.1 HSCHR1_5_CTG3
>KQ458384.1 HSCHR1_5_CTG32_1
>KV880763.1 HSCHR1_6_CTG3
>KZ208904.1 HSCHR1_8_CTG3
>KZ208905.1 HSCHR1_9_CTG3
>KI270892.1 HSCHR1_ALT2_1_CTG32_1
>GL383577.2 HSCHR20_1_CTG1
>KI270869.1 HSCHR20_1_CTG2
>KI270870.1 HSCHR20_1_CTG3
>KI270871.1 HSCHR20_1_CTG4
>GL383578.2 HSCHR21_1_CTG1_1
>GL383579.2 HSCHR21_2_CTG1_1
>GL383580.2 HSCHR21_3_CTG1_1
>GL383581.2 HSCHR21_4_CTG1_1
>KI270872.1 HSCHR21_5_CTG2
>KI270873.1 HSCHR21_6_CTG1_1
>KI270874.1 HSCHR21_8_CTG1_1
>GL383582.2 HSCHR22_1_CTG1
>GL383583.2 HSCHR22_1_CTG2
>KI270875.1 HSCHR22_1_CTG3
>KI270876.1 HSCHR22_1_CTG4
>KI270877.1 HSCHR22_1_CTG5
>KI270878.1 HSCHR22_1_CTG6
>KI270879.1 HSCHR22_1_CTG7
>KB663609.1 HSCHR22_2_CTG1
>KI270928.1 HSCHR22_3_CTG1
>KN196485.1 HSCHR22_4_CTG1
>KN196486.1 HSCHR22_5_CTG1
>KQ458387.1 HSCHR22_6_CTG1
>KQ458388.1 HSCHR22_7_CTG1
>KQ759761.1 HSCHR22_8_CTG1
>KI270769.1 HSCHR2_1_CTG1
>KI270767.1 HSCHR2_1_CTG15
>GL383521.1 HSCHR2_1_CTG5
>KI270772.1 HSCHR2_1_CTG7
>GL383522.1 HSCHR2_1_CTG7_2
>KI270770.1 HSCHR2_2_CTG1
>KI270893.1 HSCHR2_2_CTG15
>KI270894.1 HSCHR2_2_CTG7
>GL582966.2 HSCHR2_2_CTG7_2
>KI270773.1 HSCHR2_3_CTG1
>KI270776.1 HSCHR2_3_CTG15
>KI270768.1 HSCHR2_3_CTG7_2
>KI270774.1 HSCHR2_4_CTG1
>KI270771.1 HSCHR2_4_CTG7_2
>KI270775.1 HSCHR2_5_CTG7_2
>KQ983256.1 HSCHR2_6_CTG7_2
>KZ208907.1 HSCHR2_7_CTG7_2
>KZ208908.1 HSCHR2_8_CTG7_2
>JH636055.2 HSCHR3_1_CTG1
>GL383526.1 HSCHR3_1_CTG2_1
>KI270779.1 HSCHR3_1_CTG3
>KI270777.1 HSCHR3_2_CTG2_1
>KI270782.1 HSCHR3_2_CTG3
>KI270783.1 HSCHR3_3_CTG1
>KI270778.1 HSCHR3_3_CTG2_1
>KI270895.1 HSCHR3_3_CTG3
>KZ208909.1 HSCHR3_4_CTG1
>KI270780.1 HSCHR3_4_CTG2_1
>KI270924.1 HSCHR3_4_CTG3
>ML143343.1 HSCHR3_5_CTG1
>KI270781.1 HSCHR3_5_CTG2_1
>KI270934.1 HSCHR3_5_CTG3
>KZ559105.1 HSCHR3_6_CTG2_1
>KI270935.1 HSCHR3_6_CTG3
>KZ559101.1 HSCHR3_7_CTG2_1
>KI270936.1 HSCHR3_7_CTG3
>KZ559102.1 HSCHR3_8_CTG2_1
>KI270937.1 HSCHR3_8_CTG3
>KZ559103.1 HSCHR3_9_CTG2_1
>KI270784.1 HSCHR3_9_CTG3
>KQ983258.1 HSCHR4_11_CTG12
>KV766193.1 HSCHR4_12_CTG12
>GL383527.1 HSCHR4_1_CTG12
>KI270790.1 HSCHR4_1_CTG4
>GL383528.1 HSCHR4_1_CTG6
>KI270787.1 HSCHR4_1_CTG8_1
>GL000257.2 HSCHR4_1_CTG9
>KI270785.1 HSCHR4_2_CTG12
>KQ090013.1 HSCHR4_2_CTG4
>KI270786.1 HSCHR4_3_CTG12
>KI270788.1 HSCHR4_4_CTG12
>KI270789.1 HSCHR4_5_CTG12
>KI270896.1 HSCHR4_6_CTG12
>KI270925.1 HSCHR4_7_CTG12
>KQ090014.1 HSCHR4_8_CTG12
>KQ090015.1 HSCHR4_9_CTG12
>GL383532.1 HSCHR5_1_CTG1
>KI270897.1 HSCHR5_1_CTG1_1
>GL383531.1 HSCHR5_1_CTG5
>GL949742.1 HSCHR5_2_CTG1
>GL339449.2 HSCHR5_2_CTG1_1
>KI270795.1 HSCHR5_2_CTG5
>KI270791.1 HSCHR5_3_CTG1
>GL383530.1 HSCHR5_3_CTG1_1
>KI270898.1 HSCHR5_3_CTG5
>KI270792.1 HSCHR5_4_CTG1
>KI270796.1 HSCHR5_4_CTG1_1
>KI270793.1 HSCHR5_5_CTG1
>KI270794.1 HSCHR5_6_CTG1
>KN196477.1 HSCHR5_7_CTG1
>KV575243.1 HSCHR5_8_CTG1
>KZ208910.1 HSCHR5_9_CTG1
>KQ090017.1 HSCHR6_1_CTG10
>GL383533.1 HSCHR6_1_CTG2
>KB021644.2 HSCHR6_1_CTG3
>KI270797.1 HSCHR6_1_CTG4
>KI270798.1 HSCHR6_1_CTG5
>KI270799.1 HSCHR6_1_CTG6
>KI270800.1 HSCHR6_1_CTG7
>KI270801.1 HSCHR6_1_CTG8
>KI270802.1 HSCHR6_1_CTG9
>KI270758.1 HSCHR6_8_CTG1
>GL000250.2 HSCHR6_MHC_APD_CTG1
>GL000251.2 HSCHR6_MHC_COX_CTG1
>GL000252.2 HSCHR6_MHC_DBB_CTG1
>GL000253.2 HSCHR6_MHC_MANN_CTG1
>GL000254.2 HSCHR6_MHC_MCF_CTG1
>GL000255.2 HSCHR6_MHC_QBL_CTG1
>GL000256.2 HSCHR6_MHC_SSTO_CTG1
>KI270804.1 HSCHR7_1_CTG1
>KI270806.1 HSCHR7_1_CTG4_4
>GL383534.2 HSCHR7_1_CTG6
>KI270805.1 HSCHR7_1_CTG7
>KI270899.1 HSCHR7_2_CTG1
>KI270809.1 HSCHR7_2_CTG4_4
>KI270803.1 HSCHR7_2_CTG6
>KI270807.1 HSCHR7_2_CTG7
>KZ559106.1 HSCHR7_3_CTG1
>KZ208913.1 HSCHR7_3_CTG4_4
>KI270808.1 HSCHR7_3_CTG6
>KI270811.1 HSCHR8_1_CTG1
>KI270814.1 HSCHR8_1_CTG6
>KI270810.1 HSCHR8_1_CTG7
>KI270812.1 HSCHR8_2_CTG1
>KI270815.1 HSCHR8_2_CTG7
>KI270813.1 HSCHR8_3_CTG1
>KI270816.1 HSCHR8_3_CTG7
>KI270818.1 HSCHR8_4_CTG1
>KI270817.1 HSCHR8_4_CTG7
>KI270900.1 HSCHR8_5_CTG1
>KI270819.1 HSCHR8_5_CTG7
>KI270901.1 HSCHR8_6_CTG1
>KI270820.1 HSCHR8_6_CTG7
>KI270926.1 HSCHR8_7_CTG1
>KZ559107.1 HSCHR8_7_CTG7
>KI270821.1 HSCHR8_8_CTG1
>KI270822.1 HSCHR8_9_CTG1
>GL383539.1 HSCHR9_1_CTG1
>GL383540.1 HSCHR9_1_CTG2
>GL383541.1 HSCHR9_1_CTG3
>GL383542.1 HSCHR9_1_CTG4
>KI270823.1 HSCHR9_1_CTG5
>KQ090018.1 HSCHR9_1_CTG6
>KQ090019.1 HSCHR9_1_CTG7
>KI270880.1 HSCHRX_1_CTG3
>KI270881.1 HSCHRX_2_CTG12
>KI270913.1 HSCHRX_2_CTG3
>KV766199.1 HSCHRX_3_CTG7
>KI270302.1 KI270302.1
>KI270303.1 KI270303.1
>KI270304.1 KI270304.1
>KI270305.1 KI270305.1
>KI270310.1 KI270310.1
>KI270311.1 KI270311.1
>KI270312.1 KI270312.1
>KI270315.1 KI270315.1
>KI270316.1 KI270316.1
>KI270317.1 KI270317.1
>KI270320.1 KI270320.1
>KI270322.1 KI270322.1
>KI270329.1 KI270329.1
>KI270330.1 KI270330.1
>KI270333.1 KI270333.1
>KI270334.1 KI270334.1
>KI270335.1 KI270335.1
>KI270336.1 KI270336.1
>KI270337.1 KI270337.1
>KI270338.1 KI270338.1
>KI270340.1 KI270340.1
>KI270362.1 KI270362.1
>KI270363.1 KI270363.1
>KI270364.1 KI270364.1
>KI270366.1 KI270366.1
>KI270371.1 KI270371.1
>KI270372.1 KI270372.1
>KI270373.1 KI270373.1
>KI270374.1 KI270374.1
>KI270375.1 KI270375.1
>KI270376.1 KI270376.1
>KI270378.1 KI270378.1
>KI270379.1 KI270379.1
>KI270381.1 KI270381.1
>KI270382.1 KI270382.1
>KI270383.1 KI270383.1
>KI270384.1 KI270384.1
>KI270385.1 KI270385.1
>KI270386.1 KI270386.1
>KI270387.1 KI270387.1
>KI270388.1 KI270388.1
>KI270389.1 KI270389.1
>KI270390.1 KI270390.1
>KI270391.1 KI270391.1
>KI270392.1 KI270392.1
>KI270393.1 KI270393.1
>KI270394.1 KI270394.1
>KI270395.1 KI270395.1
>KI270396.1 KI270396.1
>KI270411.1 KI270411.1
>KI270412.1 KI270412.1
>KI270414.1 KI270414.1
>KI270417.1 KI270417.1
>KI270418.1 KI270418.1
>KI270419.1 KI270419.1
>KI270420.1 KI270420.1
>KI270422.1 KI270422.1
>KI270423.1 KI270423.1
>KI270424.1 KI270424.1
>KI270425.1 KI270425.1
>KI270429.1 KI270429.1
>KI270435.1 KI270435.1
>KI270438.1 KI270438.1
>KI270442.1 KI270442.1
>KI270448.1 KI270448.1
>KI270465.1 KI270465.1
>KI270466.1 KI270466.1
>KI270467.1 KI270467.1
>KI270468.1 KI270468.1
>KI270507.1 KI270507.1
>KI270508.1 KI270508.1
>KI270509.1 KI270509.1
>KI270510.1 KI270510.1
>KI270511.1 KI270511.1
>KI270512.1 KI270512.1
>KI270515.1 KI270515.1
>KI270516.1 KI270516.1
>KI270517.1 KI270517.1
>KI270518.1 KI270518.1
>KI270519.1 KI270519.1
>KI270521.1 KI270521.1
>KI270522.1 KI270522.1
>KI270528.1 KI270528.1
>KI270529.1 KI270529.1
>KI270530.1 KI270530.1
>KI270538.1 KI270538.1
>KI270539.1 KI270539.1
>KI270544.1 KI270544.1
>KI270548.1 KI270548.1
>KI270579.1 KI270579.1
>KI270580.1 KI270580.1
>KI270581.1 KI270581.1
>KI270582.1 KI270582.1
>KI270583.1 KI270583.1
>KI270584.1 KI270584.1
>KI270587.1 KI270587.1
>KI270588.1 KI270588.1
>KI270589.1 KI270589.1
>KI270590.1 KI270590.1
>KI270591.1 KI270591.1
>KI270593.1 KI270593.1
>KI270706.1 KI270706.1
>KI270707.1 KI270707.1
>KI270708.1 KI270708.1
>KI270709.1 KI270709.1
>KI270710.1 KI270710.1
>KI270711.1 KI270711.1
>KI270712.1 KI270712.1
>KI270713.1 KI270713.1
>KI270714.1 KI270714.1
>KI270715.1 KI270715.1
>KI270716.1 KI270716.1
>KI270717.1 KI270717.1
>KI270718.1 KI270718.1
>KI270719.1 KI270719.1
>KI270720.1 KI270720.1
>KI270721.1 KI270721.1
>KI270722.1 KI270722.1
>KI270723.1 KI270723.1
>KI270724.1 KI270724.1
>KI270725.1 KI270725.1
>KI270726.1 KI270726.1
>KI270727.1 KI270727.1
>KI270728.1 KI270728.1
>KI270729.1 KI270729.1
>KI270730.1 KI270730.1
>KI270731.1 KI270731.1
>KI270732.1 KI270732.1
>KI270733.1 KI270733.1
>KI270734.1 KI270734.1
>KI270735.1 KI270735.1
>KI270736.1 KI270736.1
>KI270737.1 KI270737.1
>KI270738.1 KI270738.1
>KI270739.1 KI270739.1
>KI270740.1 KI270740.1
>KI270741.1 KI270741.1
>KI270742.1 KI270742.1
>KI270743.1 KI270743.1
>KI270744.1 KI270744.1
>KI270745.1 KI270745.1
>KI270746.1 KI270746.1
>KI270747.1 KI270747.1
>KI270748.1 KI270748.1
>KI270749.1 KI270749.1
>KI270750.1 KI270750.1
>KI270751.1 KI270751.1
>KI270752.1 KI270752.1
>KI270753.1 KI270753.1
>KI270754.1 KI270754.1
>KI270755.1 KI270755.1
>KI270756.1 KI270756.1
>KI270757.1 KI270757.1
Here's a shorter list, just the ones with chr6 in the name:
$ cat GRCh38.p13.genome.names.txt | grep -i chr6
>chr6 6
>KQ090017.1 HSCHR6_1_CTG10
>GL383533.1 HSCHR6_1_CTG2
>KB021644.2 HSCHR6_1_CTG3
>KI270797.1 HSCHR6_1_CTG4
>KI270798.1 HSCHR6_1_CTG5
>KI270799.1 HSCHR6_1_CTG6
>KI270800.1 HSCHR6_1_CTG7
>KI270801.1 HSCHR6_1_CTG8
>KI270802.1 HSCHR6_1_CTG9
>KI270758.1 HSCHR6_8_CTG1
>GL000250.2 HSCHR6_MHC_APD_CTG1
>GL000251.2 HSCHR6_MHC_COX_CTG1
>GL000252.2 HSCHR6_MHC_DBB_CTG1
>GL000253.2 HSCHR6_MHC_MANN_CTG1
>GL000254.2 HSCHR6_MHC_MCF_CTG1
>GL000255.2 HSCHR6_MHC_QBL_CTG1
>GL000256.2 HSCHR6_MHC_SSTO_CTG1
Here is the full content of the file dat/info/decoys_alts.p:
HSCHR6_MHC_APD HSCHR6_MHC_COX HSCHR6_MHC_DBB HSCHR6_MHC_MANN HSCHR6_MHC_MCF HSCHR6_MHC_QBL HSCHR6_MHC_SSTO HLA-A*01:01:01:01 HLA-A*01:01:01:02N HLA-A*01:01:38L HLA-A*01:02 HLA-A*01:03 HLA-A*01:04N HLA-A*01:09 HLA-A*01:11N HLA-A*01:14 HLA-A*01:16N HLA-A*01:20 HLA-A*02:01:01:01 HLA-A*02:01:01:02L HLA-A*02:01:01:03 HLA-A*02:01:01:04 HLA-A*02:02:01 HLA-A*02:03:01 HLA-A*02:03:03 HLA-A*02:05:01 HLA-A*02:06:01 HLA-A*02:07:01 HLA-A*02:10 HLA-A*02:251 HLA-A*02:259 HLA-A*02:264 HLA-A*02:265 HLA-A*02:266 HLA-A*02:269 HLA-A*02:279 HLA-A*02:32N HLA-A*02:376 HLA-A*02:43N HLA-A*02:455 HLA-A*02:48 HLA-A*02:51 HLA-A*02:533 HLA-A*02:53N HLA-A*02:57 HLA-A*02:60:01 HLA-A*02:65 HLA-A*02:68 HLA-A*02:77 HLA-A*02:81 HLA-A*02:89 HLA-A*02:95 HLA-A*03:01:01:01 HLA-A*03:01:01:02N HLA-A*03:01:01:03 HLA-A*03:02:01 HLA-A*03:11N HLA-A*03:21N HLA-A*03:36N HLA-A*11:01:01 HLA-A*11:01:18 HLA-A*11:02:01 HLA-A*11:05 HLA-A*11:110 HLA-A*11:25 HLA-A*11:50Q HLA-A*11:60 HLA-A*11:69N HLA-A*11:74 HLA-A*11:75 HLA-A*11:77 HLA-A*23:01:01 HLA-A*23:09 HLA-A*23:38N HLA-A*24:02:01:01 HLA-A*24:02:01:02L HLA-A*24:02:01:03 HLA-A*24:02:03Q HLA-A*24:02:10 HLA-A*24:03:01 HLA-A*24:07:01 HLA-A*24:08 HLA-A*24:09N HLA-A*24:10:01 HLA-A*24:11N HLA-A*24:152 HLA-A*24:20 HLA-A*24:215 HLA-A*24:61 HLA-A*24:86N HLA-A*25:01:01 HLA-A*26:01:01 HLA-A*26:11N HLA-A*26:15 HLA-A*26:50 HLA-A*29:01:01:01 HLA-A*29:01:01:02N HLA-A*29:02:01:01 HLA-A*29:02:01:02 HLA-A*29:46 HLA-A*30:01:01 HLA-A*30:02:01:01 HLA-A*30:02:01:02 HLA-A*30:04:01 HLA-A*30:89 HLA-A*31:01:02 HLA-A*31:01:23 HLA-A*31:04 HLA-A*31:14N HLA-A*31:46 HLA-A*32:01:01 HLA-A*32:06 HLA-A*33:01:01 HLA-A*33:03:01 HLA-A*33:07 HLA-A*34:01:01 HLA-A*34:02:01 HLA-A*36:01 HLA-A*43:01 HLA-A*66:01:01 HLA-A*66:17 HLA-A*68:01:01:01 HLA-A*68:01:01:02 HLA-A*68:01:02:01 HLA-A*68:01:02:02 HLA-A*68:02:01:01 HLA-A*68:02:01:02 HLA-A*68:02:01:03 HLA-A*68:02:02 HLA-A*68:03:01 HLA-A*68:08:01 HLA-A*68:113 HLA-A*68:17 HLA-A*68:18N HLA-A*68:22 HLA-A*68:71 HLA-A*69:01 HLA-A*74:01 HLA-A*74:02:01:01 
HLA-A*74:02:01:02 HLA-A*80:01:01:01 HLA-A*80:01:01:02 HLA-B*07:02:01 HLA-B*07:05:01 HLA-B*07:06 HLA-B*07:156 HLA-B*07:33:01 HLA-B*07:41 HLA-B*07:44 HLA-B*07:50 HLA-B*08:01:01 HLA-B*08:08N HLA-B*08:132 HLA-B*08:134 HLA-B*08:19N HLA-B*08:20 HLA-B*08:33 HLA-B*08:79 HLA-B*13:01:01 HLA-B*13:02:01 HLA-B*13:02:03 HLA-B*13:02:09 HLA-B*13:08 HLA-B*13:15 HLA-B*13:25 HLA-B*14:01:01 HLA-B*14:02:01 HLA-B*14:07N HLA-B*15:01:01:01 HLA-B*15:01:01:02N HLA-B*15:01:01:03 HLA-B*15:02:01 HLA-B*15:03:01 HLA-B*15:04:01 HLA-B*15:07:01 HLA-B*15:108 HLA-B*15:10:01 HLA-B*15:11:01 HLA-B*15:13:01 HLA-B*15:16:01 HLA-B*15:17:01:01 HLA-B*15:17:01:02 HLA-B*15:18:01 HLA-B*15:220 HLA-B*15:25:01 HLA-B*15:27:01 HLA-B*15:32:01 HLA-B*15:42 HLA-B*15:58 HLA-B*15:66 HLA-B*15:77 HLA-B*15:83 HLA-B*18:01:01:01 HLA-B*18:01:01:02 HLA-B*18:02 HLA-B*18:03 HLA-B*18:17N HLA-B*18:26 HLA-B*18:94N HLA-B*27:04:01 HLA-B*27:05:02 HLA-B*27:05:18 HLA-B*27:06 HLA-B*27:07:01 HLA-B*27:131 HLA-B*27:24 HLA-B*27:25 HLA-B*27:32 HLA-B*35:01:01:01 HLA-B*35:01:01:02 HLA-B*35:01:22 HLA-B*35:02:01 HLA-B*35:03:01 HLA-B*35:05:01 HLA-B*35:08:01 HLA-B*35:14:02 HLA-B*35:241 HLA-B*35:41 HLA-B*37:01:01 HLA-B*37:01:05 HLA-B*38:01:01 HLA-B*38:02:01 HLA-B*38:14 HLA-B*39:01:01:01 HLA-B*39:01:01:02L HLA-B*39:01:01:03 HLA-B*39:01:03 HLA-B*39:01:16 HLA-B*39:01:21 HLA-B*39:05:01 HLA-B*39:06:02 HLA-B*39:10:01 HLA-B*39:13:02 HLA-B*39:14 HLA-B*39:34 HLA-B*39:38Q HLA-B*40:01:01 HLA-B*40:01:02 HLA-B*40:02:01 HLA-B*40:03 HLA-B*40:06:01:01 HLA-B*40:06:01:02 HLA-B*40:10:01 HLA-B*40:150 HLA-B*40:40 HLA-B*40:72:01 HLA-B*40:79 HLA-B*41:01:01 HLA-B*41:02:01 HLA-B*42:01:01 HLA-B*42:02 HLA-B*42:08 HLA-B*44:02:01:01 HLA-B*44:02:01:02S HLA-B*44:02:01:03 HLA-B*44:02:17 HLA-B*44:02:27 HLA-B*44:03:01 HLA-B*44:03:02 HLA-B*44:04 HLA-B*44:09 HLA-B*44:138Q HLA-B*44:150 HLA-B*44:23N HLA-B*44:26 HLA-B*44:46 HLA-B*44:49 HLA-B*44:56N HLA-B*45:01:01 HLA-B*45:04 HLA-B*46:01:01 HLA-B*46:01:05 HLA-B*47:01:01:01 HLA-B*47:01:01:02 HLA-B*48:01:01 HLA-B*48:03:01 HLA-B*48:04 
HLA-B*48:08 HLA-B*49:01:01 HLA-B*49:32 HLA-B*50:01:01 HLA-B*51:01:01 HLA-B*51:01:02 HLA-B*51:02:01 HLA-B*51:07:01 HLA-B*51:42 HLA-B*52:01:01:01 HLA-B*52:01:01:02 HLA-B*52:01:01:03 HLA-B*52:01:02 HLA-B*53:01:01 HLA-B*53:11 HLA-B*54:01:01 HLA-B*54:18 HLA-B*55:01:01 HLA-B*55:01:03 HLA-B*55:02:01 HLA-B*55:12 HLA-B*55:24 HLA-B*55:48 HLA-B*56:01:01 HLA-B*56:03 HLA-B*56:04 HLA-B*57:01:01 HLA-B*57:03:01 HLA-B*57:06 HLA-B*57:11 HLA-B*57:29 HLA-B*58:01:01 HLA-B*58:31N HLA-B*59:01:01:01 HLA-B*59:01:01:02 HLA-B*67:01:01 HLA-B*67:01:02 HLA-B*67:02 HLA-B*73:01 HLA-B*78:01:01 HLA-B*81:01 HLA-B*82:02:01 HLA-C*01:02:01 HLA-C*01:02:11 HLA-C*01:02:29 HLA-C*01:02:30 HLA-C*01:03 HLA-C*01:06 HLA-C*01:08 HLA-C*01:14 HLA-C*01:21 HLA-C*01:30 HLA-C*01:40 HLA-C*02:02:02:01 HLA-C*02:02:02:02 HLA-C*02:10 HLA-C*02:11 HLA-C*02:16:02 HLA-C*02:69 HLA-C*02:85 HLA-C*02:86 HLA-C*02:87 HLA-C*03:02:01 HLA-C*03:02:02:01 HLA-C*03:02:02:02 HLA-C*03:02:02:03 HLA-C*03:03:01 HLA-C*03:04:01:01 HLA-C*03:04:01:02 HLA-C*03:04:02 HLA-C*03:04:04 HLA-C*03:05 HLA-C*03:06 HLA-C*03:100 HLA-C*03:13:01 HLA-C*03:20N HLA-C*03:219 HLA-C*03:261 HLA-C*03:40:01 HLA-C*03:41:02 HLA-C*03:46 HLA-C*03:61 HLA-C*04:01:01:01 HLA-C*04:01:01:02 HLA-C*04:01:01:03 HLA-C*04:01:01:04 HLA-C*04:01:01:05 HLA-C*04:01:62 HLA-C*04:03:01 HLA-C*04:06 HLA-C*04:09N HLA-C*04:128 HLA-C*04:161 HLA-C*04:177 HLA-C*04:70 HLA-C*04:71 HLA-C*05:01:01:01 HLA-C*05:01:01:02 HLA-C*05:08 HLA-C*05:09:01 HLA-C*05:93 HLA-C*06:02:01:01 HLA-C*06:02:01:02 HLA-C*06:02:01:03 HLA-C*06:23 HLA-C*06:24 HLA-C*06:46N HLA-C*07:01:01:01 HLA-C*07:01:01:02 HLA-C*07:01:02 HLA-C*07:01:19 HLA-C*07:01:27 HLA-C*07:01:45 HLA-C*07:02:01:01 HLA-C*07:02:01:02 HLA-C*07:02:01:03 HLA-C*07:02:01:04 HLA-C*07:02:01:05 HLA-C*07:02:05 HLA-C*07:02:06 HLA-C*07:02:64 HLA-C*07:04:01 HLA-C*07:04:02 HLA-C*07:06 HLA-C*07:149 HLA-C*07:18 HLA-C*07:19 HLA-C*07:26 HLA-C*07:30 HLA-C*07:32N HLA-C*07:384 HLA-C*07:385 HLA-C*07:386 HLA-C*07:391 HLA-C*07:392 HLA-C*07:49 HLA-C*07:56:02 HLA-C*07:66 HLA-C*07:67 
HLA-C*08:01:01 HLA-C*08:01:03 HLA-C*08:02:01:01 HLA-C*08:02:01:02 HLA-C*08:03:01 HLA-C*08:04:01 HLA-C*08:112 HLA-C*08:20 HLA-C*08:21 HLA-C*08:22 HLA-C*08:24 HLA-C*08:27 HLA-C*08:36N HLA-C*08:40 HLA-C*08:41 HLA-C*08:62 HLA-C*12:02:02 HLA-C*12:03:01:01 HLA-C*12:03:01:02 HLA-C*12:08 HLA-C*12:13 HLA-C*12:19 HLA-C*12:22 HLA-C*12:99 HLA-C*14:02:01 HLA-C*14:03 HLA-C*14:21N HLA-C*14:23 HLA-C*15:02:01 HLA-C*15:05:01 HLA-C*15:05:02 HLA-C*15:13 HLA-C*15:16 HLA-C*15:17 HLA-C*15:96Q HLA-C*16:01:01 HLA-C*16:02:01 HLA-C*16:04:01 HLA-C*17:01:01:01 HLA-C*17:01:01:02 HLA-C*17:01:01:03 HLA-C*17:03 HLA-C*18:01 HLA-DQA1*01:01:02 HLA-DQA1*01:02:01:01 HLA-DQA1*01:02:01:02 HLA-DQA1*01:02:01:03 HLA-DQA1*01:02:01:04 HLA-DQA1*01:03:01:01 HLA-DQA1*01:03:01:02 HLA-DQA1*01:04:01:01 HLA-DQA1*01:04:01:02 HLA-DQA1*01:05:01 HLA-DQA1*01:07 HLA-DQA1*01:10 HLA-DQA1*01:11 HLA-DQA1*02:01 HLA-DQA1*03:01:01 HLA-DQA1*03:02 HLA-DQA1*03:03:01 HLA-DQA1*04:01:02:01 HLA-DQA1*04:01:02:02 HLA-DQA1*04:02 HLA-DQA1*05:01:01:01 HLA-DQA1*05:01:01:02 HLA-DQA1*05:03 HLA-DQA1*05:05:01:01 HLA-DQA1*05:05:01:02 HLA-DQA1*05:05:01:03 HLA-DQA1*05:11 HLA-DQA1*06:01:01 HLA-DQB1*02:01:01 HLA-DQB1*02:02:01 HLA-DQB1*03:01:01:01 HLA-DQB1*03:01:01:02 HLA-DQB1*03:01:01:03 HLA-DQB1*03:02:01 HLA-DQB1*03:03:02:01 HLA-DQB1*03:03:02:02 HLA-DQB1*03:03:02:03 HLA-DQB1*03:05:01 HLA-DQB1*05:01:01:01 HLA-DQB1*05:01:01:02 HLA-DQB1*05:03:01:01 HLA-DQB1*05:03:01:02 HLA-DQB1*06:01:01 HLA-DQB1*06:02:01 HLA-DQB1*06:03:01 HLA-DQB1*06:09:01 HLA-DRB1*01:01:01 HLA-DRB1*01:02:01 HLA-DRB1*03:01:01:01 HLA-DRB1*03:01:01:02 HLA-DRB1*04:03:01 HLA-DRB1*07:01:01:01 HLA-DRB1*07:01:01:02 HLA-DRB1*08:03:02 HLA-DRB1*09:21 HLA-DRB1*10:01:01 HLA-DRB1*11:01:01 HLA-DRB1*11:01:02 HLA-DRB1*11:04:01 HLA-DRB1*12:01:01 HLA-DRB1*12:17 HLA-DRB1*13:01:01 HLA-DRB1*13:02:01 HLA-DRB1*14:05:01 HLA-DRB1*14:54:01 HLA-DRB1*15:01:01:01 HLA-DRB1*15:01:01:02 HLA-DRB1*15:01:01:03 HLA-DRB1*15:01:01:04 HLA-DRB1*15:02:01 HLA-DRB1*15:03:01:01 HLA-DRB1*15:03:01:02 HLA-DRB1*16:02:01
It seems like many of the GENCODE names are not in the decoys_alts.p file, and vice versa.
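A minimal sketch of how the two name lists could be compared programmatically. It assumes that the interesting part of each FASTA header is the second field (for example HSCHR6_MHC_APD_CTG1) and that decoys_alts.p holds an iterable of plain strings. Whether arcasHLA compares names exactly, by prefix, or by accession is not established in this thread, so the function names and the prefix rule here are illustrative only.

```python
def parse_fasta_names(header_lines):
    """Map each header's description (second field) to its accession (first field)."""
    names = {}
    for line in header_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        accession = fields[0]
        # Fall back to the accession when a header has no second field.
        description = fields[1] if len(fields) > 1 else accession
        names[description] = accession
    return names


def compare_names(header_lines, decoy_names):
    """Classify FASTA descriptions by how they relate to the decoy/ALT names."""
    fasta = set(parse_fasta_names(header_lines))
    decoys = set(decoy_names)
    exact = fasta & decoys
    # A description such as HSCHR6_MHC_APD_CTG1 extends the listed HSCHR6_MHC_APD.
    prefix_only = {
        name for name in fasta - exact
        if any(name.startswith(decoy + "_") for decoy in decoys)
    }
    matched_decoys = {
        decoy for decoy in decoys
        if decoy in exact or any(name.startswith(decoy + "_") for name in fasta)
    }
    return {
        "exact": exact,
        "prefix_only": prefix_only,
        "fasta_unmatched": fasta - exact - prefix_only,
        "decoys_unmatched": decoys - matched_decoys,
    }


# Small in-memory demo using names copied from the listings above.
headers = [
    ">chr6 6",
    ">KI270758.1 HSCHR6_8_CTG1",
    ">GL000250.2 HSCHR6_MHC_APD_CTG1",
]
decoys = ["HSCHR6_MHC_APD", "HLA-A*01:02"]
report = compare_names(headers, decoys)
for label in sorted(report):
    print(label, sorted(report[label]))
```

On the full lists, the "prefix_only" group would show the MHC haplotype contigs whose GENCODE description carries an extra _CTG1 suffix, and "fasta_unmatched" would show contigs such as HSCHR6_8_CTG1 that have no counterpart in the decoy list.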
An aside...
By the way, you might want to consider using .txt or .json files instead of pickle files. This should help users and developers to more quickly discover how things are set up just by using a text editor, instead of opening Python and running something = pickle.load(open('file.p', 'rb')).
This also has another benefit: if you decide someday to change the decoys_alts.p file, git will not show what lines changed, because it is a binary file. If it were a .txt file instead, we could see exactly what names were added or removed at any commit.
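As a rough illustration of the suggestion, the sketch below converts a pickle to a sorted, one-name-per-line text file and reads it back. It assumes the pickle holds an iterable of strings. The function names and file paths are hypothetical and are not part of arcasHLA.

```python
import pickle


def pickle_to_text(pickle_path, text_path):
    """Write the names stored in a pickle to a text file, one per line, sorted."""
    # Only unpickle files you trust: pickle.load can execute arbitrary code.
    with open(pickle_path, "rb") as handle:
        names = pickle.load(handle)
    lines = sorted(str(name) for name in names)
    with open(text_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines


def load_text(text_path):
    """Read the text file back into a set of names, skipping blank lines."""
    with open(text_path) as handle:
        return {line.strip() for line in handle if line.strip()}
```

Sorting before writing keeps the file stable between runs, so a git diff shows only the names that were actually added or removed.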
I'm calling genotypes for these genes:
"A" "B" "C" "DPB1" "DQA1" "DQB1" "DRB1"
I have 336 total genotype calls (24 samples × 2 alleles × 7 genes).
I have two runs (GENCODE with alt, UCSC without alt).
Every sample has discrepancies.
For complete matches between the two runs: 83 matches and 146 mismatches.
For 2-digit alleles: 152 matches and 133 mismatches.
I was careful to account for the random order of paternal and maternal alleles by sorting the alleles before checking for a match.
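For reference, here is one way such an order-independent comparison could be written. It is a sketch and not the code used for the counts above. It uses a multiset intersection (collections.Counter) in place of sorting, because a sorted, position-by-position comparison can miss a shared allele when the other allele sorts differently in the two runs (for example 01/02 against 02/03). The data layout, a dict keyed by (sample, gene), and the allele strings are made up for illustration.

```python
from collections import Counter


def truncate(allele, fields):
    """Keep the first `fields` colon-separated fields: 'A*01:01:01' -> 'A*01'."""
    return ":".join(allele.split(":")[:fields])


def count_matches(run_a, run_b, fields=None):
    """Count allele-level matches and mismatches for keys present in both runs.

    run_a, run_b: dict mapping (sample, gene) -> list of allele strings.
    fields: compare only the first N fields (1 corresponds to "2-digit").
    """
    matches = mismatches = 0
    for key in set(run_a) & set(run_b):
        a, b = run_a[key], run_b[key]
        if fields is not None:
            a = [truncate(x, fields) for x in a]
            b = [truncate(x, fields) for x in b]
        # Multiset intersection ignores maternal/paternal order.
        shared = sum((Counter(a) & Counter(b)).values())
        matches += shared
        mismatches += max(len(a), len(b)) - shared
    return matches, mismatches


# Hypothetical calls for one sample and one gene.
gencode = {("sample1", "A"): ["A*01:01:01", "A*02:01:01"]}
ucsc = {("sample1", "A"): ["A*02:01:02", "A*01:01:01"]}
full = count_matches(gencode, ucsc)
first_field = count_matches(gencode, ucsc, fields=1)
print(full, first_field)
```

Genes that are called in only one of the two runs are skipped here. Counting them as mismatches instead would be an equally defensible choice and would change the totals.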
I did not use the unmapped option that you mentioned, so I wonder if that might help (arcasHLA extract --unmapped).
Thanks for following up on this issue. Regarding the ALT sequences, you raise a good point, namely that many of the GENCODE names are not currently in the decoys_alts.p file. We can update the decoys_alts.p file to contain the chr6-specific ALT sequences from GENCODE -- that is an omission on our part. Indeed, it should improve the concordance between your runs to remove the chr6-specific ALT sequences (the ALT sequences from other chromosomes, I suspect, will have a smaller impact). In light of this, my recommendation would be to run mapping first without ALT sequences, at the very least without the chr6 ALT sequences specifically (and to include the --unmapped flag in the extract step). Does that significantly increase the concordance between your runs?
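To make "the chr6 ALT sequences" concrete, the sketch below collects the accessions of contigs whose header description begins with HSCHR6_, following the GENCODE naming shown in the listing earlier in the thread. The function name and the prefix rule are assumptions for illustration. Patch scaffolds named HG..._PATCH do not carry a chromosome in their name, so this rule would not catch any of them that belong to chr6.

```python
def chr6_alt_accessions(header_lines, prefix="HSCHR6_"):
    """Return accessions of contigs whose description starts with `prefix`."""
    accessions = []
    for line in header_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        # fields[0] is the accession, fields[1] the descriptive name.
        if len(fields) >= 2 and fields[1].startswith(prefix):
            accessions.append(fields[0])
    return accessions


# Headers copied from the grep output above.
headers = [
    ">chr6 6",
    ">KQ090017.1 HSCHR6_1_CTG10",
    ">GL000250.2 HSCHR6_MHC_APD_CTG1",
    ">KI270804.1 HSCHR7_1_CTG1",
]
to_drop = chr6_alt_accessions(headers)
print(to_drop)
```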
Another consideration here: is the mismatch rate similar for class I and class II genes? What is the typical coverage in your data for class I genes and class II genes (which are not constitutively expressed in every cell)? Do you have low RIN (RNA integrity number) samples? Are you using paired-end or single-end reads? The tissue of origin, as well as these technical features of your sample runs, might explain why the HLA calling can yield differing results when you change your reference with/without ALT sequences that were not previously included in our decoys_alts.p file. This is especially true for HLA class II genes, which may have little coverage overall, making the genotyping for those loci very sensitive to changes in the input reads.
(Hi Kamil, nice to meet you. I've been using your very useful snakemake tutorial and a couple more clicks led me here...) If I may chime in (you probably already know this...), I think that your very interesting problem with the chr6 HLA/MHC loci is caused by the genome biology and evolution of these sequences. There is a lot of diversity in the MHC loci in humans, which is evolutionarily advantageous. (Homogeneous banana populations are essentially clonal and are susceptible to being wiped out by a single virus, for example).
If you include the ALT assemblies, then for a given sample the reads may align to the ALT version that most closely matches that particular sample. (Maybe to two ALT seqs, since we are diploid?). You might want to pull the genotypes for that sample from those alignments, rather than the consensus chr6 location. (P.S., I bet the GENCODE and UCSC main chr seqs are identical). Some reads may still align to chr6, and you may even get genotype calls. The reads that align to ALT may be homozygous reference (with respect to ALT), but that might be homozygous non-reference if they'd aligned to chr6.
If you align vs chr6 alone, then all the alternate reads will be forced to align on the full chr6 sequence, yet they will differ from it, and generate a lot of SNP calls.
So it seems to me that the task is more difficult than simple variant calling... you need to determine which ALT, or haplotype block(s), you have present in each sample, and then whether or not the sample has any SNPs on top of that... or am I being naive? Sorry if all this is obvious, and you'd already considered it!
Thanks for the comments, it seems like getting confident genotype calls might be a bit more challenging than I expected. My next step will be to visualize the read pileups for each gene to assess if there is enough data to support any calls.
Some day, I might try the https://github.com/lkuchenb/MultiHLA pipeline...
Could I please ask if you might be willing to discuss a few questions?
chr6 in each BAM file, I think the answer is "yes.")
I tried two different options.
Run 1: GENCODE
I downloaded this sequence:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.p13.genome.fa.gz
Then, I mapped reads with STAR and ran arcasHLA on the BAMs.
Run 2: Filtered UCSC
I downloaded this sequence:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
First, I discarded all sequences from the FASTA file with names that are not in the list chr1-22,X,Y. Then, I mapped reads with STAR and ran arcasHLA on the BAMs.
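The filtering step could look roughly like the sketch below, which keeps only FASTA records whose name (the first word after ">") is one of the primary chromosomes: chr1 through chr22, chrX and chrY, as they appear in the header listing earlier in the thread. This is not the command that was actually used for Run 2. The function names and the file paths in the usage note are hypothetical.

```python
# Primary chromosomes to keep; chrM is left out, matching the list above.
PRIMARY = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}


def filter_fasta(lines, keep_names):
    """Yield the lines of every record whose name is in keep_names."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_names
        if keep:
            yield line


def filter_fasta_file(in_path, out_path, keep_names=PRIMARY):
    """Stream a FASTA file to a new file, line by line, keeping chosen records."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_fasta(src, keep_names))
```

A call such as filter_fasta_file("hg38.fa", "hg38.primary.fa") would stream the file line by line, so the whole genome never has to be held in memory.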
Results
arcasHLA called a greater number of genotypes in the UCSC run than the GENCODE run. I don't know what the true genotypes might be. Many of the genotypes do not match between the two runs.
I think the main reason for the difference between the two runs is whether or not we include the "alternate" or "patch" or "scaffold" sequences. For the GENCODE run, the alternate sequences were included in the read-mapping step. For the UCSC run, the alternate sequences were not included.
I haven't tested to see if the chromosome sequences (chr1-22,X,Y) are identical between GENCODE and UCSC, but I might guess that they are very similar or identical.