Closed mahafarhat closed 5 years ago
This item will need some more specification:
Here is the written response for future reference:
1- Drop-down box with the user first selecting a drug, then a gene-locus, and then a mutation. I will send a list of drugs/gene-loci/mutations that the user can select from. Ideally the user can narrow down the list by typing a few characters to filter it. Each mutation by definition occurs in one gene. The user can select more than one mutation from more than one gene. Please add a side disclaimer stating the following: "Please select one or more mutations from this list. Please note that if a mutation is not selected, the model will assume that it was tested for/sequenced and is not present." <Note: even though we talked about not including drug, I think it will be simpler to include the additional drug category.>
2- We need to reformat the predict page to accommodate this new feature. First, move the following text, currently on the side of the predict page: "FASTQ files Please label paired end FASTQ files as follows ... with your desired isolate or strain name"
into a linked popup or hover box (or another page, whichever is more user-friendly/visible) opened from the word FASTQ in the following text (currently on the left-hand side):
"To speed up the upload process please:
Also change the right hand text: "VCF file VCF stands for Variant Call Format and is further described in this PDF"
to
"Variant Call Format (VCF) is further described in this PDF"
and embed this in a link/popup/hover box, similar to FASTQ, under the word VCF in the same sentence.
In place of this chunk of text on the right-hand side, please place the drop-downs that I describe above. We will need to tweak the rest of the text on the page to accommodate this, but we can do that a little later.
3- I will try to attach this below in a word document
4- let's forget about the text box for pasting
5- The data will actually enter as a positive status for the mutations and should be converted to a matrix.csv file with binary status for feeding into the R script (TBpredict.R)
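A minimal sketch of the conversion described in item 5, for illustration only. The real layout of matrix.csv is defined by the example files and is not shown in this thread, so the layout here (one header row of mutation names, one 0/1 row per isolate, a first column named "strain") is an assumption, as are the function name `build_matrix_csv` and the isolate label `manual_entry`.

```python
import csv
import io


def build_matrix_csv(drug_lists, selected, isolate="manual_entry"):
    """Turn a set of selected (positive) mutations into one binary CSV row.

    drug_lists: dict mapping drug abbreviation -> list of mutation names.
    selected:   iterable of mutation names the user picked.
    The CSV layout is an assumption, not the confirmed TBpredict.R format.
    """
    # Combine the per-drug lists into one column list, keeping first-seen
    # order and dropping repeats (some mutations appear under several drugs).
    columns, seen = [], set()
    for names in drug_lists.values():
        for name in names:
            if name not in seen:
                seen.add(name)
                columns.append(name)

    unknown = sorted(set(selected) - seen)
    if unknown:
        raise ValueError(f"mutations not in any drug list: {unknown}")

    chosen = set(selected)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["strain"] + columns)
    # Unselected mutations are written as 0, i.e. tested for and not present.
    writer.writerow([isolate] + [1 if c in chosen else 0 for c in columns])
    return buf.getvalue()
```

Writing unselected mutations as 0 mirrors the disclaimer in item 1: anything not selected is treated as tested for and absent.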
Here is a list of the drugs/gene-loci/mutations. Use the mutation names as is, without reformatting, but add the drug/gene-locus layers for easier searching.
This is the minimum list of predictive variables by drug, ordered by decreasing predictive importance. The three-letter abbreviation refers to the drug. Please note that mutation statuses from all drugs need to be combined into one matrix.csv file before it is fed to TBpredict.R (please consult the example files to see the format of matrix.csv).
The gene-locus name is the second to last field for most mutations e.g. for mutation "SNP_CN_2155168_CG_katG_S315T", the gene name is "katG".
For P mutations it's the word "promoter" and the last field in the mutation name. e.g. "SNP_P_1673425_CT.15_fabG1.inhA" the gene-locus name is "promoter fabG1.inhA".
For I mutations it's the word "intergenic" and the last field in the mutation name. e.g. "SNP_I_1473637_A.21_rrs.rrl" the gene locus name is "intergenic rrs.rrl"
For INS and DEL mutations, the gene-locus name is the last field in the name (i.e. after the last underscore)
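The naming rules above could be sketched roughly as follows. The function name `gene_locus` is hypothetical. The handling of `SNP_N` names (taking the last field) is an assumption that is not stated in the rules; it is based only on list entries such as SNP_N_1472359_A514C_rrs, where the second-to-last field is not a gene.

```python
def gene_locus(mutation: str) -> str:
    """Derive the gene-locus label from a mutation name, per the rules above."""
    fields = mutation.split("_")
    if len(fields) < 4:
        raise ValueError(f"unexpected mutation name: {mutation!r}")
    kind, code = fields[0], fields[1]

    if kind in ("INS", "DEL"):
        # INS and DEL: the gene-locus is the last field.
        return fields[-1]
    if code == "P":
        # Promoter mutations: "promoter" plus the last field.
        return "promoter " + fields[-1]
    if code == "I":
        # Intergenic mutations: "intergenic" plus the last field.
        return "intergenic " + fields[-1]
    if code == "N":
        # ASSUMPTION (not in the stated rules): SNP_N names end in the gene.
        return fields[-1]
    # Most mutations: the gene is the second-to-last field.
    return fields[-2]
```

A few names in the lists below do not fit these rules cleanly. For example, DEL_F_2289069_d172A_pncA_F58L ends in an amino-acid change rather than a gene, so the INS/DEL rule as written would return "F58L"; such names would need to be reviewed by hand.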
INH
1 SNP_CN_2155168_CG_katG_S315T
2 SNP_P_1673425_CT.15_fabG1.inhA
3 SNP_CN_4247429_AG_embB_M306V
4 SNP_CN_4247431_GA_embB_M306I
5 SNP_CN_1674481_TG_inhA_S94A
6 SNP_CN_4247431_GC_embB_M306I
7 SNP_CN_2155168_CT_katG_S315N
8 SNP_CN_409569_GA_iniB_A70T
9 SNP_CN_4247730_GC_embB_G406A
10 SNP_P_1673423_GT.17_fabG1.inhA
11 SNP_CN_4247729_GA_embB_G406S
12 SNP_CN_4247402_TG_embB_S297A
13 SNP_CN_2518919_GA_kasA_G269S
14 SNP_CN_2726338_TG_ahpC_V49G
15 SNP_P_4243221_CT.12_embA.embB
16 SNP_P_1673432_TC.8_fabG1.inhA
17 SNP_P_1673432_TG.8_fabG1.inhA
18 SNP_CN_2155167_GT_katG_S315R
RIF
1 SNP_CN_761155_CT_rpoB_S450L
2 SNP_CN_761110_AT_rpoB_D435V
3 SNP_CN_761139_CT_rpoB_H445Y
4 SNP_CN_761140_AG_rpoB_H445R
5 SNP_CN_761140_AT_rpoB_H445L
6 SNP_CN_761155_CG_rpoB_S450W
7 SNP_CN_761139_CG_rpoB_H445D
8 SNP_CN_761277_AT_rpoB_I491F
9 SNP_CN_760314_GT_rpoB_V170F
10 SNP_CN_761109_GT_rpoB_D435Y
11 SNP_CN_761161_TC_rpoB_L452P
12 SNP_CN_761102_AC_rpoB_Q432H
13 SNP_CN_761095_TC_rpoB_L430P
14 SNP_CN_761155_CA_rpoB_S450.
PZA
1 SNP_CN_2289090_TC_pncA_H51R
2 SNP_P_2289252_TC.30_pncA
3 SNP_CN_2289070_AG_pncA_F58L
4 SNP_CN_2288883_AG_pncA_L120P
5 SNP_CN_2289213_TG_pncA_Q10P
6 SNP_CN_2288839_TG_pncA_T135P
7 SNP_CN_2289081_GA_pncA_P54L
8 SNP_CN_2288883_AC_pncA_L120R
9 INS_F_2288725_i516C_pncA
10 SNP_CN_2289016_TG_pncA_T76P
11 SNP_CN_2288953_CT_pncA_G97S
12 SNP_CN_2288933_GC_pncA_Y103.
13 SNP_CN_2288818_TC_pncA_T142A
14 SNP_CN_2288847_CT_pncA_G132D
15 SNP_CN_2289212_CG_pncA_Q10H
16 SNP_CN_2288841_GA_pncA_A134V
17 INS_F_2288851_i390C_pncA
18 INS_F_2288887_i354A_pncA
19 SNP_CN_2288848_CT_pncA_G132S
20 SNP_P_2289245_TA.37_pncA
21 SNP_CN_2289207_TC_pncA_D12G
22 SNP_CN_2288820_TG_pncA_Q141P
23 SNP_CN_2288704_CA_pncA_V180F
24 SNP_CN_2288805_GT_pncA_A146E
25 SNP_CN_2289180_AC_pncA_V21G
26 SNP_CN_2288973_AG_pncA_I90T
27 INS_F_2288851_i390CC_pncA
28 SNP_CN_2289216_AC_pncA_V9G
29 SNP_CN_2289072_TA_pncA_H57L
30 SNP_CN_2288887_AC_pncA_W119G
31 SNP_CN_2289097_CT_pncA_D49N
32 SNP_CN_2288805_GA_pncA_A146V
33 DEL_F_2288939_d302TCCGGTGTAG_pncA
34 SNP_CN_2288988_AG_pncA_L85P
35 SNP_CN_2289207_TG_pncA_D12A
36 SNP_CN_2289228_AG_pncA_I5T
37 SNP_CN_2289220_CT_pncA_D8N
38 DEL_F_2289069_d172A_pncA_F58L
39 DEL_N_2288942_d299GGTGTA_pncA
40 SNP_CN_2289015_GA_pncA_T76I
41 DEL_F_2288776_d465GCACCCTG_pncA
42 SNP_CN_2288925_AG_pncA_F106S
43 SNP_CN_2288835_TC_pncA_D136G
44 SNP_CN_2289040_AG_pncA_W68R
45 SNP_CN_2289099_TG_pncA_K48T
46 SNP_CN_2289214_GA_pncA_Q10.
47 SNP_CN_2288944_TG_pncA_T100P
48 INS_F_2288825_i416C_pncA
49 SNP_CN_2289042_GT_pncA_S67.
50 SNP_CN_2288826_AG_pncA_V139A
51 SNP_CN_2288878_GA_pncA_Q122.
52 SNP_CN_2288697_AC_pncA_L182W
53 SNP_CN_2289073_GA_pncA_H57Y
54 SNP_CN_2289150_AC_pncA_I31S
55 SNP_CN_2288727_AG_pncA_L172P
56 SNP_CN_2288919_CT_pncA_G108E
57 SNP_CN_2288935_AG_pncA_Y103H
58 INS_F_2288835_i406T_pncA
59 SNP_CN_2288952_CT_pncA_G97D
60 SNP_CN_2288697_AG_pncA_L182S
61 SNP_CN_2288853_AT_pncA_V130E
62 SNP_CN_2288730_GA_pncA_A171V
63 SNP_CN_2288775_AG_pncA_L156P
64 SNP_CN_2288850_AC_pncA_V131G
65 INS_F_2289009_i232C_pncA_G78G
66 INS_F_2289050_i191T_pncA_Y64.
67 SNP_CN_2288964_AC_pncA_V93G
68 SNP_CN_2288853_AC_pncA_V130G
69 DEL_F_2288697_d544AACT_pncA
70 SNP_CN_2289009_CA_pncA_G78V
71 SNP_CN_2289043_AG_pncA_S67P
72 SNP_CN_2288938_CG_pncA_A102P
73 SNP_P_2289252_TG.30_pncA
74 SNP_CN_2289073_GC_pncA_H57D
75 SNP_CN_2289206_GC_pncA_D12E
76 DEL_F_2289060_d181GTGCCGGA_pncA
77 SNP_CN_2289202_AG_pncA_C14R
78 SNP_CN_2289050_AT_pncA_Y64.
79 SNP_CN_2289046_AG_pncA_S66P
80 SNP_CN_2288784_GT_pncA_T153N
81 SNP_CN_2289037_GA_pncA_P69S
82 SNP_CN_2288718_AC_pncA_M175R
83 SNP_CN_2289042_GC_pncA_S67W
84 SNP_CN_2288956_TG_pncA_K96Q
85 SNP_CN_2289142_AC_pncA_Y34D
86 SNP_CN_2288844_AG_pncA_I133T
87 SNP_CN_2289040_AC_pncA_W68G
88 SNP_CN_2289054_TG_pncA_D63A
89 SNP_CN_2289090_TG_pncA_H51P
90 SNP_CN_2289186_AG_pncA_L19P
91 SNP_CN_2288826_AC_pncA_V139G
92 SNP_CN_2288818_TG_pncA_T142P
93 SNP_CN_2288817_GA_pncA_T142M
94 SNP_CN_2289219_TC_pncA_D8G
95 SNP_CN_2289072_TC_pncA_H57R
96 SNP_CN_2289028_AG_pncA_C72R
97 INS_F_2288942_i299T_pncA
98 DEL_F_2288923_d318C_pncA
99 SNP_CN_2288742_GA_pncA_T167I
100 SNP_CN_2289095_GC_pncA_D49E
101 SNP_CN_2288956_TC_pncA_K96E
102 SNP_CN_2288703_AC_pncA_V180G
103 SNP_CN_2289069_AC_pncA_F58C
104 SNP_CN_2288955_TG_pncA_K96T
105 SNP_CN_2288764_TC_pncA_T160A
106 SNP_P_2289251_AC.31_pncA
107 SNP_CN_2288696_CA_pncA_L182F
108 SNP_CN_2288778_AC_pncA_V155G
109 SNP_CN_2289103_TC_pncA_T47A
110 SNP_CN_2288943_GA_pncA_T100I
111 SNP_CN_2288718_AG_pncA_M175T
112 SNP_CN_2289030_TC_pncA_H71R
113 SNP_CN_2289162_AG_pncA_L27P
114 SNP_CN_2289030_TG_pncA_H71P
115 SNP_CN_2288827_CT_pncA_V139M
116 SNP_CN_2289231_AG_pncA_L4S
117 SNP_CN_2289213_TC_pncA_Q10R
118 SNP_CN_2288965_CA_pncA_V93L
119 SNP_CN_2289001_AC_pncA_F81V
120 SNP_CN_2289054_TC_pncA_D63G
121 SNP_CN_2288766_AC_pncA_L159R
122 SNP_CN_2288869_CA_pncA_V125F
123 SNP_CN_2289091_GA_pncA_H51Y
124 SNP_CN_2288859_AC_pncA_V128G
EMB
1 SNP_CN_4247429_AG_embB_M306V
2 SNP_CN_4247431_GA_embB_M306I
3 SNP_CN_4247431_GC_embB_M306I
4 SNP_CN_4247730_GC_embB_G406A
5 SNP_CN_4248003_AG_embB_Q497R
6 SNP_CN_4249518_AG_embB_H1002R
7 SNP_CN_409569_GA_iniB_A70T
8 SNP_CN_4247729_GA_embB_G406S
9 SNP_CN_4247431_GT_embB_M306I
10 SNP_CN_4247429_AC_embB_M306L
11 SNP_P_4243222_CA.11_embA.embB
12 SNP_CN_4247574_AC_embB_D354A
13 SNP_CN_4247495_GT_embB_D328Y
14 SNP_CN_4249583_GA_embB_D1024N
15 SNP_CN_4243392_AG_embA_N54D
16 SNP_P_4243225_CT.8_embA.embB
17 SNP_CN_4242182_GT_embC_A774S
18 SNP_CN_4247729_GT_embB_G406C
STR
1 SNP_CN_781687_AG_rpsL_K43R
2 SNP_N_1472359_A514C_rrs
3 SNP_CN_781822_AC_rpsL_K88T
4 SNP_N_1473246_A1401G_rrs
5 SNP_CN_781822_AG_rpsL_K88R
6 SNP_CN_4407809_CA_gid_D132Y
7 SNP_N_1472358_C513T_rrs
8 SNP_CN_4407927_TG_gid_E92D
9 SNP_N_1472751_A906G_rrs
10 SNP_CN_4407934_AC_gid_L90R
11 SNP_N_1472362_C517T_rrs
12 SNP_N_1472753_A908C_rrs
13 SNP_CN_781822_AT_rpsL_K88M
14 SNP_CN_4407832_AG_gid_V124A
15 SNP_CN_4408091_GT_gid_P38T
16 SNP_I_1473637_A.21_rrs.rrl
17 SNP_CN_4408094_CT_gid_G37R
18 DEL_F_4407640_d562A_gid
19 SNP_N_1473109_T1264G_rrs
20 SNP_CN_4407967_AC_gid_L79W
21 SNP_CN_4407967_AG_gid_L79S
22 SNP_CN_4407768_CA_gid_L145F
23 SNP_CN_4407995_TG_gid_S70R
24 DEL_F_4407852_d350C_gid
25 SNP_N_1473167_T1322G_rrs
26 DEL_F_4408023_d179T_gid
27 DEL_F_4408116_d86G_gid
28 SNP_CN_4408060_TG_gid_H48P
29 SNP_CN_4408138_TC_gid_Y22C
30 SNP_CN_4408064_GA_gid_R47W
31 SNP_CN_4408148_CG_gid_A19P
32 SNP_CN_4407947_GA_gid_L86F
33 SNP_CN_4407916_CA_gid_R96L
34 SNP_CN_4407748_AG_gid_L152S
35 SNP_N_1473343_G1498T_rrs
36 SNP_CN_4407985_CG_gid_G73A
37 SNP_CN_4408102_CT_gid_G34E
ETH
1 SNP_P_1673425_CT.15_fabG1.inhA
2 SNP_CN_4326333_CG_ethA_A381P
3 SNP_CN_4326116_GA_ethA_T453I
4 SNP_CN_1674481_TG_inhA_S94A
5 SNP_CN_4326714_GA_ethA_Q254.
6 SNP_CN_1674263_TC_inhA_I21T
7 SNP_CN_4327416_CA_ethA_A20S
8 DEL_F_4326184_d1289G_ethA
9 SNP_CN_4327380_AC_ethA_Y32D
10 SNP_CN_1674434_TG_inhA_V78G
11 INS_F_4326141_i1332C_ethA
12 SNP_CN_4326600_GA_ethA_R292.
13 SNP_CN_4326713_TG_ethA_Q254P
14 SNP_CN_4326305_GA_ethA_S390F
15 SNP_P_1673423_GT.17_fabG1.inhA
16 INS_F_4326722_i751C_ethA
17 SNP_CN_1673449_AC_fabG1_T4P
18 SNP_CN_4327311_AG_ethA_S55P
19 SNP_CN_4326278_GT_ethA_S399.
20 SNP_CN_4327148_CT_ethA_W109.
KAN
1 SNP_N_1473246_A1401G_rrs
2 SNP_CN_1918745_AGtlyA.269W
CAP
1 SNP_N_1473246_A1401G_rrs
2 SNP_N_1473109_T1264G_rrs
3 SNP_N_1472753_A908C_rrs
4 SNP_N_1473160_G1315A_rrs
5 SNP_N_1473343_G1498T_rrs
AMK
1 SNP_N_1473246_A1401G_rrs
2 SNP_N_1472359_A514C_rrs
CIP
1 SNP_CN_7582_AG_gyrA_D94G
2 SNP_CN_7570_CT_gyrA_A90V
3 SNP_CN_7582_AC_gyrA_D94A
4 SNP_CN_7581_GT_gyrA_D94Y
5 SNP_CN_6735_AC_gyrB_N538T
6 SNP_CN_7572_TC_gyrA_S91P
7 SNP_CN_7566_GA_gyrA_D89N
LEVO
1 SNP_CN_7582_AG_gyrA_D94G
2 SNP_CN_7570_CT_gyrA_A90V
3 SNP_CN_7581_GT_gyrA_D94Y
4 SNP_CN_7581_GA_gyrA_D94N
5 SNP_CN_7582_AC_gyrA_D94A
6 SNP_CN_7572_TC_gyrA_S91P
7 SNP_CN_7563_GT_gyrA_G88C
8 SNP_CN_7566_GA_gyrA_D89N
OFLX
1 SNP_CN_7582_AG_gyrA_D94G
2 SNP_CN_7570_CT_gyrA_A90V
3 SNP_CN_7582_AC_gyrA_D94A
4 SNP_CN_7581_GA_gyrA_D94N
5 SNP_CN_6735_AC_gyrB_N538T
6 SNP_CN_7581_GC_gyrA_D94H
PAS
1 SNP_CN_3073852_TC_thyA_H207R
2 SNP_CN_3074449_AT_thyA_L8Q
3 SNP_CN_3074182_TC_thyA_Q97R
4 SNP_P_3074479_AG.157_thyA
I'm running final consistency checks on the manual data input.
The example matrix.csv files have 1839 columns, 1125 of them unique.
The data listed above only have 239 columns. Is there a process we're missing, or are there a lot more mutations that we have not included in the list because they are not important for the prediction?
The prediction still works though, so not sure. Testing continues.
Hi Martin, yes, the mutations a user can select from the drop-down are a subset of the total, essentially only the mutations that are important for the prediction. Maha
(In reply to Martin Owens's comment of Mon, Jun 13, 2016 at 2:32 PM.)
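The column counts and the subset relationship discussed in this exchange could be checked with something like the sketch below. It assumes the first row of matrix.csv is a header of mutation names, which this thread does not confirm; the function names are hypothetical.

```python
import csv
import io


def header_stats(csv_text):
    """Return (total columns, unique columns) for the first row of a CSV."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return len(header), len(set(header))


def missing_from_header(csv_text, selectable):
    """List the selectable mutations that do not appear in the CSV header.

    An empty result means the selectable list is a subset of the header.
    """
    header = set(next(csv.reader(io.StringIO(csv_text))))
    return [name for name in selectable if name not in header]
```

For a file on disk, the text can be read with `open(path, newline="").read()` before calling either function.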
@doctormo I'm reopening this issue for the manual input predict option. It has the mutations we need. Please restore this into the drug-gene lookup tables. I'm assuming the code itself is still intact.
Maha,
The names for some of these mutations have changed. For example, SNP_CN_3073852_TC_thyA_H207R is now SNP_CN_3073852_T620C_H207R_thyA in the database.
I've managed to add all the rrl and rrs mutations, but everything else fails.
I thought we wrote code to deal with the naming discrepancy previously (matching on the third field, 3073852 in the example you give, and on the TC before and after the 620). Can you double check, please? If not, I will need to update the naming of these mutations to the new convention.
(In reply to Martin Owens's comment of Wed, Jun 5, 2019 at 10:07 AM.)
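The matching idea described above (third field plus the two bases on either side of the number) might look roughly like this. It is an illustration only, not the code that exists in the site or the pipeline; the names `snp_key` and `same_mutation` are hypothetical, and it only handles SNP names whose fourth field is a single reference base, an optional number, and a single alternate base.

```python
import re

# Fourth field in either convention: "TC" (older) or "T620C" (newer).
_CHANGE = re.compile(r"^([ACGT])(\d*)([ACGT])$")


def snp_key(name):
    """Return (position, ref, alt) for a SNP name, or None if it does not fit."""
    fields = name.split("_")
    if len(fields) < 4 or fields[0] != "SNP" or not fields[2].isdigit():
        return None
    match = _CHANGE.match(fields[3])
    if match is None:
        # e.g. promoter names such as "CT.15" are not handled by this sketch.
        return None
    return int(fields[2]), match.group(1), match.group(3)


def same_mutation(a, b):
    """True if two names refer to the same position and base change."""
    key_a, key_b = snp_key(a), snp_key(b)
    return key_a is not None and key_a == key_b
```

Because the key ignores the gene and amino-acid fields, their order in the name (thyA_H207R versus H207R_thyA) does not affect the match.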
I have parsing code that can match both formats. But because these names are submitted to the predict script, I need to make sure the mutation names aren't going to upset the script.
The simplest way is to present the mutations to the user in the format provided above in this comment trail. This is the same format as what's used in the R script. The generate_matrix.py code in our WGS pipeline converts between the current vcf/var format and this older format used by the R script.
We don't have a system in place to do this. The mutation names have mutated, and the database has no special provision for "mutation names as they appear in the predict script". I've matched up the mutations we have so far, so we may have to sit down and go through how the mutations are provided. I think we could do it better than we currently do.
Ok I'm attaching an example matrix input file (csv format) for the TB predict. I think it is the updated format after all. Let's discuss on Friday.
https://gentb.hms.harvard.edu/tb/media/data/tbdata_00000317/Peru3419_matrix.csv
The new deployed list is now a fixed json file, which can be edited directly by anyone.
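The layout of the deployed JSON file is not shown in this thread. Purely as an illustration, a drug -> gene-locus -> mutations layout, with the type-to-filter behaviour requested at the top of the issue, could look like the sketch below. The JSON structure, `EXAMPLE_JSON`, and `filter_mutations` are all hypothetical; the mutation names are taken from the lists above.

```python
import json

# HYPOTHETICAL layout: drug -> gene-locus -> list of mutation names.
EXAMPLE_JSON = """
{
  "INH": {
    "katG": ["SNP_CN_2155168_CG_katG_S315T", "SNP_CN_2155168_CT_katG_S315N"],
    "promoter fabG1.inhA": ["SNP_P_1673425_CT.15_fabG1.inhA"]
  },
  "RIF": {
    "rpoB": ["SNP_CN_761155_CT_rpoB_S450L", "SNP_CN_761110_AT_rpoB_D435V"]
  }
}
"""


def filter_mutations(lookup, drug, gene_locus, typed=""):
    """Return mutations under drug/gene_locus whose name contains `typed`.

    Matching is case-insensitive; an empty `typed` returns the whole list.
    Unknown drugs or gene-loci yield an empty list.
    """
    names = lookup.get(drug, {}).get(gene_locus, [])
    needle = typed.lower()
    return [name for name in names if needle in name.lower()]


lookup = json.loads(EXAMPLE_JSON)
```

For example, `filter_mutations(lookup, "INH", "katG", "s315n")` narrows the katG list to the single S315N entry.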
This is an alternative form of data input: instead of uploading a file, the user can select one or more mutations from a list, or copy and paste a list of mutations into a box. This data is then the input for the pipeline and would enter as a var file. This will allow users who don't have large sequence files, including clinicians and laboratory technicians, to use the prediction pipeline without having any sequence data on hand.