farhat-lab / gentb-site

The genTB project, the Django site, variant calling and prediction pipeline, and mapping pipeline with hooks to two ravens
https://gentb.hms.harvard.edu

Add drop down option for selecting mutations for prediction #15

Closed mahafarhat closed 5 years ago

mahafarhat commented 8 years ago

This is an alternative form of data input: instead of uploading a file, the user can select one or more mutations from a list, or copy and paste a list of mutations into a box. This data is then the input for the pipeline and would enter as a var file. This will allow users who don't have large sequence files, including clinicians or laboratory technicians, to use the prediction pipeline without having any sequence data on hand.

doctormo commented 8 years ago

This item will need some more specification:

  1. How will the user choose manual input?
  2. What will the page look like?
  3. What should be in the drop down list?
  4. What should be acceptable in a text box for pasting? (csv? of what values?)
  5. What is the format of the var file?

mahafarhat commented 8 years ago

Here is the written response for future reference:

1- Drop down box with the user first selecting drug, then gene-locus, and then mutation. I will send a list of drugs/gene-loci/mutations that the user can select from. Ideally the user can narrow down the list by typing in a few characters to allow filtering. Each mutation by definition occurs in one gene. The user can select more than one mutation from more than one gene. Please add a side disclaimer for this stating the following: "Please select one or more mutations from this list. Please note that if a mutation is not selected, the model will assume that it was tested for/sequenced and is not present." <Note: even though we talked about not including drug, I think it will be simpler to include the additional drug category.>

2- We need to reformat the predict page to accommodate this new feature. First move the following text currently on the side of the predict page: "FASTQ files Please label paired end FASTQ files as follows ... with your desired isolate or strain name"

into a linked popup or hover box (or another page, whichever is more user friendly/visible) from the FASTQ word in the following text (currently on the left hand side):

"To speed up the upload process please:

  1. Place the FASTQ or VCF files (1 or many in a dropbox folder)...."

Also change the right hand text: "VCF file VCF stands for Variant Call Format and is further described in this PDF"

to

"Variant Call Format (VCF) is further described in this PDF"

and embed this in a link/popup/hover box, similar to FASTQ, under the VCF word in the same sentence.

In place of this chunk of text on the right hand side, please place the drop downs that I describe above. We will need to tweak the rest of the text on the page to accommodate this, but can do this a little later.

3- I will try to attach this below in a word document

4- let's forget about the text box for pasting

5- The data will actually enter as a positive status for the mutations and should be converted to a matrix.csv file with binary status for feeding into the R script (TBpredict.R)
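As a rough illustration of item 5, a manual selection could be turned into a one-row binary matrix along the following lines. The layout used here (a header row of mutation names preceded by a `strain` column) and the names `build_matrix_csv`, `strain`, and `all_mutations` are assumptions made for the sketch; the actual matrix.csv layout expected by TBpredict.R should be taken from the example files.

```python
import csv
import io


def build_matrix_csv(strain, selected, all_mutations):
    """Return CSV text with one 0/1 column per known mutation.

    Assumed layout (not confirmed by this thread): a header row of
    mutation names after a leading 'strain' column, then one data row.
    """
    chosen = set(selected)
    unknown = chosen - set(all_mutations)
    if unknown:
        # Refuse names that are not in the selectable list.
        raise ValueError(f"unknown mutations: {sorted(unknown)}")

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["strain"] + list(all_mutations))
    # Unselected mutations are written as 0, i.e. "tested and not present".
    writer.writerow([strain] + [1 if m in chosen else 0 for m in all_mutations])
    return buf.getvalue()
```

Writing 0 for every unselected mutation mirrors the disclaimer in item 1: anything the user does not pick is treated as tested and absent.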

mahafarhat commented 8 years ago

Here is a list of the drugs/gene-loci/mutations. Use the mutation names as is without reformatting, but add the drug/gene-locus layers for easier searching.
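One possible way to hold the drug/gene-locus layers for the drop downs is a nested lookup with a simple type-ahead filter. This is only a sketch: `build_lookup` and `filter_mutations` are hypothetical names, and the site's real data model may look quite different.

```python
from collections import defaultdict


def build_lookup(entries):
    """Group (drug, gene_locus, mutation) triples as drug -> locus -> [mutations]."""
    lookup = defaultdict(lambda: defaultdict(list))
    for drug, locus, mutation in entries:
        lookup[drug][locus].append(mutation)
    return lookup


def filter_mutations(lookup, drug, locus, typed=""):
    """Return mutations for a drug and locus whose name contains the typed text."""
    needle = typed.lower()
    candidates = lookup.get(drug, {}).get(locus, [])
    # Mutation names are returned unchanged; only the comparison is case-insensitive.
    return [m for m in candidates if needle in m.lower()]
```

The mutation names themselves are left untouched, as requested; only the two grouping layers are added around them.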

The minimum list of predictive variables by drug ordered by decreasing measure of predictive importance. The three letter abbreviation refers to the drug. Please note that mutation statuses from all drugs need to be combined into one matrix.csv file before this is fed to TBpredict.R (please consult example files to see the format of matrix.csv).

The gene-locus name is the second to last field for most mutations e.g. for mutation "SNP_CN_2155168_CG_katG_S315T", the gene name is "katG".

For P mutations it's the word "promoter" and the last field in the mutation name, e.g. for "SNP_P_1673425_CT.15_fabG1.inhA" the gene-locus name is "promoter fabG1.inhA".

For I mutations it's the word "intergenic" and the last field in the mutation name, e.g. for "SNP_I_1473637_A.21_rrs.rrl" the gene-locus name is "intergenic rrs.rrl".

For INS and DEL mutations, the gene-locus name is the last field in the name (i.e. after the last underscore)
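The naming rules above can be sketched as a small helper. `gene_locus` is a hypothetical name, and the handling of `SNP_N_...` names (taking the last field, as in "SNP_N_1472359_A514C_rrs") is an assumption, since the rules above only say the second-to-last field applies to "most" mutations. Names that deviate from the stated rules, such as an INS or DEL name with an extra trailing amino-acid field, would need special handling.

```python
def gene_locus(mutation):
    """Derive the gene-locus label from a mutation name, following the rules above."""
    fields = mutation.split("_")
    kind, subtype = fields[0], fields[1]

    if kind in ("INS", "DEL"):
        # Insertions and deletions: the last field, as stated above.
        return fields[-1]
    if subtype == "P":
        return "promoter " + fields[-1]
    if subtype == "I":
        return "intergenic " + fields[-1]
    if subtype == "N":
        # Assumption: non-coding SNPs carry the gene in the last field.
        return fields[-1]
    # Coding SNPs: gene is the second-to-last field.
    return fields[-2]
```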

INH

  1. SNP_CN_2155168_CG_katG_S315T
  2. SNP_P_1673425_CT.15_fabG1.inhA
  3. SNP_CN_4247429_AG_embB_M306V
  4. SNP_CN_4247431_GA_embB_M306I
  5. SNP_CN_1674481_TG_inhA_S94A
  6. SNP_CN_4247431_GC_embB_M306I
  7. SNP_CN_2155168_CT_katG_S315N
  8. SNP_CN_409569_GA_iniB_A70T
  9. SNP_CN_4247730_GC_embB_G406A
  10. SNP_P_1673423_GT.17_fabG1.inhA
  11. SNP_CN_4247729_GA_embB_G406S
  12. SNP_CN_4247402_TG_embB_S297A
  13. SNP_CN_2518919_GA_kasA_G269S
  14. SNP_CN_2726338_TG_ahpC_V49G
  15. SNP_P_4243221_CT.12_embA.embB
  16. SNP_P_1673432_TC.8_fabG1.inhA
  17. SNP_P_1673432_TG.8_fabG1.inhA
  18. SNP_CN_2155167_GT_katG_S315R

RIF

  1. SNP_CN_761155_CT_rpoB_S450L
  2. SNP_CN_761110_AT_rpoB_D435V
  3. SNP_CN_761139_CT_rpoB_H445Y
  4. SNP_CN_761140_AG_rpoB_H445R
  5. SNP_CN_761140_AT_rpoB_H445L
  6. SNP_CN_761155_CG_rpoB_S450W
  7. SNP_CN_761139_CG_rpoB_H445D
  8. SNP_CN_761277_AT_rpoB_I491F
  9. SNP_CN_760314_GT_rpoB_V170F
  10. SNP_CN_761109_GT_rpoB_D435Y
  11. SNP_CN_761161_TC_rpoB_L452P
  12. SNP_CN_761102_AC_rpoB_Q432H
  13. SNP_CN_761095_TC_rpoB_L430P
  14. SNP_CN_761155_CA_rpoB_S450.

PZA

  1. SNP_CN_2289090_TC_pncA_H51R
  2. SNP_P_2289252_TC.30_pncA
  3. SNP_CN_2289070_AG_pncA_F58L
  4. SNP_CN_2288883_AG_pncA_L120P
  5. SNP_CN_2289213_TG_pncA_Q10P
  6. SNP_CN_2288839_TG_pncA_T135P
  7. SNP_CN_2289081_GA_pncA_P54L
  8. SNP_CN_2288883_AC_pncA_L120R
  9. INS_F_2288725_i516C_pncA
  10. SNP_CN_2289016_TG_pncA_T76P
  11. SNP_CN_2288953_CT_pncA_G97S
  12. SNP_CN_2288933_GC_pncA_Y103.
  13. SNP_CN_2288818_TC_pncA_T142A
  14. SNP_CN_2288847_CT_pncA_G132D
  15. SNP_CN_2289212_CG_pncA_Q10H
  16. SNP_CN_2288841_GA_pncA_A134V
  17. INS_F_2288851_i390C_pncA
  18. INS_F_2288887_i354A_pncA
  19. SNP_CN_2288848_CT_pncA_G132S
  20. SNP_P_2289245_TA.37_pncA
  21. SNP_CN_2289207_TC_pncA_D12G
  22. SNP_CN_2288820_TG_pncA_Q141P
  23. SNP_CN_2288704_CA_pncA_V180F
  24. SNP_CN_2288805_GT_pncA_A146E
  25. SNP_CN_2289180_AC_pncA_V21G
  26. SNP_CN_2288973_AG_pncA_I90T
  27. INS_F_2288851_i390CC_pncA
  28. SNP_CN_2289216_AC_pncA_V9G
  29. SNP_CN_2289072_TA_pncA_H57L
  30. SNP_CN_2288887_AC_pncA_W119G
  31. SNP_CN_2289097_CT_pncA_D49N
  32. SNP_CN_2288805_GA_pncA_A146V
  33. DEL_F_2288939_d302TCCGGTGTAG_pncA
  34. SNP_CN_2288988_AG_pncA_L85P
  35. SNP_CN_2289207_TG_pncA_D12A
  36. SNP_CN_2289228_AG_pncA_I5T
  37. SNP_CN_2289220_CT_pncA_D8N
  38. DEL_F_2289069_d172A_pncA_F58L
  39. DEL_N_2288942_d299GGTGTA_pncA
  40. SNP_CN_2289015_GA_pncA_T76I
  41. DEL_F_2288776_d465GCACCCTG_pncA
  42. SNP_CN_2288925_AG_pncA_F106S
  43. SNP_CN_2288835_TC_pncA_D136G
  44. SNP_CN_2289040_AG_pncA_W68R
  45. SNP_CN_2289099_TG_pncA_K48T
  46. SNP_CN_2289214_GA_pncA_Q10.
  47. SNP_CN_2288944_TG_pncA_T100P
  48. INS_F_2288825_i416C_pncA
  49. SNP_CN_2289042_GT_pncA_S67.
  50. SNP_CN_2288826_AG_pncA_V139A
  51. SNP_CN_2288878_GA_pncA_Q122.
  52. SNP_CN_2288697_AC_pncA_L182W
  53. SNP_CN_2289073_GA_pncA_H57Y
  54. SNP_CN_2289150_AC_pncA_I31S
  55. SNP_CN_2288727_AG_pncA_L172P
  56. SNP_CN_2288919_CT_pncA_G108E
  57. SNP_CN_2288935_AG_pncA_Y103H
  58. INS_F_2288835_i406T_pncA
  59. SNP_CN_2288952_CT_pncA_G97D
  60. SNP_CN_2288697_AG_pncA_L182S
  61. SNP_CN_2288853_AT_pncA_V130E
  62. SNP_CN_2288730_GA_pncA_A171V
  63. SNP_CN_2288775_AG_pncA_L156P
  64. SNP_CN_2288850_AC_pncA_V131G
  65. INS_F_2289009_i232C_pncA_G78G
  66. INS_F_2289050_i191T_pncA_Y64.
  67. SNP_CN_2288964_AC_pncA_V93G
  68. SNP_CN_2288853_AC_pncA_V130G
  69. DEL_F_2288697_d544AACT_pncA
  70. SNP_CN_2289009_CA_pncA_G78V
  71. SNP_CN_2289043_AG_pncA_S67P
  72. SNP_CN_2288938_CG_pncA_A102P
  73. SNP_P_2289252_TG.30_pncA
  74. SNP_CN_2289073_GC_pncA_H57D
  75. SNP_CN_2289206_GC_pncA_D12E
  76. DEL_F_2289060_d181GTGCCGGA_pncA
  77. SNP_CN_2289202_AG_pncA_C14R
  78. SNP_CN_2289050_AT_pncA_Y64.
  79. SNP_CN_2289046_AG_pncA_S66P
  80. SNP_CN_2288784_GT_pncA_T153N
  81. SNP_CN_2289037_GA_pncA_P69S
  82. SNP_CN_2288718_AC_pncA_M175R
  83. SNP_CN_2289042_GC_pncA_S67W
  84. SNP_CN_2288956_TG_pncA_K96Q
  85. SNP_CN_2289142_AC_pncA_Y34D
  86. SNP_CN_2288844_AG_pncA_I133T
  87. SNP_CN_2289040_AC_pncA_W68G
  88. SNP_CN_2289054_TG_pncA_D63A
  89. SNP_CN_2289090_TG_pncA_H51P
  90. SNP_CN_2289186_AG_pncA_L19P
  91. SNP_CN_2288826_AC_pncA_V139G
  92. SNP_CN_2288818_TG_pncA_T142P
  93. SNP_CN_2288817_GA_pncA_T142M
  94. SNP_CN_2289219_TC_pncA_D8G
  95. SNP_CN_2289072_TC_pncA_H57R
  96. SNP_CN_2289028_AG_pncA_C72R
  97. INS_F_2288942_i299T_pncA
  98. DEL_F_2288923_d318C_pncA
  99. SNP_CN_2288742_GA_pncA_T167I
  100. SNP_CN_2289095_GC_pncA_D49E
  101. SNP_CN_2288956_TC_pncA_K96E
  102. SNP_CN_2288703_AC_pncA_V180G
  103. SNP_CN_2289069_AC_pncA_F58C
  104. SNP_CN_2288955_TG_pncA_K96T
  105. SNP_CN_2288764_TC_pncA_T160A
  106. SNP_P_2289251_AC.31_pncA
  107. SNP_CN_2288696_CA_pncA_L182F
  108. SNP_CN_2288778_AC_pncA_V155G
  109. SNP_CN_2289103_TC_pncA_T47A
  110. SNP_CN_2288943_GA_pncA_T100I
  111. SNP_CN_2288718_AG_pncA_M175T
  112. SNP_CN_2289030_TC_pncA_H71R
  113. SNP_CN_2289162_AG_pncA_L27P
  114. SNP_CN_2289030_TG_pncA_H71P
  115. SNP_CN_2288827_CT_pncA_V139M
  116. SNP_CN_2289231_AG_pncA_L4S
  117. SNP_CN_2289213_TC_pncA_Q10R
  118. SNP_CN_2288965_CA_pncA_V93L
  119. SNP_CN_2289001_AC_pncA_F81V
  120. SNP_CN_2289054_TC_pncA_D63G
  121. SNP_CN_2288766_AC_pncA_L159R
  122. SNP_CN_2288869_CA_pncA_V125F
  123. SNP_CN_2289091_GA_pncA_H51Y
  124. SNP_CN_2288859_AC_pncA_V128G

EMB

  1. SNP_CN_4247429_AG_embB_M306V
  2. SNP_CN_4247431_GA_embB_M306I
  3. SNP_CN_4247431_GC_embB_M306I
  4. SNP_CN_4247730_GC_embB_G406A
  5. SNP_CN_4248003_AG_embB_Q497R
  6. SNP_CN_4249518_AG_embB_H1002R
  7. SNP_CN_409569_GA_iniB_A70T
  8. SNP_CN_4247729_GA_embB_G406S
  9. SNP_CN_4247431_GT_embB_M306I
  10. SNP_CN_4247429_AC_embB_M306L
  11. SNP_P_4243222_CA.11_embA.embB
  12. SNP_CN_4247574_AC_embB_D354A
  13. SNP_CN_4247495_GT_embB_D328Y
  14. SNP_CN_4249583_GA_embB_D1024N
  15. SNP_CN_4243392_AG_embA_N54D
  16. SNP_P_4243225_CT.8_embA.embB
  17. SNP_CN_4242182_GT_embC_A774S
  18. SNP_CN_4247729_GT_embB_G406C

STR

  1. SNP_CN_781687_AG_rpsL_K43R
  2. SNP_N_1472359_A514C_rrs
  3. SNP_CN_781822_AC_rpsL_K88T
  4. SNP_N_1473246_A1401G_rrs
  5. SNP_CN_781822_AG_rpsL_K88R
  6. SNP_CN_4407809_CA_gid_D132Y
  7. SNP_N_1472358_C513T_rrs
  8. SNP_CN_4407927_TG_gid_E92D
  9. SNP_N_1472751_A906G_rrs
  10. SNP_CN_4407934_AC_gid_L90R
  11. SNP_N_1472362_C517T_rrs
  12. SNP_N_1472753_A908C_rrs
  13. SNP_CN_781822_AT_rpsL_K88M
  14. SNP_CN_4407832_AG_gid_V124A
  15. SNP_CN_4408091_GT_gid_P38T
  16. SNP_I_1473637_A.21_rrs.rrl
  17. SNP_CN_4408094_CT_gid_G37R
  18. DEL_F_4407640_d562A_gid
  19. SNP_N_1473109_T1264G_rrs
  20. SNP_CN_4407967_AC_gid_L79W
  21. SNP_CN_4407967_AG_gid_L79S
  22. SNP_CN_4407768_CA_gid_L145F
  23. SNP_CN_4407995_TG_gid_S70R
  24. DEL_F_4407852_d350C_gid
  25. SNP_N_1473167_T1322G_rrs
  26. DEL_F_4408023_d179T_gid
  27. DEL_F_4408116_d86G_gid
  28. SNP_CN_4408060_TG_gid_H48P
  29. SNP_CN_4408138_TC_gid_Y22C
  30. SNP_CN_4408064_GA_gid_R47W
  31. SNP_CN_4408148_CG_gid_A19P
  32. SNP_CN_4407947_GA_gid_L86F
  33. SNP_CN_4407916_CA_gid_R96L
  34. SNP_CN_4407748_AG_gid_L152S
  35. SNP_N_1473343_G1498T_rrs
  36. SNP_CN_4407985_CG_gid_G73A
  37. SNP_CN_4408102_CT_gid_G34E

ETH

  1. SNP_P_1673425_CT.15_fabG1.inhA
  2. SNP_CN_4326333_CG_ethA_A381P
  3. SNP_CN_4326116_GA_ethA_T453I
  4. SNP_CN_1674481_TG_inhA_S94A
  5. SNP_CN_4326714_GA_ethA_Q254.
  6. SNP_CN_1674263_TC_inhA_I21T
  7. SNP_CN_4327416_CA_ethA_A20S
  8. DEL_F_4326184_d1289G_ethA
  9. SNP_CN_4327380_AC_ethA_Y32D
  10. SNP_CN_1674434_TG_inhA_V78G
  11. INS_F_4326141_i1332C_ethA
  12. SNP_CN_4326600_GA_ethA_R292.
  13. SNP_CN_4326713_TG_ethA_Q254P
  14. SNP_CN_4326305_GA_ethA_S390F
  15. SNP_P_1673423_GT.17_fabG1.inhA
  16. INS_F_4326722_i751C_ethA
  17. SNP_CN_1673449_AC_fabG1_T4P
  18. SNP_CN_4327311_AG_ethA_S55P
  19. SNP_CN_4326278_GT_ethA_S399.
  20. SNP_CN_4327148_CT_ethA_W109.

KAN

  1. SNP_N_1473246_A1401G_rrs
  2. SNP_CN_1918745_AGtlyA.269W

CAP

  1. SNP_N_1473246_A1401G_rrs
  2. SNP_N_1473109_T1264G_rrs
  3. SNP_N_1472753_A908C_rrs
  4. SNP_N_1473160_G1315A_rrs
  5. SNP_N_1473343_G1498T_rrs

AMK

  1. SNP_N_1473246_A1401G_rrs
  2. SNP_N_1472359_A514C_rrs

CIP

  1. SNP_CN_7582_AG_gyrA_D94G
  2. SNP_CN_7570_CT_gyrA_A90V
  3. SNP_CN_7582_AC_gyrA_D94A
  4. SNP_CN_7581_GT_gyrA_D94Y
  5. SNP_CN_6735_AC_gyrB_N538T
  6. SNP_CN_7572_TC_gyrA_S91P
  7. SNP_CN_7566_GA_gyrA_D89N

LEVO

  1. SNP_CN_7582_AG_gyrA_D94G
  2. SNP_CN_7570_CT_gyrA_A90V
  3. SNP_CN_7581_GT_gyrA_D94Y
  4. SNP_CN_7581_GA_gyrA_D94N
  5. SNP_CN_7582_AC_gyrA_D94A
  6. SNP_CN_7572_TC_gyrA_S91P
  7. SNP_CN_7563_GT_gyrA_G88C
  8. SNP_CN_7566_GA_gyrA_D89N

OFLX

  1. SNP_CN_7582_AG_gyrA_D94G
  2. SNP_CN_7570_CT_gyrA_A90V
  3. SNP_CN_7582_AC_gyrA_D94A
  4. SNP_CN_7581_GA_gyrA_D94N
  5. SNP_CN_6735_AC_gyrB_N538T
  6. SNP_CN_7581_GC_gyrA_D94H

PAS

  1. SNP_CN_3073852_TC_thyA_H207R
  2. SNP_CN_3074449_AT_thyA_L8Q
  3. SNP_CN_3074182_TC_thyA_Q97R
  4. SNP_P_3074479_AG.157_thyA

doctormo commented 8 years ago

I'm running final consistency checks on the manual data input.

Example matrix.csv files have 1839 columns, 1125 of them unique.

These data (listed above) only have 239 columns. Is there a process we're missing, or are there a lot more mutations that we have not included in the list because they are not important for the prediction?

The prediction still works though, so not sure. Testing continues.
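A consistency check of this kind can be sketched by counting total and unique header names in a matrix file. This assumes the first CSV row holds the column names, which is a guess about the matrix.csv layout; `column_counts` is a hypothetical helper, not code from the site.

```python
import csv
import io


def column_counts(csv_text):
    """Return (total columns, unique column names) from the first CSV row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return len(header), len(set(header))
```

For a file on disk, the text could be read with `open(path, newline="").read()` before calling the helper.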

mahafarhat commented 8 years ago

Hi Martin, yes, the mutations a user can select from the drop down are a subset of the total, essentially only the mutations that are important for the prediction. Maha


mahafarhat commented 5 years ago

@doctormo I'm reopening this issue for the manual input predict option. It has the mutations we need. Please restore this into the drug-gene lookup tables. I'm assuming the code itself is still intact.

doctormo commented 5 years ago

Maha,

The names for some of these mutations have changed, for example SNP_CN_3073852_TC_thyA_H207R is now SNP_CN_3073852_T620C_H207R_thyA in the database.

I've managed to add all the rrl and rrs mutations, but everything else fails.

mahafarhat commented 5 years ago

I thought we wrote code to deal with the naming discrepancy previously (matching on the third field, 3073852 in the example you give, and the TC before and after the 620). Can you double check please? If not, I will need to update the naming of these mutations with the new convention.
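The matching described here (position from the third field, plus the reference and alternate bases on either side of the optional number) could look roughly like the sketch below. It is not the project's actual parser: `snp_key` and `same_snp` are hypothetical names, and the sketch only covers simple SNP names whose fourth field is two bases with an optional number between them, so promoter and intergenic names such as "CT.15" are deliberately left unmatched.

```python
import re

# Old style change field: "TC"; new style: "T620C".
_CHANGE = re.compile(r"^([ACGT])\d*([ACGT])$")


def snp_key(name):
    """Return (position, ref, alt) for a simple SNP name, or None if it does not fit."""
    fields = name.split("_")
    if len(fields) < 4 or fields[0] != "SNP":
        return None
    match = _CHANGE.match(fields[3])
    if match is None:
        return None
    return (fields[2], match.group(1), match.group(2))


def same_snp(a, b):
    """True when both names parse and agree on position and base change."""
    key_a, key_b = snp_key(a), snp_key(b)
    return key_a is not None and key_a == key_b
```

With the example from this thread, `same_snp("SNP_CN_3073852_TC_thyA_H207R", "SNP_CN_3073852_T620C_H207R_thyA")` compares the key `("3073852", "T", "C")` on both sides, ignoring the order of the gene and amino-acid fields.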


doctormo commented 5 years ago

I have parsing which can match both formats. But because these names are being submitted to the predict script, I need to make sure that the mutation names aren't going to upset the script.

mahafarhat commented 5 years ago

The simplest way is to present the mutations to the user in the format provided above in this comment trail. This is the same format as what's used in the R script. The generate_matrix.py code in our WGS pipeline converts between the current vcf/var format and this older format used by the R script.

doctormo commented 5 years ago

We don't have a system in place to do this. The mutation names have mutated, and the database doesn't have special provision for "mutation names as they appear in the predict script". I've matched up the mutations we have so far, so we may have to sit down and go through the mutations provision; I think we could do it better than we currently do.

mahafarhat commented 5 years ago

Ok I'm attaching an example matrix input file (csv format) for the TB predict. I think it is the updated format after all. Let's discuss on Friday.

https://gentb.hms.harvard.edu/tb/media/data/tbdata_00000317/Peru3419_matrix.csv

doctormo commented 5 years ago

The new deployed list is now a fixed json file, which can be edited directly by anyone.

https://gentb.hms.harvard.edu/tb/static/manual_predict.json