Open sr320 opened 5 years ago
See this tweet https://twitter.com/strnr/status/1107717768378494976?s=12
What do you want the notebook to do that it doesn't already do? A rough outline of how you want it to work?
One thought was getting your GO manipuation into one line I could understand :) see tweet.
Update SPID file - reassess all the different information that we might want to capture from UNIPROT.
Pound with lots of blast files to check edge cases.
Several of us have gene subset we want info on.
I would love a one stop magical notebook.
GO is not necessarily end goal - comprehensive annotation is.
On Mar 20, 2019, 8:52 AM -0700, kubu4 notifications@github.com, wrote:
What do you want the notebook to do that it doesn't already do? A rough outline of how you want it to work? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.
Not sure if this is clear enough to be helpful, but here's how I spread the GO terms across columns and collapsed them back into one column in R (it's also in lines 74-92 in this code):
sig0.1_pro_GO <- all_sig0.1_pro_logFC_pval[,c("protein_ID","GO_IDs")]
sig0.1_pro_GOid_term <- data.frame()
for (i in 1:nrow(sig0.1_pro_GO)){
sig0.1_pro_GOid_term_row <- data.frame(t(data.frame(strsplit(as.character(sig0.1_pro_GO$GO_IDs[i]),'; ', fixed = TRUE))), stringsAsFactors = FALSE)
sig0.1_pro_GOid_term <- rbind.fill(sig0.1_pro_GOid_term,sig0.1_pro_GOid_term_row) }
sig0.1_pro_GOid_term <- cbind(all_sig0.1_pro_logFC_pval[,"protein_ID"], sig0.1_pro_GOid_term)
str(sig0.1_pro_GOid_term)
STACKED_sig0.1_pro_GOid_term <- tidyr::gather(sig0.1_pro_GOid_term,"protein_ID","GO", 2:ncol(sig0.1_pro_GOid_term))
use case where solution not evident - https://github.com/RobertsLab/resources/issues/618#issuecomment-474937754
and @shellytrigg points hits on https://github.com/RobertsLab/resources/issues/614 how to do everything efficiently in one notebook (R vs bash)
Yes, all in one notebook would be better. I did the uniprot mapping/reformatting the mapped output using this jupyter notebook and the used the R script linked above to get the GO terms in usable format. All documented here: https://shellytrigg.github.io/40th-post/ .
I guess I need a more defined explanation of what the following would be:
use case where solution not evident - #618 (comment)
Isn't the solution the same as always? Perform BLAST to get Uniprot accession numbers, then join with Uniprot accession numbers and associated GO terms?
Update SPID file
It seems like this is the real solution to the whole thing. Get a massive SPID file with all available columns (e.g. SP IDs, taxonomic lineage, protein domains, protein names, GO terms and subsets, etc) and use that for joins. Then, user can select which columns that want to use for annotations.
It seems like this is the real solution to the whole thing. Get a massive SPID file with all available columns (e.g. SP IDs, taxonomic lineage, protein domains, protein names, GO terms and subsets, etc) and use that for joins. Then, user can select which columns that want to use for annotations.
yes
action item is getting said massive SPID file...
and providing some guidance on likely desired outputs - (this answer could come from group) but as a test case @kubu4 will be finding gene set on Cv with high coverage, low coverage, high methylation, low methylation.... and we want to know everything possible about those genes (proteins)..
Here's the "default" layout of the entire SPID database file (listing a single entry below). This clearly has everything we'd want, however, parsing it won't be straightforward (for us, anyway) and it will be extremely difficult to "flatten" this file further, since each ID entry will have a varying number of features.
We might just have to devise a way to work with this format.
The web interface to UniProt does actually provide a way to download a flattened version of the database, but you have to select which columns you want:
https://www.uniprot.org/uniprot/?query=reviewed:yes#customize-columns
There are a TON of available columns. Maybe we can decide on a subset of those columns and then just download a nicely flattened file with our desired columns?
Here's an entry from the SPID database:
ID 001R_FRG3G Reviewed; 256 AA.
AC Q6GZX4;
DT 28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT 19-JUL-2004, sequence version 1.
DT 16-JAN-2019, entry version 35.
DE RecName: Full=Putative transcription factor 001R;
GN ORFNames=FV3-001R;
OS Frog virus 3 (isolate Goorha) (FV-3).
OC Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Alphairidovirinae;
OC Ranavirus.
OX NCBI_TaxID=654924;
OH NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH NCBI_TaxID=30343; Dryophytes versicolor (chameleon treefrog).
OH NCBI_TaxID=8404; Lithobates pipiens (Northern leopard frog) (Rana pipiens).
OH NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT "Comparative genomic analyses of frog virus 3, type species of the
RT genus Ranavirus (family Iridoviridae).";
RL Virology 323:70-84(2004).
CC -!- FUNCTION: Transcription activation. {ECO:0000305}.
DR EMBL; AY548484; AAT09660.1; -; Genomic_DNA.
DR RefSeq; YP_031579.1; NC_005946.1.
DR ProteinModelPortal; Q6GZX4; -.
DR SwissPalm; Q6GZX4; -.
DR GeneID; 2947773; -.
DR KEGG; vg:2947773; -.
DR Proteomes; UP000008770; Genome.
DR GO; GO:0046782; P:regulation of viral transcription; IEA:InterPro.
DR InterPro; IPR007031; Poxvirus_VLTF3.
DR Pfam; PF04947; Pox_VLTF3; 1.
PE 4: Predicted;
KW Activator; Complete proteome; Reference proteome; Transcription;
KW Transcription regulation.
FT CHAIN 1 256 Putative transcription factor 001R.
FT /FTId=PRO_0000410512.
FT COMPBIAS 14 17 Poly-Arg.
SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64;
MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
SFRKIYTDLG WKFTPL
//
I've created an updated GOslim "flatfile" for us to possibly use going forward. It contains only UniProt accessions that fall under GOslim terms with Biological Process aspect.
Here's the file heads up, it's 4.6GB):
http://gannet.fish.washington.edu/Atumefaciens/20190424_uniprot_go_goslim_P/uniprot_goslim_P.txt
Here's how I generated it:
https://github.com/RobertsLab/code/blob/master/notebooks/sam/20190424_swoose_uniprot_go_goslim.ipynb
Basically:
Let me know any thoughts/changes.
Ignore my previous post. Output file isn't useful - doesn't contain any of the GO terms! Doh!
here's a link to where the Robert's lab GO slim table was downloaded from:
http://www.informatics.jax.org/gotools/MGI_GO_Slim.html
and here's the link to the table itself: http://www.informatics.jax.org/gotools/data/input/map2MGIslim.txt
Found this referenced in Mac's 2010 paper
another use case have https://gannet.fish.washington.edu/Atumefaciens/20190701_pgen_maker_v074_annotation/blastp_annotation/blastp.outfmt6
but now want protein names and GO annotations
I have come across to code to possibly streamline this notebook. We should look at updating and making this notebook more robust..
https://github.com/RobertsLab/code/blob/master/10-blast-2-slim.ipynb
This should include updated SPID info. Will post code optimization suggestions