sr320 commented 5 years ago

I have come across to code to possibly streamline this notebook. We should look at updating and making this notebook more robust..

https://github.com/RobertsLab/code/blob/master/10-blast-2-slim.ipynb

This should include updated SPID info. Will post code optimization suggestions

sr320 commented 5 years ago

See this tweet https://twitter.com/strnr/status/1107717768378494976?s=12

kubu4 commented 5 years ago

What do you want the notebook to do that it doesn't already do? A rough outline of how you want it to work?

sr320 commented 5 years ago

One thought was getting your GO manipuation into one line I could understand :) see tweet.

Update SPID file - reassess all the different information that we might want to capture from UNIPROT.

Pound with lots of blast files to check edge cases.

Several of us have gene subset we want info on.

I would love a one stop magical notebook.

GO is not necessarily end goal - comprehensive annotation is.

On Mar 20, 2019, 8:52 AM -0700, kubu4 notifications@github.com, wrote:

What do you want the notebook to do that it doesn't already do? A rough outline of how you want it to work? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

shellywanamaker commented 5 years ago

Not sure if this is clear enough to be helpful, but here's how I spread the GO terms across columns and collapsed them back into one column in R (it's also in lines 74-92 in this code):

Subset "protein_ID" column and "GO IDs" column containing list of GO_IDs separated by ";" from all_sig0.1_pro_logFC_pval

sig0.1_pro_GO <- all_sig0.1_pro_logFC_pval[,c("protein_ID","GO_IDs")]

make empty data frame that will get filled in by for loop

sig0.1_pro_GOid_term <- data.frame()

loop through each line of data frame containing one column with "protein_ID" and one column with list of "GO_IDs" separated by ";"

for (i in 1:nrow(sig0.1_pro_GO)){

create a row that lists all GO IDs associated with one protein listed across rows, where each different GO ID gets put in it's own column

sig0.1_pro_GOid_term_row <- data.frame(t(data.frame(strsplit(as.character(sig0.1_pro_GO$GO_IDs[i]),'; ', fixed = TRUE))), stringsAsFactors = FALSE)

add each row created in the line above to the empty data frame

this will add NAs in columns where less terms exist; e.g. if a protein only has two terms and another has 10, there will be 8 NAs added to the row for the protein with 2 terms

sig0.1_pro_GOid_term <- rbind.fill(sig0.1_pro_GOid_term,sig0.1_pro_GOid_term_row) }

add protein IDs back to GO IDs

sig0.1_pro_GOid_term <- cbind(all_sig0.1_pro_logFC_pval[,"protein_ID"], sig0.1_pro_GOid_term)

this results in a table with protein ID listed in one column and each next column contains a GO ID that was listed with the protein in Uniprot DB.

str(sig0.1_pro_GOid_term)

reshape data so that all GO ID columns are gathered in one column called "GO"

STACKED_sig0.1_pro_GOid_term <- tidyr::gather(sig0.1_pro_GOid_term,"protein_ID","GO", 2:ncol(sig0.1_pro_GOid_term))

sr320 commented 5 years ago

use case where solution not evident - https://github.com/RobertsLab/resources/issues/618#issuecomment-474937754

sr320 commented 5 years ago

and @shellytrigg points hits on https://github.com/RobertsLab/resources/issues/614 how to do everything efficiently in one notebook (R vs bash)

shellywanamaker commented 5 years ago

Yes, all in one notebook would be better. I did the uniprot mapping/reformatting the mapped output using this jupyter notebook and the used the R script linked above to get the GO terms in usable format. All documented here: https://shellytrigg.github.io/40th-post/ .

kubu4 commented 5 years ago

I guess I need a more defined explanation of what the following would be:

Input file type/format
Desired output file format

use case where solution not evident - #618 (comment)

Isn't the solution the same as always? Perform BLAST to get Uniprot accession numbers, then join with Uniprot accession numbers and associated GO terms?

Update SPID file

It seems like this is the real solution to the whole thing. Get a massive SPID file with all available columns (e.g. SP IDs, taxonomic lineage, protein domains, protein names, GO terms and subsets, etc) and use that for joins. Then, user can select which columns that want to use for annotations.

sr320 commented 5 years ago

It seems like this is the real solution to the whole thing. Get a massive SPID file with all available columns (e.g. SP IDs, taxonomic lineage, protein domains, protein names, GO terms and subsets, etc) and use that for joins. Then, user can select which columns that want to use for annotations.

yes

action item is getting said massive SPID file...

and providing some guidance on likely desired outputs - (this answer could come from group) but as a test case @kubu4 will be finding gene set on Cv with high coverage, low coverage, high methylation, low methylation.... and we want to know everything possible about those genes (proteins)..

kubu4 commented 5 years ago

Here's the "default" layout of the entire SPID database file (listing a single entry below). This clearly has everything we'd want, however, parsing it won't be straightforward (for us, anyway) and it will be extremely difficult to "flatten" this file further, since each ID entry will have a varying number of features.

We might just have to devise a way to work with this format.

The web interface to UniProt does actually provide a way to download a flattened version of the database, but you have to select which columns you want:

https://www.uniprot.org/uniprot/?query=reviewed:yes#customize-columns

There are a TON of available columns. Maybe we can decide on a subset of those columns and then just download a nicely flattened file with our desired columns?

Here's an entry from the SPID database:

ID   001R_FRG3G              Reviewed;         256 AA.
AC   Q6GZX4;
DT   28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT   19-JUL-2004, sequence version 1.
DT   16-JAN-2019, entry version 35.
DE   RecName: Full=Putative transcription factor 001R;
GN   ORFNames=FV3-001R;
OS   Frog virus 3 (isolate Goorha) (FV-3).
OC   Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Alphairidovirinae;
OC   Ranavirus.
OX   NCBI_TaxID=654924;
OH   NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH   NCBI_TaxID=30343; Dryophytes versicolor (chameleon treefrog).
OH   NCBI_TaxID=8404; Lithobates pipiens (Northern leopard frog) (Rana pipiens).
OH   NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH   NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA   Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT   "Comparative genomic analyses of frog virus 3, type species of the
RT   genus Ranavirus (family Iridoviridae).";
RL   Virology 323:70-84(2004).
CC   -!- FUNCTION: Transcription activation. {ECO:0000305}.
DR   EMBL; AY548484; AAT09660.1; -; Genomic_DNA.
DR   RefSeq; YP_031579.1; NC_005946.1.
DR   ProteinModelPortal; Q6GZX4; -.
DR   SwissPalm; Q6GZX4; -.
DR   GeneID; 2947773; -.
DR   KEGG; vg:2947773; -.
DR   Proteomes; UP000008770; Genome.
DR   GO; GO:0046782; P:regulation of viral transcription; IEA:InterPro.
DR   InterPro; IPR007031; Poxvirus_VLTF3.
DR   Pfam; PF04947; Pox_VLTF3; 1.
PE   4: Predicted;
KW   Activator; Complete proteome; Reference proteome; Transcription;
KW   Transcription regulation.
FT   CHAIN         1    256       Putative transcription factor 001R.
FT                                /FTId=PRO_0000410512.
FT   COMPBIAS     14     17       Poly-Arg.
SQ   SEQUENCE   256 AA;  29735 MW;  B4840739BF7D4121 CRC64;
     MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
     EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
     AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
     EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
     SFRKIYTDLG WKFTPL
//

kubu4 commented 5 years ago

I've created an updated GOslim "flatfile" for us to possibly use going forward. It contains only UniProt accessions that fall under GOslim terms with Biological Process aspect.

Here's the file heads up, it's 4.6GB):

http://gannet.fish.washington.edu/Atumefaciens/20190424_uniprot_go_goslim_P/uniprot_goslim_P.txt

Here's how I generated it:

https://github.com/RobertsLab/code/blob/master/notebooks/sam/20190424_swoose_uniprot_go_goslim.ipynb

Basically:

Get all GO IDs gene ontology consortium's "generic goslim" subset file
Use these IDs to pull out all lines matching those GO IDs from UniProt gene ontology annotation file AND only those with an aspect of "P" (biological process).

Let me know any thoughts/changes.

kubu4 commented 5 years ago

Ignore my previous post. Output file isn't useful - doesn't contain any of the GO terms! Doh!

shellywanamaker commented 5 years ago

here's a link to where the Robert's lab GO slim table was downloaded from:

http://www.informatics.jax.org/gotools/MGI_GO_Slim.html

and here's the link to the table itself: http://www.informatics.jax.org/gotools/data/input/map2MGIslim.txt

Found this referenced in Mac's 2010 paper

sr320 commented 5 years ago

another use case have https://gannet.fish.washington.edu/Atumefaciens/20190701_pgen_maker_v074_annotation/blastp_annotation/blastp.outfmt6

but now want protein names and GO annotations

RobertsLab / hackweek-2019

Optimize code in blast 2 go notebook #9

Subset "protein_ID" column and "GO IDs" column containing list of GO_IDs separated by ";" from all_sig0.1_pro_logFC_pval

make empty data frame that will get filled in by for loop

loop through each line of data frame containing one column with "protein_ID" and one column with list of "GO_IDs" separated by ";"

create a row that lists all GO IDs associated with one protein listed across rows, where each different GO ID gets put in it's own column

add each row created in the line above to the empty data frame

this will add NAs in columns where less terms exist; e.g. if a protein only has two terms and another has 10, there will be 8 NAs added to the row for the protein with 2 terms

add protein IDs back to GO IDs

this results in a table with protein ID listed in one column and each next column contains a GO ID that was listed with the protein in Uniprot DB.

reshape data so that all GO ID columns are gathered in one column called "GO"