Added explode column to encyclopedia parser

ricomnl commented 3 years ago

If the proteins.txt includes multiple proteins in the protein column encyclopedia formats it like this: sp|Q5T1J5|CHCH9_HUMAN;sp|Q9Y6H1|CHCH2_HUMAN Therefore we need to explode the column after splitting it by ';'.

wfondrie commented 3 years ago

The behavior I had intended was that each row from the Encyclopedia output gets counted as only one protein. If we include all of them as I think is happening in this PR, then essentially we're "double dipping" for our enrichment analysis. This command was intended to grab only the accession from the first protein listed in the protein group:

accessions = proteins["Protein"].str.extract(f"\|(.+?)\|", expand=False)

wfondrie commented 3 years ago

For example, the proteins you listed above share 94.04% sequence identity, meaning we will almost never tell them apart:

CLUSTAL O(1.2.4) multiple sequence alignment

SP|Q9Y6H1|CHCH2_HUMAN MPRGSRSRTSRMAPPASRAPQMRAAPRPAPVAQPPAAAPPSAVGSSAAAPRQPGLMAQMA 60
SP|Q5T1J5|CHCH9_HUMAN MPRGSRSRTSRMAPPASRAPQMRAAPRPAPVAQPPAAAPPSAVGSSAAAPRQPGLMAQMA 60
                      ************************************************************

SP|Q9Y6H1|CHCH2_HUMAN TTAAGVAVGSAVGHTLGHAITGGFSGGSNAEPARPDITYQEPQGTQPAQQQQPCLYEIKQ 120
SP|Q5T1J5|CHCH9_HUMAN TTAAGVAVGSAVGHTQGHAVTGGFSGGSNAEPARPDIAYQEPQGTQPAQQQQPCFYGIKQ 120
                      *************** ***:*****************:****************:* ***

SP|Q9Y6H1|CHCH2_HUMAN FLECAQNQGDIKLCEGFNEVLKQCRLANGLA 151
SP|Q5T1J5|CHCH9_HUMAN FLECAQNQGDIKLCEDFSKVLKQCRLAKGLA 151
                      ***************.*.:********:***

If we include both of these, then we essentially double count the GO terms shared between them, when in reality it should only be counted once.

TalusBio / gopher

Added explode column to encyclopedia parser #2