Closed ricomnl closed 3 years ago
The behavior I had intended was that each row from the Encyclopedia output gets counted as only one protein. If we include all of them as I think is happening in this PR, then essentially we're "double dipping" for our enrichment analysis. This command was intended to grab only the accession from the first protein listed in the protein group:
accessions = proteins["Protein"].str.extract(f"\|(.+?)\|", expand=False)
For example, the proteins you listed above share 94.04% sequence identity, meaning we will almost never tell them apart:
CLUSTAL O(1.2.4) multiple sequence alignment
SP|Q9Y6H1|CHCH2_HUMAN MPRGSRSRTSRMAPPASRAPQMRAAPRPAPVAQPPAAAPPSAVGSSAAAPRQPGLMAQMA 60
SP|Q5T1J5|CHCH9_HUMAN MPRGSRSRTSRMAPPASRAPQMRAAPRPAPVAQPPAAAPPSAVGSSAAAPRQPGLMAQMA 60
************************************************************
SP|Q9Y6H1|CHCH2_HUMAN TTAAGVAVGSAVGHTLGHAITGGFSGGSNAEPARPDITYQEPQGTQPAQQQQPCLYEIKQ 120
SP|Q5T1J5|CHCH9_HUMAN TTAAGVAVGSAVGHTQGHAVTGGFSGGSNAEPARPDIAYQEPQGTQPAQQQQPCFYGIKQ 120
*************** ***:*****************:****************:* ***
SP|Q9Y6H1|CHCH2_HUMAN FLECAQNQGDIKLCEGFNEVLKQCRLANGLA 151
SP|Q5T1J5|CHCH9_HUMAN FLECAQNQGDIKLCEDFSKVLKQCRLAKGLA 151
***************.*.:********:***
If we include both of these, then we essentially double count the GO terms shared between them, when in reality it should only be counted once.
If the proteins.txt includes multiple proteins in the protein column encyclopedia formats it like this: sp|Q5T1J5|CHCH9_HUMAN;sp|Q9Y6H1|CHCH2_HUMAN Therefore we need to explode the column after splitting it by ';'.