Clean/refactor Figure 3

ErinWeisbart commented 1 year ago

Changes to note: Using the updated CCLE data has a small effect on our hit calling because the control groups change slightly. This ripples through the notebook, causing minor changes to many numbers and figures. Visible changes include: Fig 2A/B: The numbers of compartment-specific and whole-cell hits have changed from:

DMEM: 2332/2349 to 2368/2349
HPLM: 1236/3465 to 1313/3456

ErinWeisbart commented 1 year ago

@MerajRamezani I'm stuck on cleaning up Figure 3C. Starting with DMEM, I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[21], line 48
     46 hit_corr_dic = {}
     47 for s in hit_pair_set:
---> 48     hit_corr_dic[s] = corr_dic[s]
     50 print(f'For condition {condition} \n Number of hit pairs is {len(hit_pair_set)} \n',
     51     f'Number of hit pairs with correlation is {len(hit_corr_dic)}')
     53 parent_corr_dic[condition] = corr_dic

KeyError: frozenset({'MALT1'})

Can you take a look and figure out what's going on? Or schedule a meeting so we can talk through it together?

MerajRamezani commented 1 year ago

@ErinWeisbart this is what I mentioned in the meeting last week. I have already added some code to resolve this issue:

from itertools import permutations

# Create a list of protein clusters with all complexes that had at least 66% of genes represented within the HeLa DMEM WGS hits
cluster_count = 0
hit_cluster_list_list = []
hit_set = set()
for i in range(len(ppi_data_h)):
    cluster = ppi_data_h.iloc[i]['subunits(Gene name)'].split(';')
    count = 0
    hit_cluster_list = []
    for g in cluster:
        if g in genes:
            count += 1
            hit_set.add(g)
            hit_cluster_list.append(g)
    if (count/len(cluster)) >= 0.66:
        cluster_count += 1
    if hit_cluster_list and (count/len(cluster)) >= 0.66:
        hit_cluster_list_list.append(hit_cluster_list)
print(len(hit_set),cluster_count,len(hit_cluster_list_list))

# Assign correlations to hit gene pairs
hit_pair_set = set()
for l in hit_cluster_list_list:
    for c in list(permutations(l,2)):
        hit_pair_set.add(frozenset(c))

hit_corr_dic = {}
for s in hit_pair_set:
    hit_corr_dic[s] = corr_dic[s]

print(' Number of hit pairs',len(hit_pair_set),'\n',
      'Number of hit pairs with correlation',len(hit_corr_dic))
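
To see what the 66% threshold keeps, here is a minimal sketch on toy data. It uses plain Python lists instead of the `ppi_data_h` DataFrame, and the function name, the toy complexes, and the toy hit set are all hypothetical, not taken from the notebook.

```python
def clusters_meeting_threshold(complexes, hits, min_fraction=0.66):
    """For each ';'-separated complex, keep its hit members if at least
    min_fraction of its subunits are hits. Toy helper, not notebook code."""
    selected = []
    for subunits in complexes:
        members = subunits.split(';')
        hit_members = [g for g in members if g in hits]
        if len(hit_members) / len(members) >= min_fraction:
            selected.append(hit_members)
    return selected


# Hypothetical complexes and hits, for illustration only
toy_complexes = ['A;B;C', 'D;E;F', 'G']
toy_hits = {'A', 'B', 'D', 'G'}

selected = clusters_meeting_threshold(toy_complexes, toy_hits)
print(selected)
```

Note that a single-subunit complex such as `'G'` passes the threshold but contributes no pairs, since there is nothing to pair it with.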

Considering that we decided to use PCA in these analyses, maybe it makes sense for me to update the CORUM & STRING analysis before you rerun all sections? I am also available to meet; see the openings on my calendar.

ErinWeisbart commented 1 year ago

Though this isn't quite finished, I'm going to merge it into main. I can bypass the error I discussed above by ensuring that each pair in hit_pair_set has a length of 2. I'm now tracking the remaining cleanup in #12.
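
A minimal sketch of that length-2 check, on toy data. One way a key like `frozenset({'MALT1'})` could arise is a gene name appearing twice in a cluster, so that `permutations` pairs it with itself; that cause is an assumption here, not something confirmed in this thread. The helper name, the toy clusters, and the correlation values are hypothetical.

```python
from itertools import permutations


def build_hit_pairs(hit_clusters):
    """Collect unordered gene pairs, skipping any 'pair' that collapses
    to a single gene (e.g. when a gene is listed twice in a cluster)."""
    pairs = set()
    for cluster in hit_clusters:
        for a, b in permutations(cluster, 2):
            pair = frozenset((a, b))
            if len(pair) == 2:  # the length-2 guard
                pairs.add(pair)
    return pairs


# Toy inputs: the first cluster lists MALT1 twice (assumed failure mode)
toy_clusters = [['MALT1', 'BCL10', 'MALT1'], ['CARD11', 'BCL10']]
toy_corr = {
    frozenset(('MALT1', 'BCL10')): 0.8,   # made-up correlation values
    frozenset(('CARD11', 'BCL10')): 0.5,
}

hit_pairs = build_hit_pairs(toy_clusters)
# Look up correlations only for pairs that exist in the toy dictionary
hit_corr = {p: toy_corr[p] for p in hit_pairs if p in toy_corr}
print(len(hit_pairs), len(hit_corr))
```

The guard only hides the symptom; if duplicated gene names are indeed the cause, deduplicating each cluster before pairing would be an alternative.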

broadinstitute / 2022_PERISCOPE

Clean/refactor Figure 3 #13