IndexError in `get_differential_biosynthesis`.

fubin1999 commented 5 months ago

Hi! I encountered an IndexError using get_differential_biosynthesis on my own dataset.

result = get_differential_biosynthesis(prepared_sub, group1_sub, group2_sub)

prepared_sub is a subset of my dataset containing only 6 samples for debugging. I've formatted this DataFrame with IUPAC strings as the first "glycan" column, and the following columns samples.

Full traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/motif/processing.py:923, in rescue_glycans.<locals>.wrapper(*args, **kwargs)
    921 try:
    922   # Try running the original function
--> 923   return func(*args, **kwargs)
    924 except Exception as e:
    925   # If an error occurs, attempt to rescue the glycan sequences

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:744, in construct_network(glycans, allowed_ptms, edge_type, permitted_roots, abundances)
    743 if 'GlcNAc' in ''.join(permitted_roots) and any(g.count('Man') >= 5 for g in network.nodes()):
--> 744   add_high_man_removal(network)
    745 if abundances:

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:622, in add_high_man_removal(network)
    621 if target.count('Man') >= 5:
--> 622   diff_attr = network[source][target]['diff']
    623   edges_to_add.append((target, source, diff_attr))

KeyError: 'diff'

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 result = get_differential_biosynthesis(prepared_sub, group1_sub, group2_sub)

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:1371, in get_differential_biosynthesis(df, group1, group2, analysis, paired)
   1369 root = list(infer_roots(df.index.tolist()))
   1370 root = max(root, key = len) if '-ol' not in root[0] else min(root, key = len)
-> 1371 nets = {col: estimate_weights(construct_network(df.index.tolist(), abundances = df[col].values.tolist()), root = root) for col in all_groups}
   1372 res = {col: get_maximum_flow(nets[col], source = root) for col in all_groups}
   1373 if analysis == "reaction":

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:1371, in <dictcomp>(.0)
   1369 root = list(infer_roots(df.index.tolist()))
   1370 root = max(root, key = len) if '-ol' not in root[0] else min(root, key = len)
-> 1371 nets = {col: estimate_weights(construct_network(df.index.tolist(), abundances = df[col].values.tolist()), root = root) for col in all_groups}
   1372 res = {col: get_maximum_flow(nets[col], source = root) for col in all_groups}
   1373 if analysis == "reaction":

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/motif/processing.py:928, in rescue_glycans.<locals>.wrapper(*args, **kwargs)
    926 rescued_args = [canonicalize_iupac(arg) if isinstance(arg, str) else [canonicalize_iupac(a) for a in arg] if isinstance(arg, list) and isinstance(arg[0], str) else arg for arg in args]
    927 # After rescuing, attempt to run the function again
--> 928 return func(*rescued_args, **kwargs)

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:646, in construct_network(glycans, allowed_ptms, edge_type, permitted_roots, abundances)
    644 if permitted_roots is None:
    645   permitted_roots = infer_roots(glycans)
--> 646 abundance_mapping = {glycans[k]: abundances[k] for k in range(len(glycans))} if abundances else {}
    647 # Generating graph from adjacency of observed glycans
    648 min_size = min([k.count('(') for k in permitted_roots]) + 1

File ~/miniforge3/envs/hcc/lib/python3.11/site-packages/glycowork/network/biosynthesis.py:646, in <dictcomp>(.0)
    644 if permitted_roots is None:
    645   permitted_roots = infer_roots(glycans)
--> 646 abundance_mapping = {glycans[k]: abundances[k] for k in range(len(glycans))} if abundances else {}
    647 # Generating graph from adjacency of observed glycans
    648 min_size = min([k.count('(') for k in permitted_roots]) + 1

IndexError: list index out of range

More information for reference:

glycowork version: v1.2.0
Python version: 3.11
Device: MacBook Air M3
System: MacOS 14.5 (23F79)

I did a little debugging with PyCharm and found that abundances[k] with k being 62 (line 646 in biosynthesis.py) triggered the error. Note that prepared_sub had only 62 rows.

fubin1999 commented 5 months ago

Here is the dataset I used. The first three samples are group1 and the rest three are group2. prepared.csv

Bribak commented 5 months ago

Thanks for bringing this up! I think the issue might be that the edge attribute is 'diffs' but 'diff' is indexed here (we rarely do N-glycan biosynthetic network, which is why this didn't come up I suppose). I'll have a look at whether this fixes the issue.

fubin1999 commented 5 months ago

Thanks for your fast response. BTW, I find this function really useful and would like to know how it works. I noticed that it’s a new function of glycowork but couldn’t find any paper describing it. Will there be an upcoming preprint?

Bribak commented 5 months ago

This has been solved in 3b7fd64 The diffs/diff thing was indeed an issue. But I then encountered another issue: capacity bottlenecks with N-glycans, where there are many unobserved intermediate structures, resulting in everything being low variance-filtered. Minimum default capacity is now class-specific to avoid this.

Since we already pushed 1.3 to PyPI yesterday, this will not be part of 1.3 (it's currently on the dev branch). Might be 1.3.1 but it's honestly more likely that it'll come with 1.4 (autumn-ish?)

Glad that you find this functionality useful! It's still in a sort of beta (some fine-tuning + expansion is expected/required). It will be part of a preprint at some point (maybe/hopefully with 1.4?), but it's always a question of prioritization/time constraints:-)

BojarLab / glycowork

IndexError in `get_differential_biosynthesis`. #47