Closed exue closed 5 years ago
@exue Now that I look at it in more detail, I wonder -- why do we care about duplicates that do not exist in the network?
In the original code (https://github.com/baryshnikova-lab/safepy/blob/master/safe_io.py), line 312 re-indexes using `node_label_order`, which is typically the list of nodes in the network and is always unique. The only time `node_label_order` is something else is when `load_attributes` is run without specifying `node_label_order`, in which case it gets assigned `node2attribute.index.values` (lines 300-301), which could contain duplicates and therefore cause a problem.
My question is: in what context are you running `load_attributes`? Do you provide `node_label_order`?
The main reason is that, if we leave in duplicates that are not in the network, pandas reindexing will fail. The error message is `cannot reindex from a duplicate axis`, with the `from` axis being `node2attribute.index`; the `to` axis, `node_label_order`, is unique.
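This failure mode can be reproduced with a minimal pandas sketch (the variable names mirror `safe_io.load_attributes`, but the data is made up):

```python
import pandas as pd

# Attribute table whose index contains a duplicated label ("dup")
# that does NOT exist in the network's node list.
node2attribute = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=["in_network", "dup", "dup"],
)
node_label_order = ["in_network", "other_node"]  # the unique "to" axis

try:
    node2attribute = node2attribute.reindex(index=node_label_order, fill_value=0)
    failed = False
except ValueError:
    # pandas refuses to reindex whenever the "from" axis has duplicate
    # labels, even though "dup" is never requested by the "to" axis
    failed = True
print("reindex failed:", failed)
```

Note that the duplicated label is not even among the requested labels; pandas rejects the operation purely because the source index is non-unique.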
I am calling `safe.load_attributes`, which provides the network's `node_label_order` to `safe_io.load_attributes`. However, even if `node_label_order` contains duplicates, reindexing will still succeed (see the pandas docs, or their test case).
Another option could be to remove the raw data that is not in the network (`node_label_not_mapped`), but that is a larger change that would require more examination downstream to make sure nothing else is accidentally affected.
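For reference, that alternative would look roughly like the following. This is only a sketch of the idea, not the actual change; `node2attribute` and `node_label_order` stand in for the corresponding objects in `safe_io.load_attributes`:

```python
import pandas as pd

node2attribute = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=["YBL015W", "YMR056C", "YMR056C"],  # duplicate outside the network
)
node_label_order = ["YBL015W", "YLR153C"]

# Drop rows whose label is not in the network before reindexing;
# the duplicated YMR056C rows disappear, so reindex no longer fails.
node2attribute = node2attribute[node2attribute.index.isin(node_label_order)]
node2attribute = node2attribute.reindex(index=node_label_order, fill_value=0.0)
print(node2attribute)
```

The downstream risk mentioned above is real: any later code that expects the unmapped rows to still be present would silently see fewer rows after this filter.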
Yep, got it, you're right. I made a unit test for this scenario and confirmed that it fails with the original code but works with the changes.
Perfect, thanks for making the change and the test! I like the simplified `is_unique` check.
SAFE would error out if the user input duplicate rows that were not in the network.

Error: `ValueError: cannot reindex from a duplicate axis`

On line: `node2attribute = node2attribute.reindex(index=node_label_order, fill_value=fill_value)`

Example input, from the AFT1 Yeast TF data from IDEA, modified to include duplicate entries:
| ORF | Value |
| --- | --- |
| YMR056C | -0.792311525 |
| YMR056C | -0.019015263 |
| YMR056C | -0.83412795 |
| YMR056C | -0.471975731 |
| YNR016C | -0.21730749 |
| YBL015W | -0.915305609 |
| YLR304C | -1.591613527 |
| YJL200C | -1.624091036 |
| YLR153C | -0.736739858 |

etc.
This is an example of input where only YBL015W and YLR153C are in the network itself. YNR016C, YLR304C, and YJL200C are successfully ignored, but the duplicated YMR056C causes issues. We check for duplicates via

`if len(node_label_mapped) != len(set(node_label_mapped)):`

to determine whether we should average values, but this only checks the mapped rows/nodes. Meanwhile, we try to reindex the whole array, including nodes that don't exist in the network. So if the only duplicates are among nodes that don't exist in the network, like YMR056C, they will not be caught by the old check (if at least one duplicate exists in the network, the old check will catch it). To fix this, I changed the check to

`if len(node_label_in_file) != len(set(node_label_in_file)):`
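The difference between the two checks can be illustrated with the example data above (a sketch; the real code builds these lists while mapping the file onto the network):

```python
# Labels read from the attribute file; YMR056C is duplicated but
# is not in the network, so it never reaches node_label_mapped.
node_label_in_file = ["YMR056C", "YMR056C", "YBL015W", "YLR153C"]
network_nodes = {"YBL015W", "YLR153C"}
node_label_mapped = [l for l in node_label_in_file if l in network_nodes]

# Old check: only looks at mapped labels, so the duplicate slips through.
old_check = len(node_label_mapped) != len(set(node_label_mapped))
# New check: looks at everything in the file, so the duplicate is caught.
new_check = len(node_label_in_file) != len(set(node_label_in_file))
print(old_check, new_check)  # False True
```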
Finally, I also noticed that although the check `if not node_label_order:` works as intended (the default `node_label_order` parameter `[]` evaluates as `False`), the intent is easier to miss, so I changed the default parameter to `None` to make this more clear.
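The two defaults behave the same under a truthiness test, but `None` makes the "not provided" case explicit. A minimal illustration (the function name mirrors `load_attributes`; the body is a hypothetical stand-in, not the real implementation):

```python
def load_attributes(node_label_order=None):
    # `if not node_label_order:` is True for both None and [],
    # but an explicit None default signals "caller did not provide
    # a value" and avoids the mutable-default pitfall of `[]`.
    if node_label_order is None:
        node_label_order = []  # in the real code this would come from node2attribute.index
    return node_label_order

print(load_attributes())       # []
print(load_attributes(["a"]))  # ['a']
```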