baryshnikova-lab / safepy

Python implementation of Spatial Analysis of Functional Enrichment (SAFE)
GNU General Public License v3.0

Fix input deduplication e.g. two entries for one gene #1

Closed exue closed 5 years ago

exue commented 5 years ago

SAFE would error out if the user input contained duplicate rows for genes that were not in the network.

Error: ValueError: cannot reindex from a duplicate axis
Raised on the line: node2attribute = node2attribute.reindex(index=node_label_order, fill_value=fill_value)
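
The failure is straightforward to reproduce outside safepy. A minimal sketch (hypothetical data, mirroring the variable names used in safe_io.py): the source index contains a duplicated gene label, so pandas refuses to reindex.

```python
import pandas as pd

# Hypothetical minimal reproduction: the *source* index holds a duplicate
# label ("YMR056C"), so pandas cannot determine a unique row mapping and
# raises ValueError when reindexing.
node2attribute = pd.DataFrame(
    {"Value": [-0.79, -0.02, -0.92]},
    index=["YMR056C", "YMR056C", "YBL015W"],
)
node_label_order = ["YBL015W", "YLR153C"]  # unique network node labels

try:
    node2attribute.reindex(index=node_label_order, fill_value=0)
except ValueError as e:
    # message varies by pandas version, e.g.
    # "cannot reindex from a duplicate axis" or
    # "cannot reindex on an axis with duplicate labels"
    print(e)
```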

Example input from AFT1 Yeast TF data from IDEA, modified to include duplicate entries:

ORF     Value
YMR056C -0.792311525
YMR056C -0.019015263
YMR056C -0.83412795
YMR056C -0.471975731
YNR016C -0.21730749
YBL015W -0.915305609
YLR304C -1.591613527
YJL200C -1.624091036
YLR153C -0.736739858
etc.

In this example input, only YBL015W and YLR153C are in the network itself. YNR016C, YLR304C, and YJL200C are successfully ignored, but the duplicated YMR056C causes issues. We check for duplicates via if len(node_label_mapped) != len(set(node_label_mapped)): to decide whether we should average values, but this only checks the mapped rows/nodes. Meanwhile, we try to reindex the whole array, including nodes that don't exist in the network.

So if the only duplicates are among nodes that don't exist in the network, like YMR056C, they are not caught by the old check (if at least one duplicate exists in the network, the old check does catch it). To fix this, I changed the check to if len(node_label_in_file) != len(set(node_label_in_file)):
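
The difference between the two checks can be sketched with the example data above (hypothetical simplified variables, using the identifiers from the issue):

```python
# node_label_in_file: every label read from the attribute file.
# node_label_mapped: only the labels that match a network node.
node_label_in_file = ["YMR056C", "YMR056C", "YBL015W", "YLR153C"]
network_nodes = {"YBL015W", "YLR153C"}
node_label_mapped = [l for l in node_label_in_file if l in network_nodes]

# Old check: looks only at mapped labels, so a duplicate confined to
# unmapped labels (YMR056C) slips through.
old_check = len(node_label_mapped) != len(set(node_label_mapped))

# New check: looks at all labels from the file, so it catches the duplicate.
new_check = len(node_label_in_file) != len(set(node_label_in_file))

print(old_check, new_check)  # False True
```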

Finally, I also noticed the check if not node_label_order:. Although it works as intended (the default node_label_order parameter [] evaluates as falsy), I changed the default parameter to None to make the intent clearer.
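
The None-default pattern can be sketched as follows (a hypothetical simplified signature, not safepy's actual one): None signals "not provided" explicitly, rather than relying on an empty list being falsy.

```python
def load_attributes(node_label_order=None):
    # Hypothetical simplified sketch: when the caller does not supply an
    # order, derive it from the data (stand-in for node2attribute.index.values).
    if node_label_order is None:
        node_label_order = ["YBL015W", "YLR153C"]
    return node_label_order

print(load_attributes())            # falls back to the derived order
print(load_attributes(["YML123C"]))  # caller-supplied order is respected
```

Using None as the sentinel also sidesteps Python's mutable-default-argument pitfall, where a default [] is shared across calls.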

abarysh commented 5 years ago

@exue Now that I look at it in more detail, I wonder -- why do we care about duplicates that do not exist in the network?

In the original code (https://github.com/baryshnikova-lab/safepy/blob/master/safe_io.py) at line 312 we are re-indexing using node_label_order which, typically, is the list of nodes in the network and is always unique. The only time node_label_order is something else is when load_attributes is run without specifying node_label_order, in which case it gets assigned node2attribute.index.values (lines 300-301), which could contain duplicates and therefore create a problem.

My question is: in what context are you running load_attributes? Do you provide node_label_order?

exue commented 5 years ago

The main reason is that if we leave in duplicates that are not in the network, pandas reindexing will fail.

The error message is cannot reindex from a duplicate axis, where the "from" axis is node2attribute.index; the "to" axis, node_label_order, is unique.
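
The asymmetry is worth spelling out: duplicates are legal in the *target* labels of reindex (the matching row is simply repeated), but the *source* index must be unique. A small sketch with hypothetical data:

```python
import pandas as pd

# Unique source index: reindexing with a duplicated target label works;
# the YBL015W row is simply emitted twice.
df = pd.DataFrame({"Value": [1.0, 2.0]}, index=["YBL015W", "YLR153C"])
out = df.reindex(index=["YBL015W", "YBL015W", "YLR153C"])
print(len(out))  # 3
```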

I am calling safe.load_attributes, which provides the network's node_label_order to safe_io.load_attributes. However, even if node_label_order contains duplicates, reindexing will still succeed. Pandas Docs, or their test case.

Another option would be to remove the raw data that is not in the network (node_label_not_mapped), but that is a larger change that would require more examination downstream to make sure nothing else is accidentally affected.
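
For illustration, that alternative might look like the following sketch (hypothetical names, not safepy's API): filter out rows whose labels are absent from the network before reindexing, so unmapped duplicates can never trip up the reindex. Note that duplicates among *mapped* labels would still need the existing averaging logic.

```python
import pandas as pd

# Hypothetical sketch of dropping unmapped rows before reindexing.
node2attribute = pd.DataFrame(
    {"Value": [-0.79, -0.02, -0.92, -0.74]},
    index=["YMR056C", "YMR056C", "YBL015W", "YLR153C"],
)
node_label_order = ["YBL015W", "YLR153C"]

# Keep only rows whose label is a network node; the duplicated, unmapped
# YMR056C rows are discarded and the subsequent reindex succeeds.
in_network = node2attribute.index.isin(node_label_order)
filtered = node2attribute[in_network]
result = filtered.reindex(index=node_label_order, fill_value=0)
```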

abarysh commented 5 years ago

Yep, got it, you're right. I made a unit test for this scenario and confirmed that it fails with the original code but passes with the changes.
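
A sketch of what such a regression test might look like (not the actual safepy unit test; the helper below is a hypothetical stand-in mirroring the fixed logic, averaging duplicates whenever the file's labels are not unique):

```python
import unittest
import pandas as pd

def deduplicate_and_reindex(node2attribute, node_label_order, fill_value=0):
    # Hypothetical helper: if any label in the file is duplicated,
    # average the duplicated rows, then reindex onto the network nodes.
    if not node2attribute.index.is_unique:
        node2attribute = node2attribute.groupby(level=0).mean()
    return node2attribute.reindex(index=node_label_order, fill_value=fill_value)

class TestDuplicateUnmappedLabels(unittest.TestCase):
    def test_duplicates_outside_network(self):
        # YMR056C is duplicated but absent from the network; loading
        # should not raise, and network rows should come through intact.
        data = pd.DataFrame(
            {"Value": [-0.79, -0.02, -0.92, -0.74]},
            index=["YMR056C", "YMR056C", "YBL015W", "YLR153C"],
        )
        result = deduplicate_and_reindex(data, ["YBL015W", "YLR153C"])
        self.assertEqual(list(result.index), ["YBL015W", "YLR153C"])
```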

exue commented 5 years ago

Perfect, thanks for making the change and the test! I like the simplified is_unique check
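
The simplified check mentioned here presumably boils down to pandas' built-in uniqueness flag rather than a len/set comparison; a minimal sketch, assuming that reading:

```python
import pandas as pd

# Index.is_unique is False as soon as any label repeats.
idx = pd.Index(["YMR056C", "YMR056C", "YBL015W"])
print(idx.is_unique)  # False
```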