faridrashidi / scphylo-tools

a python toolkit for single-cell tumor phylogenetic analysis
https://scphylo-tools.readthedocs.io
BSD 3-Clause "New" or "Revised" License
5 stars 2 forks source link

Ancestor-descendent accuracy returns wrong values as well as `ZeroDivisionError`s #8

Closed gordonkoehn closed 1 year ago

gordonkoehn commented 1 year ago

Description

Running the scphylo.tl.ad() returns wrong values and division by zero errors. Confident about the usage of input, see examples.

Minimal reproducible example

Example 1: wrong output

In this example scphylo.tl.ad() returns a wrong value, compared to the attached by-hand calculation for these small trees. See input arrays and tree topology attached.

tree1 / df1:
         0  3  2  1
Cell-1  0  0  0  0
Cell-2  0  0  1  0
Cell-3  0  0  1  1
Cell-4  1  0  1  1
Cell-5  1  1  1  1 

tree2 / df2:
         0  3  2  1
Cell-1  0  0  0  0
Cell-2  0  0  0  1
Cell-3  0  1  0  1
Cell-4  1  1  0  1
Cell-5  1  1  1  1

tree1:
4
└── 2
    └── 1
        └── 0
            └── 3
tree2:
4
└── 1
    └── 3
        └── 0
            └── 2
scphylo AD: 0.33333333333333337
correct AD: 0.42857142857142855

By hand calculation:

249461085-d20bebab-4385-48eb-a178-6e3949e3c281

Example 2: Division by Zero

In this example the scphylo.tl.ad() fails.

tree 1 / df1:
         2  0  1  3
Cell-1  0  0  0  0
Cell-2  0  1  0  0
Cell-3  0  0  1  0
Cell-4  1  0  0  0
Cell-5  0  0  0  1 

tree 2 / df2:
         2  0  1  3
Cell-1  0  0  0  0
Cell-2  1  0  0  0
Cell-3  1  1  0  0
Cell-4  1  0  1  0
Cell-5  0  0  0  1

tree 1: 
4
├── 0
├── 1
├── 2
└── 3
tree 2: 
4
├── 2
│   ├── 0
│   └── 1
└── 3

These inputs lead to a Division by Zero error.

See Traceback.

Traceback

for Example 2, this resulted in a ZeroDivisionError in the very last line of the ad function.

def ad(df_grnd, df_sol): ... return 1 - len(error_pairs) / n_adpairs ZeroDivisionError: division by zero

Version

0.0.2

Unofficial fork at https://github.com/pawel-czyz/scphylo-tools

faridrashidi commented 1 year ago

Hi,

Thank you for using scphylo-tools and bringing the issue to our attention. I wanted to clarify that the AD is calculated on the mutation tree, where the root is not considered as a mutation. I noticed that in your calculation, you included the root (number 4) in your AD calculation, which may result in an incorrect value. In your input data you only have 4 mutations while in your calculation you have 5 mutations! To obtain the correct AD value, please exclude the root from your calculation.

I appreciate you bringing the ZeroDivisionError to my attention. I will work on resolving the issue. The error occurs when the first tree is structured as a star, which means there are no AD pairs. I will make sure to handle this edge case properly so that the error does not occur again.

Let me know if you have any further questions. Thank you again for your feedback!

gordonkoehn commented 1 year ago

Hi Farid - thanks for your prompt and helpful reply.

Thanks for pointing out the convention used here. Yes, I indeed included the root as a mutation so far. Leaving it out makes more sense, to get the similarity of ancestral relationships. Just realizing now that, including the root effectively compares if mutations exist in both trees. Given this, I perfectly see how small star trees lead to an unexpected error.

Thank you for taking care of the error :)

PS: happy & heavy user of your MP3 distance

faridrashidi commented 1 year ago

Hi Gordon, sure anytime! I just handled the exception in this commit I'm glad it's been useful! Please let me know if you had any further issues with scphylo-tools.

gordonkoehn commented 1 year ago

Fantastic, thank you very much! - We'r looking into using more of your distances.