csbl-br / wikiora

Flask app for gene over-representation analysis based on Wikidata.
https://wikiora.sysbio.tools
MIT License

bug: Everything is nan when running the full list of named genes from hgnc #51

Open lubianat opened 3 months ago

lubianat commented 3 months ago

image

~43k genes from https://www.genenames.org/cgi-bin/download/custom?col=gd_app_sym&status=Approved&hgnc_dbtag=on&order_by=gd_app_sym_sort&format=text&where=(gd_pub_chrom_map%20not%20like%20%27%25patch%25%27%20and%20gd_pub_chrom_map%20not%20like%20%27%25alternate%20reference%20locus%25%27)&submit=submit

lubianat commented 3 months ago

For all protein-coding genes, the results are also weird.

List: https://www.genenames.org/cgi-bin/download/custom?col=gd_app_sym&status=Approved&hgnc_dbtag=on&order_by=gd_app_sym_sort&format=text&where=(gd_pub_chrom_map%20not%20like%20%27%25patch%25%27%20and%20gd_pub_chrom_map%20not%20like%20%27%25alternate%20reference%20locus%25%27)%0Aand%20gd_locus_type%20=%20%27gene%20with%20protein%20product%27&submit=submit

Results:

image

lubianat commented 3 months ago

for biological processes it runs fine:

image

lubianat commented 3 months ago

ChatGPT comments:

Scenarios for "nan" (not a number):

Division by Zero or Invalid Operations:
    In the odds_ratio calculation, a zero numerator over a zero denominator would give nan, because 0/0 is undefined. The code uses max(1.0 * (n - x) * (N - x), 1) in the denominator to avoid division by zero, but other operations can still produce nan.
    np.log10(p_value) returns nan if p_value is negative and -inf if p_value is zero (neither logarithm is a finite number). A p_value that is already nan also carries through the logarithm. Any of these can make combined_score nan; an infinite logarithm does so if combined_score multiplies it by an odds_ratio of zero, since 0 * inf is nan.
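
A minimal sketch of how these values propagate, using only the standard library. The helper log10_ieee is hypothetical and only mirrors how np.log10 treats zero and negative inputs; the combined_score formula (odds ratio times -log10 of the p-value) is an assumption, since the thread does not show the actual code.

```python
import math

def log10_ieee(p):
    """Hypothetical stand-in that mirrors np.log10 on edge cases."""
    if p > 0:
        return math.log10(p)
    if p == 0:
        return -math.inf  # np.log10(0.0) gives -inf (with a RuntimeWarning)
    return math.nan       # np.log10 of a negative number (or nan) gives nan

def combined_score(odds_ratio, p_value):
    # Assumed formula, not confirmed by the thread.
    return odds_ratio * -log10_ieee(p_value)

print(combined_score(2.5, 0.0))   # inf: finite odds ratio times infinity
print(combined_score(0.0, 0.0))   # nan: 0 * inf is undefined
print(combined_score(2.5, -0.1))  # nan: log of a negative p-value
```

If the real formula has this shape, a p-value of exactly zero gives inf for most terms and nan wherever the odds ratio is also zero.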

Scenarios for "inf" (infinity):

Logarithm of a Very Small p-value:
    The -np.log10(p_value) component in combined_score becomes inf if p_value underflows to exactly zero. A 64-bit float cannot represent positive values below about 5e-324, so a smaller p_value is stored as 0.0; np.log10(0.0) is -inf, and the negation turns it into inf.
Extremely Large odds_ratio:
    Because of the max function, the denominator (n - x) * (N - x) in the odds_ratio calculation is never smaller than 1, so the division itself cannot produce inf. The odds_ratio can still become very large when the denominator is clamped to 1, because it then equals the numerator.
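
The underflow case can be reproduced with plain Python floats. This is only an illustration: the p_value below is a made-up product of two small numbers, not output from wikiora.

```python
import math

# Hypothetical p-value: 1e-400 is far below the smallest positive double,
# so the product is stored as exactly 0.0.
p_value = 1e-200 * 1e-200
print(p_value == 0.0)  # True

# math.log10 raises ValueError on 0.0, whereas np.log10 returns -inf
# with a RuntimeWarning. Emulate the NumPy outcome here.
try:
    score = -math.log10(p_value)
except ValueError:
    score = math.inf

print(score)  # inf
```

With a background of tens of thousands of genes, p-values this small are plausible, which would explain inf appearing for the largest gene lists.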

Preventative Measures in Code:

max(..., 1) in the denominator: This avoids division by zero, but the odds_ratio can still be large when the term inside max is small and the denominator ends up at or near 1.
Handling log of zero or negative values: np.log10(p_value) needs a guard so that p_value is never zero or negative, for example by checking p_value > 0 before computing the logarithm.

lubianat commented 3 months ago

Not necessarily a bug. Maybe floor the p-value estimate at 1e-100.
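
A rough sketch of that floor, assuming the score uses -log10 of the p-value. The names P_FLOOR and neg_log10_clamped are made up for illustration and are not from the wikiora code.

```python
import math

P_FLOOR = 1e-100  # proposed lower bound for the p-value estimate

def neg_log10_clamped(p_value, floor=P_FLOOR):
    """Return -log10(p_value), with p_value raised to at least `floor`."""
    return -math.log10(max(p_value, floor))

print(neg_log10_clamped(0.0))     # about 100, instead of inf
print(neg_log10_clamped(1e-300))  # about 100, capped by the floor
print(neg_log10_clamped(0.01))    # about 2, ordinary values are unchanged
```

For NumPy arrays, np.clip(p_values, 1e-100, 1.0) before the logarithm should have the same effect. The cap means every term with p below 1e-100 gets the same -log10 value, so ranking among those terms would then depend on the odds ratio alone.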