Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0

Filtering based on lookupstatus() and check_af() #73

Open swvanderlaan opened 7 months ago

swvanderlaan commented 7 months ago

How can I filter on the output of lookupstatus()? Suppose my table looks like the one below. How can I filter (i.e. select) on a value in the Align column?

This way I could filter, but also further inspect what is going on.

|         | Genome_Build | rsID&SNPID | CHR&POS | Standardize&Normalize | Align | Palindromic_SNP&Indel | Count | Percentage(%) |
|---------|--------------|------------|---------|-----------------------|-------|-----------------------|-------|---------------|
| 1960009 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Match: NEA=REF | Unchecked | 9973036 | 79.38 |
| 1960019 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized SNP | Flipped_fixed | Unchecked | 1438688 | 11.45 |
| 1960309 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Match: NEA=REF | Unchecked | 60813 | 0.48 |
| 1960319 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Flipped_fixed | Unchecked | 29135 | 0.23 |
| 1960363 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | Indel_match | 854691 | 6.8 |
| 1960364 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | Indel_flipped_fixed | 183613 | 1.46 |
| 1960368 | hg19 | rsid unknown & SNPID valid | CHR valid & POS valid | standardized & normalized indel | Both_alleles_on_ref+indistinguishable | No_matching_or_no_info | 22960 | 0.18 |

And related: how can I use the results of the allele frequency comparison? Suppose I get a (not so) nice cross ;-) and now want to inspect those flipped variants. How can I extract that subset?

Thanks!

Cloufield commented 7 months ago

Hi, there is a filter function. You can use it to filter by status code (the index of that table) and by DAF (the column you get after running check_af()).

`.filter_value(expr)`: returns a new Sumstats object
`.filter_value(expr, inplace=True)`: performs the filter in place

Note: the status code is stored as a string.
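The expression passed to `filter_value()` is a pandas-style query string; a minimal sketch of the same semantics, assuming `filter_value()` delegates to `DataFrame.query` (the toy table and values below are illustrative, not gwaslab output):

```python
import pandas as pd

# Toy stand-in for a Sumstats table; column names mirror the examples in
# this thread, but the rows are made up. STATUS is a string, as noted.
df = pd.DataFrame({
    "SNPID": ["1:100:A:G", "1:200:C:T", "1:300:C:CCA"],
    "STATUS": ["1960009", "1960019", "1960363"],
    "DAF": [0.01, -0.20, 0.03],
})

# Select the rows whose status code equals the "Flipped_fixed" row of the
# check table (exact string match, since STATUS is a string):
flipped = df.query('STATUS == "1960019"')

# Expressions can also use the DAF column added by check_af(), e.g.
# keep variants whose allele-frequency difference is small:
small_daf = df.query('abs(DAF) < 0.12')
```

The same expression strings can then be passed to `.filter_value()` on the actual Sumstats object.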

(screenshot attached)

Cloufield commented 7 months ago

FYI, one possible source of mismatches is indels. In the current version of gwaslab, indels are checked against REF and ALT in the reference VCF, without using EAF. In some rare cases there can be both an insertion and a deletion with the same allele pair at the same position, for example 1:123:C:CCA and 1:123:CCA:C. I will add an additional EAF filter in a future version to avoid such mismatches.
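To illustrate the ambiguity and one way an EAF tiebreak could work (the helper `resolve_indel_by_eaf` and the 0.12 tolerance are hypothetical, not part of the gwaslab API):

```python
# Two indels at the same position can share the same allele pair, so
# REF/ALT alone cannot tell 1:123:C:CCA apart from 1:123:CCA:C.
def resolve_indel_by_eaf(eaf, ref_af, tol=0.12):
    """Hypothetical tiebreak: compare the study EAF with the reference
    panel's ALT allele frequency. Returns 'match', 'flipped', or
    'ambiguous'."""
    if abs(eaf - ref_af) <= tol:
        return "match"    # EA corresponds to the reference ALT allele
    if abs((1.0 - eaf) - ref_af) <= tol:
        return "flipped"  # alleles need to be swapped
    return "ambiguous"    # frequencies fit neither orientation

# Example: the reference ALT (CCA insertion) has frequency 0.30.
resolve_indel_by_eaf(0.31, 0.30)  # study EAF close to ref AF -> 'match'
resolve_indel_by_eaf(0.71, 0.30)  # close to 1 - ref AF -> 'flipped'
```

Variants near EAF 0.5 would stay ambiguous under such a rule, which is presumably why a frequency check alone cannot fully resolve these cases.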

swvanderlaan commented 7 months ago

Ok. Admittedly I am a Python newbie, but I have a Sumstats object like the one below. Now I want to filter on CAVEAT.

    SNPID   CHR POS EA  NEA EAF BETA    SE  P   N   ... P_SQRTN DF  DIRECTIONS  COCHRANS_Q  P_COCHRANS_Q    I_SQUARED   TAU_SQUARED CAVEAT  rsID    DAF
0   1:10177:A:AC    1   10177   AC  A   0.382   0.0011  0.0178  0.95263 23431   ... 0.96751 2   ....+--..   2.399   0.30136 16.6    0.000   None    rs367896724 -0.023567
1   1:10352:T:TA    1   10352   TA  T   0.399   0.0057  0.0177  0.74551 23508   ... 0.60000 2   ....--+..   1.510   0.47009 0.0 0.000   None    rs555500075 -0.027441
2   1:10616:C:CCGCCGTTGCAAAGGCGCGCCG    1   10616   C   CCGCCGTTGCAAAGGCGCGCCG  0.992   -0.0089 0.0923  0.92287 23308   ... 0.82261 1   ....-.+..   0.000   1.00000 0.0 0.000   None    <NA>    -0.002036
3   1:11008:C:G 1   11008   G   C   0.088   0.0287  0.0275  0.29687 24310   ... 0.16065 2   ....--+..   4.317   0.11552 53.7    0.004   None    <NA>    -0.000469
4   1:11012:C:G 1   11012   G   C   0.087   0.0204  0.0280  0.46590 24310   ... 0.28030 2   ....--+..   3.647   0.16147 45.2    0.003   None    rs544419019 -0.001469
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12562931    X:155246736:G:A 23  155246736   A   G   0.124   -0.0085 0.0117  0.46572 35689   ... 0.46572 0   ........-   0.000   1.00000 0.0 0.000   None    None    NaN
12562932    X:155251634:G:T 23  155251634   T   G   0.096   -0.0100 0.0138  0.46838 35689   ... 0.46838 0   ........-   0.000   1.00000 0.0 0.000   None    None    NaN
12562933    X:155254706:C:T 23  155254706   T   C   0.123   -0.0125 0.0121  0.29996 35689   ... 0.29996 0   ........-   0.000   1.00000 0.0 0.000   None    None    NaN
12562934    X:155255182:G:A 23  155255182   A   G   0.007   0.0080  0.0473  0.86563 35689   ... 0.86563 0   ........+   0.000   1.00000 0.0 0.000   None    None    NaN
12562935    X:155258654:C:G 23  155258654   G   C   0.106   -0.0075 0.0127  0.55636 35689   ... 0.55636 0   ........-   0.000   1.00000 0.0 0.000   None    None    NaN

And these are the datatypes:

SNPID             object
CHR                Int64
POS                Int64
EA              category
NEA             category
EAF              float32
BETA             float32
SE               float32
P                float64
N                  Int32
STATUS            string
Z_SQRTN          float64
P_SQRTN          float64
DF                 int64
DIRECTIONS        object
COCHRANS_Q       float64
P_COCHRANS_Q     float64
I_SQUARED        float64
TAU_SQUARED      float64
CAVEAT            object
rsID              object
DAF              float64
dtype: object

Now I wanted to filter on CAVEAT, like this:

gwas_data_sumstats_qc = gwas_data_sumstats.filter_value(
    'CAVEAT=="None"'
)

Somehow it filters away everything. How should I approach this filtering?

Cloufield commented 6 months ago

Hi, I guess you could try any of the following:

gwas_data_sumstats_qc = gwas_data_sumstats.filter_value('CAVEAT.isnull()')

gwas_data_sumstats_qc = gwas_data_sumstats.filter_value('CAVEAT != CAVEAT')

gwas_data_sumstats_qc = gwas_data_sumstats.filter_value('CAVEAT in [None]')

Querying for such None values is quite tricky, as described in https://stackoverflow.com/questions/26535563/querying-for-nan-and-other-names-in-pandas. I will add a note about this to the documentation. Thanks for your feedback!
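The pitfall can be reproduced with plain pandas, assuming `filter_value()` passes the expression to `DataFrame.query` (the toy table below is illustrative):

```python
import numpy as np
import pandas as pd

# Missing CAVEAT values are stored as NaN in an object column; they only
# *display* as "None"-like placeholders.
df = pd.DataFrame({
    "SNPID": ["1:10177:A:AC", "1:11008:C:G", "X:155246736:G:A"],
    "CAVEAT": [np.nan, "low_N", np.nan],
})

# 'CAVEAT == "None"' compares against the literal *string* "None", which
# never matches a missing value, so every row is filtered away:
none_string = df.query('CAVEAT == "None"')   # empty result

# NaN is the only value not equal to itself, so this keeps exactly the
# rows where CAVEAT is missing:
kept = df.query('CAVEAT != CAVEAT')

# .isnull() works too; plain pandas needs the python engine for it:
kept2 = df.query('CAVEAT.isnull()', engine='python')
```

The `CAVEAT != CAVEAT` trick is the most portable of the three, since it needs no special engine.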