Open swvanderlaan opened 12 months ago
Hi,
there is a filter function.
You can use it to filter by status code (the index of that table), and DAF (the column you get after check_af()
).
.filter_value(expr)
: return a new Sumstats object
.filter_value(expr, inplace=True)
: perform the filter inplace
Note: the datatype of status code is string
FYI, one of the possible mismatch could be the indels. In current version of gwaslab, indels will be check by REF and ALT in the reference vcf, not using EAF. In some rare cases, at the same position, there could be both insertion and deletion with same allele pair. For example, 1:123:C:CCA and 1:123:CCA:C. I will add an additional EAF filter for this in future version to avoid possible mismatch.
Ok. Admittedly I am a newbie with Python, but I have the SumStats-object like below. Now I want to filter on CAVEAT
.
SNPID CHR POS EA NEA EAF BETA SE P N ... P_SQRTN DF DIRECTIONS COCHRANS_Q P_COCHRANS_Q I_SQUARED TAU_SQUARED CAVEAT rsID DAF
0 1:10177:A:AC 1 10177 AC A 0.382 0.0011 0.0178 0.95263 23431 ... 0.96751 2 ....+--.. 2.399 0.30136 16.6 0.000 None rs367896724 -0.023567
1 1:10352:T:TA 1 10352 TA T 0.399 0.0057 0.0177 0.74551 23508 ... 0.60000 2 ....--+.. 1.510 0.47009 0.0 0.000 None rs555500075 -0.027441
2 1:10616:C:CCGCCGTTGCAAAGGCGCGCCG 1 10616 C CCGCCGTTGCAAAGGCGCGCCG 0.992 -0.0089 0.0923 0.92287 23308 ... 0.82261 1 ....-.+.. 0.000 1.00000 0.0 0.000 None <NA> -0.002036
3 1:11008:C:G 1 11008 G C 0.088 0.0287 0.0275 0.29687 24310 ... 0.16065 2 ....--+.. 4.317 0.11552 53.7 0.004 None <NA> -0.000469
4 1:11012:C:G 1 11012 G C 0.087 0.0204 0.0280 0.46590 24310 ... 0.28030 2 ....--+.. 3.647 0.16147 45.2 0.003 None rs544419019 -0.001469
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12562931 X:155246736:G:A 23 155246736 A G 0.124 -0.0085 0.0117 0.46572 35689 ... 0.46572 0 ........- 0.000 1.00000 0.0 0.000 None None NaN
12562932 X:155251634:G:T 23 155251634 T G 0.096 -0.0100 0.0138 0.46838 35689 ... 0.46838 0 ........- 0.000 1.00000 0.0 0.000 None None NaN
12562933 X:155254706:C:T 23 155254706 T C 0.123 -0.0125 0.0121 0.29996 35689 ... 0.29996 0 ........- 0.000 1.00000 0.0 0.000 None None NaN
12562934 X:155255182:G:A 23 155255182 A G 0.007 0.0080 0.0473 0.86563 35689 ... 0.86563 0 ........+ 0.000 1.00000 0.0 0.000 None None NaN
12562935 X:155258654:C:G 23 155258654 G C 0.106 -0.0075 0.0127 0.55636 35689 ... 0.55636 0 ........- 0.000 1.00000 0.0 0.000 None None NaN
And these are the datatypes:
SNPID object
CHR Int64
POS Int64
EA category
NEA category
EAF float32
BETA float32
SE float32
P float64
N Int32
STATUS string
Z_SQRTN float64
P_SQRTN float64
DF int64
DIRECTIONS object
COCHRANS_Q float64
P_COCHRANS_Q float64
I_SQUARED float64
TAU_SQUARED float64
CAVEAT object
rsID object
DAF float64
dtype: object
Now I wanted to filter on CAVEAT
, like this:
gwas_data_sumstats_qc = gwas_data_sumstats.filter_value(
'CAVEAT=="None"'
)
Somehow it is filtering away everything. How should I approach this filtering then?
Hi, I guess you could try either of the followings:
gwas_data_sumstats_qc = gwas_data_sumstats.filter_value( 'CAVEAT.isnull()' )
gwas_data_sumstats_qc = gwas_data_sumstats.filter_value( 'CAVEAT != CAVEAT ' )
gwas_data_sumstats_qc = gwas_data_sumstats.filter_value( 'CAVEAT in [None]' )
For such None values, it is quite tricky for querying as described in https://stackoverflow.com/questions/26535563/querying-for-nan-and-other-names-in-pandas. I will add a note for this in the documents. Thanks for your feedback!
How can I filter on the
lookupstatus()
? Suppose my table looks like the one below. How can I filter c.q. select on a value in theAlign
-column?This way I could filter, but also further inspect what is going on.
And related: how can I use the results from the allele frequency comparison? So suppose I get a (not so) nice cross ;-), now I want to inspect those flipped variants. How can I extract that subset?
Thanks!