elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Add a way to access _score from DataFrame when using scoring filters #287

Open sethmlarson opened 3 years ago

sethmlarson commented 3 years ago

Relates to #282. It'd be nice to be able to access the _score value (and sort by it too). We need to work out how to expose the _score information to users. My first thought was to include it as a "pseudo-scripted" field of type float64:

import eland as ed

# `es` is an existing Elasticsearch client; selecting "_score" is the proposal
df = ed.DataFrame(es, "nyc-restaurants")

print(df.es_match("blue").filter(["name", "_score"]))

                                            name    _score
ZckkjnQBvi72UTXObqxX             BLUE HAVEN EAST  5.523277
JckkjnQBvi72UTXOb60U         BLUE BAY RESTAURANT  5.523277
68kkjnQBvi72UTXOb60V       RIAZOR BLUE TAPAS BAR  4.813509
ackkjnQBvi72UTXOb64V  BLUE CAFE RESTAURANT & BAR  4.813509
BMkkjnQBvi72UTXOcLI8    BLUE SKY RESTAURANT CAFE  4.813509
...                                          ...       ...
A8wljnQBvi72UTXOrpgP          BLUE BOTTLE COFFEE  5.523277
LswljnQBvi72UTXOrZdc                  BLUE SMOKE  6.478565
QcwljnQBvi72UTXOrJWW                   BLUE RUIN  6.478565
XswljnQBvi72UTXOrZZc          BLUE BOTTLE COFFEE  5.523277
jswljnQBvi72UTXOq5K0              THE BLUE STOVE  5.523277

[556 rows x 2 columns]

Should all Eland DataFrames have this _score column by default with NaN values when there's no scoring happening? Or maybe we only add the column when using a scoring filter like es_match() and we do so automatically? Would love thoughts here.
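To make the two options concrete, here is a rough pandas-only sketch, not Eland's implementation. `hits_to_frame` and its `include_score` flag are hypothetical names used only for illustration; the hit dictionaries follow the usual `_id` / `_score` / `_source` shape of an Elasticsearch search response, where `_score` can be null when no scoring happened. With `include_score=True` the column is always present as float64 and falls back to NaN; with `include_score=False` the column is left out entirely.

```python
import numpy as np
import pandas as pd


def hits_to_frame(hits, columns, include_score=False):
    """Build a pandas DataFrame from Elasticsearch-style search hits.

    Hypothetical helper for illustration only; not part of Eland.
    """
    index = [hit["_id"] for hit in hits]
    records = []
    for hit in hits:
        record = {col: hit["_source"].get(col) for col in columns}
        if include_score:
            # A null score (no scoring happened) becomes NaN
            score = hit.get("_score")
            record["_score"] = np.nan if score is None else float(score)
        records.append(record)

    out_columns = list(columns) + (["_score"] if include_score else [])
    frame = pd.DataFrame(records, index=index, columns=out_columns)
    if include_score:
        # Keep the dtype stable even when every score is missing
        frame["_score"] = frame["_score"].astype("float64")
    return frame


# Sample hits: the first row is taken from the output above,
# the second is an invented unscored hit.
hits = [
    {"_id": "LswljnQBvi72UTXOrZdc", "_score": 6.478565,
     "_source": {"name": "BLUE SMOKE"}},
    {"_id": "unscored-example", "_score": None,
     "_source": {"name": "BLUE RUIN"}},
]

with_score = hits_to_frame(hits, ["name"], include_score=True)
without_score = hits_to_frame(hits, ["name"])
print(with_score.sort_values("_score", ascending=False))
```

Adding the column only for scoring filters would correspond to passing `include_score=True` only from something like es_match(), while the "always present" option would pass it everywhere and rely on the NaN fallback.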

stevedodson commented 2 years ago

@sethmlarson - I think adding this only to the return from es_match could be appropriate. However, I don't think this is required for GA.