elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Difference between pandas.DataFrame.min() and eland.DataFrame.min() #121

Open stevedodson opened 4 years ago

stevedodson commented 4 years ago
>>> import eland as ed
>>> edf = ed.DataFrame('localhost', 'flights')
>>> pdf = ed.eland_to_pandas(edf)
>>> pdf
       AvgTicketPrice  Cancelled           Carrier                                          Dest DestAirportID  ...                                 OriginLocation OriginRegion        OriginWeather dayOfWeek           timestamp
0          841.265642      False   Kibana Airlines  Sydney Kingsford Smith International Airport           SYD  ...        {'lon': '8.570556', 'lat': '50.033333'}        DE-HE                Sunny         0 2018-01-01 00:00:00
1          882.982662      False  Logstash Airways                     Venice Marco Polo Airport          VE05  ...  {'lon': '18.60169983', 'lat': '-33.96480179'}        SE-BD                Clear         0 2018-01-01 18:27:00
2          190.636904      False  Logstash Airways                     Venice Marco Polo Airport          VE05  ...         {'lon': '12.3519', 'lat': '45.505299'}        IT-34                 Rain         0 2018-01-01 17:11:14
3          181.694216       True   Kibana Airlines                   Treviso-Sant'Angelo Airport          TV01  ...         {'lon': '14.2908', 'lat': '40.886002'}        IT-72  Thunder & Lightning         0 2018-01-01 10:33:28
4          730.041778      False   Kibana Airlines          Xi'an Xianyang International Airport           XIY  ...        {'lon': '-99.072098', 'lat': '19.4363'}       MX-DIF        Damaging Wind         0 2018-01-01 05:13:00
...               ...        ...               ...                                           ...           ...  ...                                            ...          ...                  ...       ...                 ...
13054     1080.446279      False  Logstash Airways          Xi'an Xianyang International Airport           XIY  ...         {'lon': '10.3927', 'lat': '43.683899'}        IT-52                Sunny         6 2018-02-11 20:42:25
13055      646.612941      False  Logstash Airways                                Zurich Airport           ZRH  ...  {'lon': '-97.23989868', 'lat': '49.90999985'}        CA-MB                 Rain         6 2018-02-11 01:41:57
13056      997.751876      False  Logstash Airways                             Ukrainka Air Base          XHBU  ...        {'lon': '-99.072098', 'lat': '19.4363'}       MX-DIF                Sunny         6 2018-02-11 04:09:27
13057     1102.814465      False          JetBeats      Ministro Pistarini International Airport           EZE  ...   {'lon': '135.4380035', 'lat': '34.78549957'}        SE-BD                 Hail         6 2018-02-11 08:28:21
13058      858.144337      False          JetBeats       Washington Dulles International Airport           IAD  ...        {'lon': '138.531006', 'lat': '-34.945'}        SE-BD                 Rain         6 2018-02-11 14:54:34

[13059 rows x 27 columns]
>>> pdf.min()
AvgTicketPrice                                100.021
Cancelled                                       False
Carrier                                        ES-Air
Dest                  Abu Dhabi International Airport
DestAirportID                                     ABQ
DestCityName                                Abu Dhabi
DestCountry                                        AE
DestRegion                                       AR-B
DestWeather                                     Clear
DistanceKilometers                                  0
DistanceMiles                                       0
FlightDelay                                     False
FlightDelayMin                                      0
FlightDelayType                         Carrier Delay
FlightNum                                     00882F6
FlightTimeHour                                      0
FlightTimeMin                                       0
Origin                Abu Dhabi International Airport
OriginAirportID                                   ABQ
OriginCityName                              Abu Dhabi
OriginCountry                                      AE
OriginRegion                                     AR-B
OriginWeather                                   Clear
dayOfWeek                                           0
timestamp                         2018-01-01 00:00:00
dtype: object
>>> edf.min()
AvgTicketPrice        100.020531
Cancelled               0.000000
DistanceKilometers      0.000000
DistanceMiles           0.000000
FlightDelay             0.000000
FlightDelayMin          0.000000
FlightTimeHour          0.000000
FlightTimeMin           0.000000
dayOfWeek               0.000000
dtype: float64
>>> 

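Elasticsearch only supports the 'min' aggregation on numeric, bool and timestamp fields, so the string columns that pandas.DataFrame.min() returns are missing from the eland.DataFrame.min() result.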

P1llus commented 4 years ago

Would it be possible to elaborate a bit on the described issue? To me, the issue seems to be more about documenting the differences between an eland DF and the pandas DF produced by eland_to_pandas in util.py.

Do you want to make the eland DF do the same as pandas DF to begin with?

stevedodson commented 4 years ago

A goal of this project is to make the eland.DataFrame API as similar as possible to pandas.DataFrame. Currently, eland.DataFrame.min() only returns min for numeric, bool and timestamp fields (and numeric_only=True is the default).

Elasticsearch can return 'min' for additional field types by using 'sort' or other methods. eland.operations.Operations._metric_aggs could be extended/refactored to return min or other aggs for other types.
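
For illustration, here is a rough sketch of what the 'sort' approach (and one possible "other method", a terms aggregation ordered by key) could look like as raw Elasticsearch search bodies. This is not how eland implements anything today; the function names and the "min_value" aggregation name are made up for the example, and it assumes the field is sortable (e.g. a keyword field such as Carrier in the flights index, rather than an analyzed text field).

```python
import json
from typing import Any, Dict


def min_via_sort(field: str) -> Dict[str, Any]:
    """Hypothetical helper: search body that sorts ascending on `field`
    and returns only the first hit, whose value is the minimum."""
    return {
        "size": 1,                            # only the smallest document
        "_source": [field],                   # only fetch the field we need
        "sort": [{field: {"order": "asc"}}],  # ascending => min comes first
    }


def min_via_terms_agg(field: str) -> Dict[str, Any]:
    """Hypothetical helper: terms aggregation ordered by key, keeping a
    single bucket, so the bucket key is the smallest value of `field`."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "min_value": {  # aggregation name chosen for this example
                "terms": {
                    "field": field,
                    "size": 1,
                    "order": {"_key": "asc"},
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the bodies; they could be sent with any Elasticsearch client.
    print(json.dumps(min_via_sort("Carrier"), indent=2))
    print(json.dumps(min_via_terms_agg("Carrier"), indent=2))
```

Either body would need one request per non-numeric field (or a multi-search), which is part of why _metric_aggs would need refactoring rather than a one-line change.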

V1NAY8 commented 3 years ago

@sethmlarson / @stevedodson Any thoughts on how to write this query for text fields? I'll get this done.