SIMPLE-AstroDB / SIMPLE-db

BSD 3-Clause "New" or "Revised" License
Nan vs None [Discussion] #363

Closed kelle closed 1 year ago

kelle commented 1 year ago

I assumed that None was the best way to indicate missing data. Googling (and @Ajb2307's data science partner) indicates that this is not always the case! (See the most popular answer to this SO post.) Question: Does Astro and/or STScI have a stance on this? @dr-rodriguez?

Ajb2307 commented 1 year ago

For missing data, NaN can be preferred because it can be passed to a mathematical function, while None cannot.
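
A minimal sketch of that difference, using only the Python standard library. The helper name `try_sqrt` is made up for this illustration; the point is that NaN flows through arithmetic and propagates, while None raises an error.

```python
import math

nan = float("nan")

# NaN is an ordinary float, so arithmetic and math functions accept it
# and propagate it to the result.
nan_sum = nan + 1.0
nan_sqrt = math.sqrt(nan)
print(math.isnan(nan_sum), math.isnan(nan_sqrt))


def try_sqrt(value):
    """Return math.sqrt(value), or the string 'TypeError' if value is not numeric."""
    try:
        return math.sqrt(value)
    except TypeError:
        return "TypeError"


# None is not a number, so the same call fails instead of propagating.
none_result = try_sqrt(None)
print(none_result)

# Caveat: NaN never compares equal to itself, so test for it with math.isnan().
nan_equals_itself = nan == nan
print(nan_equals_itself)
```

The flip side is that NaN silently contaminates every result it touches, whereas None fails loudly at the first operation.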

dr-rodriguez commented 1 year ago

This depends on the programming language and on how missing data is stored. Databases have a concept of NULL that works across datatypes. NaN does not: np.nan is explicitly a float (i.e., there is no NaN integer), which sometimes causes issues. I will point out that pandas 2.0 is moving towards the Apache Arrow datatypes, which I believe handle missing data better than numpy. At the end of the day, we may be stuck with None, as it is a Python object rather than a float/string/int. Whatever we pick for astrodbkit2 must be datatype independent.
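
To illustrate the datatype point, here is a rough sketch using the standard-library `sqlite3` module rather than astrodbkit2 itself. The table `demo` and its columns are hypothetical. It shows a NULL stored in integer, float, and text columns coming back as None in each case, while NaN exists only as a float.

```python
import math
import sqlite3

# Hypothetical in-memory table with one column per basic datatype.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (n_obs INTEGER, magnitude REAL, band TEXT)")

# Python None is stored as SQL NULL, whatever the declared column type.
conn.execute("INSERT INTO demo VALUES (?, ?, ?)", (None, None, None))
row = conn.execute("SELECT n_obs, magnitude, band FROM demo").fetchone()
conn.close()
print(row)

# NaN, by contrast, is tied to the float type.
nan_is_float = isinstance(math.nan, float)

# There is no integer NaN: converting it raises ValueError.
try:
    int(math.nan)
    nan_to_int = "converted"
except ValueError:
    nan_to_int = "ValueError"
print(nan_is_float, nan_to_int)
```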

dr-rodriguez commented 1 year ago

If you need to do math on a database result, I think it is up to the user to sanitize it and remove missing values.
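
As a sketch of what that sanitizing step could look like on the user side: `drop_missing` and the sample values below are made up for illustration and are not part of astrodbkit2.

```python
import math
from statistics import mean


def drop_missing(values):
    """Return values with None and float NaN entries removed."""
    return [
        v
        for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]


# Made-up query result containing both kinds of missing marker.
raw = [12.5, None, 13.1, float("nan"), 12.9]

clean = drop_missing(raw)
avg = mean(clean)
print(clean, avg)
```

Filtering out both None and NaN keeps the helper independent of which marker the database layer ends up returning.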