DataResponsibly / DataSynthesizer

MIT License

Null values treated as strings #1

Closed drakar closed 7 years ago

drakar commented 7 years ago

Hi,

In a float or int field that contains null values, it appears that the pandas lib treats the values as strings rather than as floats with null values.

Is there any way to force float, either in the UI or in the pandas read method?

haoyueping commented 7 years ago

Hi drakar,

When missing values are spaces, pandas regards the whole column as strings, which might be the reason for this data type error.

The DataDescriber has just been updated to skip initial spaces when reading values from the cells of the input CSV file. Please try it and let me know if it works.
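For anyone hitting the same thing, here is a minimal sketch of the behaviour described above. It assumes the fix behaves like pandas' own skipinitialspace option; the thread does not show the actual DataDescriber change, and the inline CSV is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data: the missing latitude in row B is a single space.
csv_text = "name,latitude\nA,40.7\nB, \nC,39.9\n"

# Default read: the lone space is kept as a string, so the column is not numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# Skipping spaces after the delimiter leaves an empty field, which pandas
# treats as missing, so the column can be parsed as float.
stripped = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(raw["latitude"].dtype)
print(stripped["latitude"].dtype)
```

This only helps when the missing values are blank. A placeholder token in the cell still has to be declared as a null value explicitly.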

drakar commented 7 years ago

It still is not working. Here is what I do:

>>> import pandas as pd
>>> df = pd.read_csv("file.csv")
>>> type(df['latitude'][1])
<class 'str'>

In the .CSV file the fields are literally '<null>' which is a string, but can it be treated as a null value or coerced into a float?

stoyanovich commented 7 years ago

Haoyue,

Please look into this.

Julia.

On 8/30/17 1:33 PM, Aaron Drake wrote:

It still is not working. Here is what I do:

>>> import pandas as pd
>>> df = pd.read_csv("file.csv")
>>> type(df['latitude'][1])
<class 'str'>

In the .CSV file the fields are literally '<null>' which is a string, but can it be treated as a null value or coerced into a float?


haoyueping commented 7 years ago

Pandas supports user-defined NULL values through the na_values parameter when reading a CSV file. See pandas.read_csv.

DataSynthesizer now supports this as well: a null_values parameter has been added, and it is passed to pandas.read_csv in DataDescriber.read_dataset_from_csv.

In your case, you can try df = pd.read_csv("file.csv", na_values="'<null>'").
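A slightly fuller sketch of the same idea, using made-up inline data in place of file.csv. Whether the quote characters are part of the cell depends on the actual file, so both spellings of the placeholder are listed here as an assumption. pd.to_numeric with errors="coerce" is shown as an alternative for a frame that has already been loaded.

```python
import io

import pandas as pd

# Hypothetical data: one placeholder without quotes, one with quotes in the cell.
csv_text = "name,latitude\nA,40.7\nB,<null>\nC,'<null>'\nD,39.9\n"

# Without na_values the placeholders are ordinary strings.
raw = pd.read_csv(io.StringIO(csv_text))

# Declare both spellings as missing so the column is parsed as float.
df = pd.read_csv(io.StringIO(csv_text), na_values=["<null>", "'<null>'"])

# Alternative after loading: coerce anything non-numeric to NaN.
coerced = pd.to_numeric(raw["latitude"], errors="coerce")

print(df["latitude"].dtype)
print(df["latitude"].isna().sum())
```

Note that errors="coerce" turns every unparseable value into NaN, not just the placeholder, so na_values is the safer choice when the token is known.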

drakar commented 7 years ago

@haoyueping Thank you very much!

haoyueping commented 7 years ago

@drakar No problem. Thanks for your feedback!