Closed clm-a closed 3 months ago
This turns out to be a bug in Polars: https://github.com/pola-rs/polars/commit/c1a1a015ac2cb78b73756f192f99f28e066ae198#r144864748
You don't need to pass all fields in your dtypes. If you pass all fields, Polars will consider it to be both names and types. But if you pass even one field less, then it uses only the dtype. I suggest removing one of your columns that is already a string (or you don't want to cast to string).
Thanks for you reply !
I declared all the columns as strings (then casted later on) as a workaround because I didn't find the type inference satisfying :
WIth
df =
DF.load_csv!(File.read!(Kino.FS.file_path("2016_Building_Energy_Benchmarking.csv")),
delimiter: ","
)
I get some inference errors like Polars Error: could not parse '201620142012201120082007' as dtype 'i64' at column 'YearsENERGYSTARCertified'
, even if I pass "infer_schema_length: nil"
Here, column / data are matching, but it could have been inferred as string with a full data scan (I may have encountered the problem on other columns and in previous Explorer versions too).
That said, I tried to specify only the dtypes I consider relevant :
dtypes = [
{"OSEBuildingID", :string},
{"ZipCode", :string},
{"TaxParcelIdentificationNumber", :string},
{"YearBuilt", :u16},
{"NumberofBuildings", :u16},
{"NumberofFloors", :u16},
{"PropertyGFATotal", :float},
{"PropertyGFAParking", :float},
{"PropertyGFABuilding(s)", :float},
{"LargestPropertyUseTypeGFA", :float},
{"SecondLargestPropertyUseTypeGFA", :float},
{"ThirdLargestPropertyUseTypeGFA", :float},
{"YearsENERGYSTARCertified", :string},
{"ENERGYSTARScore", :float},
{"SiteEUI(kBtu/sf)", :float},
{"SiteEUIWN(kBtu/sf)", :float},
{"SourceEUI(kBtu/sf)", :float},
{"SourceEUIWN(kBtu/sf)", :float},
{"SiteEnergyUse(kBtu)", :float},
{"SiteEnergyUseWN(kBtu)", :float},
{"SteamUse(kBtu)", :float},
{"Electricity(kWh)", :float},
{"Electricity(kBtu)", :float},
{"NaturalGas(therms)", :float},
{"NaturalGas(kBtu)", :float},
{"TotalGHGEmissions", :float},
{"GHGEmissionsIntensity", :float}]
df =
DF.load_csv!(File.read!(Kino.FS.file_path("2016_Building_Energy_Benchmarking.csv")),
delimiter: ",",
dtypes: dtypes
)
I get this : "Polars Error: could not parse 'NonResidential' as dtype 'f64' at column 'Electricity(kWh)' (column number 3)
Column number 3 should indeed store BuildingType
and not Electricity(kWh)
.
:raised_eyebrow: :raised_eyebrow:
I cannot reproduce the same failure as you for the dtypes above. Instead I get:
~/OSS/explorer[main %]$ MIX_ENV=test mix run foo.exs
** (RuntimeError) from_csv failed: Polars Error: could not parse `NULL` as dtype `f64` at column 'ENERGYSTARScore' (column number 29)
The current offset in the file is 864558 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `NULL` to the `null_values` list.
Original error: ```remaining bytes non-empty```
(explorer 0.10.0-dev) lib/explorer/data_frame.ex:621: Explorer.DataFrame.from_csv!/2
foo.exs:32: (file)
FWIW, I can parse the CSV using these arguments:
delimiter: ",",
dtypes: %{"YearsENERGYSTARCertified" => :string, "TaxParcelIdentificationNumber" => :string, "Electricity(kBtu)" => :string},
infer_schema_length: 20000
But the original issue has to be fixed in Polars anyway.
It turns out there was a bug in Explorer as well, but there is a bug in Polars too.
Hello !
When I load those data from CSV, I get the col names in an unexpected order :
output:
Neighborhood, TotalGHGEmission, Electricity(kBtu) are not at the right place, the right names respectively are : OSEBuildingID, DataYear, BuildingType...
With Explorer 8.3, the names are OK
Am I missing something ? Thank you <3 !