Orders of names after CSV loading seems buggy since 9.0

clm-a commented 3 months ago

Hello !

When I load those data from CSV, I get the col names in an unexpected order :


columns = ["OSEBuildingID", "DataYear", "BuildingType", "PrimaryPropertyType", "PropertyName", "Address", "City", "State", "ZipCode", "TaxParcelIdentificationNumber", "CouncilDistrictCode", "Neighborhood", "Latitude", "Longitude", "YearBuilt", "NumberofBuildings", "NumberofFloors", "PropertyGFATotal", "PropertyGFAParking", "PropertyGFABuilding(s)", "ListOfAllPropertyUseTypes", "LargestPropertyUseType", "LargestPropertyUseTypeGFA", "SecondLargestPropertyUseType", "SecondLargestPropertyUseTypeGFA", "ThirdLargestPropertyUseType", "ThirdLargestPropertyUseTypeGFA", "YearsENERGYSTARCertified", "ENERGYSTARScore", "SiteEUI(kBtu/sf)", "SiteEUIWN(kBtu/sf)", "SourceEUI(kBtu/sf)", "SourceEUIWN(kBtu/sf)", "SiteEnergyUse(kBtu)", "SiteEnergyUseWN(kBtu)", "SteamUse(kBtu)", "Electricity(kWh)", "Electricity(kBtu)", "NaturalGas(therms)", "NaturalGas(kBtu)", "DefaultData", "Comments", "ComplianceStatus", "Outlier", "TotalGHGEmissions", "GHGEmissionsIntensity"]
dtypes = Enum.map(columns, &{&1, :string})

df =
  DF.load_csv!(File.read!(Kino.FS.file_path("2016_Building_Energy_Benchmarking.csv")),
    delimiter: ",",
    dtypes: dtypes
  )

output:

#Explorer.DataFrame<
  Polars[3376 x 46]
  Neighborhood string ["1", "2", "3", "5", "8", ...]
  TotalGHGEmissions string ["2016", "2016", "2016", "2016", "2016", ...]
  Electricity(kBtu) string ["NonResidential", "NonResidential", "NonResidential", "NonResidential",
   "NonResidential", ...]
.....

Neighborhood, TotalGHGEmission, Electricity(kBtu) are not at the right place, the right names respectively are : OSEBuildingID, DataYear, BuildingType...

With Explorer 8.3, the names are OK

Am I missing something ? Thank you <3 !

josevalim commented 3 months ago

This turns out to be a bug in Polars: https://github.com/pola-rs/polars/commit/c1a1a015ac2cb78b73756f192f99f28e066ae198#r144864748

You don't need to pass all fields in your dtypes. If you pass all fields, Polars will consider it to be both names and types. But if you pass even one field less, then it uses only the dtype. I suggest removing one of your columns that is already a string (or you don't want to cast to string).

clm-a commented 3 months ago

Thanks for you reply !

I declared all the columns as strings (then casted later on) as a workaround because I didn't find the type inference satisfying :

WIth

df =
  DF.load_csv!(File.read!(Kino.FS.file_path("2016_Building_Energy_Benchmarking.csv")),
    delimiter: ","
  )

I get some inference errors like Polars Error: could not parse '201620142012201120082007' as dtype 'i64' at column 'YearsENERGYSTARCertified', even if I pass "infer_schema_length: nil"

Here, column / data are matching, but it could have been inferred as string with a full data scan (I may have encountered the problem on other columns and in previous Explorer versions too).

That said, I tried to specify only the dtypes I consider relevant :

dtypes = [
  {"OSEBuildingID", :string},
  {"ZipCode", :string},
  {"TaxParcelIdentificationNumber", :string},
  {"YearBuilt", :u16},
  {"NumberofBuildings", :u16},
  {"NumberofFloors", :u16},
  {"PropertyGFATotal", :float},
  {"PropertyGFAParking", :float},
  {"PropertyGFABuilding(s)", :float},
  {"LargestPropertyUseTypeGFA", :float},
  {"SecondLargestPropertyUseTypeGFA", :float},
  {"ThirdLargestPropertyUseTypeGFA", :float},
  {"YearsENERGYSTARCertified", :string},
  {"ENERGYSTARScore", :float},
  {"SiteEUI(kBtu/sf)", :float},
  {"SiteEUIWN(kBtu/sf)", :float},
  {"SourceEUI(kBtu/sf)", :float},
  {"SourceEUIWN(kBtu/sf)", :float},
  {"SiteEnergyUse(kBtu)", :float},
  {"SiteEnergyUseWN(kBtu)", :float},
  {"SteamUse(kBtu)", :float},
  {"Electricity(kWh)", :float},
  {"Electricity(kBtu)", :float},
  {"NaturalGas(therms)", :float},
  {"NaturalGas(kBtu)", :float},
  {"TotalGHGEmissions", :float},
  {"GHGEmissionsIntensity", :float}]

df =
  DF.load_csv!(File.read!(Kino.FS.file_path("2016_Building_Energy_Benchmarking.csv")),
    delimiter: ",",
    dtypes: dtypes
  )

I get this : "Polars Error: could not parse 'NonResidential' as dtype 'f64' at column 'Electricity(kWh)' (column number 3)

Column number 3 should indeed store BuildingType and not Electricity(kWh).

:raised_eyebrow: :raised_eyebrow:

josevalim commented 3 months ago

I cannot reproduce the same failure as you for the dtypes above. Instead I get:

~/OSS/explorer[main %]$ MIX_ENV=test mix run foo.exs
** (RuntimeError) from_csv failed: Polars Error: could not parse `NULL` as dtype `f64` at column 'ENERGYSTARScore' (column number 29)

The current offset in the file is 864558 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `NULL` to the `null_values` list.

Original error: ```remaining bytes non-empty```
    (explorer 0.10.0-dev) lib/explorer/data_frame.ex:621: Explorer.DataFrame.from_csv!/2
    foo.exs:32: (file)

josevalim commented 3 months ago

FWIW, I can parse the CSV using these arguments:

    delimiter: ",",
    dtypes: %{"YearsENERGYSTARCertified" => :string, "TaxParcelIdentificationNumber" => :string, "Electricity(kBtu)" => :string},
    infer_schema_length: 20000

But the original issue has to be fixed in Polars anyway.

josevalim commented 3 months ago

It turns out there was a bug in Explorer as well, but there is a bug in Polars too.

elixir-explorer / explorer

Orders of names after CSV loading seems buggy since 9.0 #953