CorrelAid / pystatis

MIT License

Broken Tables #152

Open julianv95 opened 1 week ago

julianv95 commented 1 week ago

Hello everyone,

First of all, thank you for this great tool!

I wanted to test the tool and downloaded a table from the Regionalstatistik, but I noticed that various values are missing in this table.

from pystatis import Table

t = Table(name="31231-02-01-4")
t.get_data(prettify=True, regionalkey="07314,07316,08222")
print(t.data)

Result:

| Stichtag | Amtlicher Gemeindeschlüssel (AGS) | Kreise und kreisfreie Städte | Wohngebaeude__MeasureUnitNotFound! | Wohnflaeche_in_Wohngebaeuden__1000_qm | Wohnungen_in_Wohn-_und_Nichtwohngebaeuden__MeasureUnitNotFound! | Raeume_in_Wohnungen_mit_7_und_mehr_Raeumen__Anzahl |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-12-31 | 7314 | Ludwigshafen am Rhein, kreisfreie Stadt | NaN | 7064.9 | NaN | 49358 |
| 2022-12-31 | 7316 | Neustadt an der Weinstraße, kreisfreie Stadt | NaN | 2806.0 | NaN | 36645 |
| 2022-12-31 | 8222 | Mannheim, Stadtkreis | NaN | 12883.4 | NaN | 67547 |

The root of the problem seems to be the flat-file export of the Regionalstatistik, as I also receive a completely broken table when I download it via the web GUI of the Regionalstatistik.

pmayd commented 1 week ago

Hi @julianv95, thanks for using our library! Indeed, Regionalstatistik seems to have quite a few problems; we also have tables that just return Code 6 and cannot be downloaded at all. I contacted the team behind the database via their official email address, and it turns out that their ffcsv export is broken, as you already assumed. They said we have to wait for the next version (Genesis 5?). The only workaround they offer right now is to download the csv format, which our library does not support because it is even harder to parse than ffcsv. I am quite unhappy with the current state, but it seems there is very little we can do. I am not sure why this problem only came up now, or when Genesis 5.0 will be available.

pmayd commented 1 week ago

But it would be great to hear what you think is missing from the table. I just downloaded the ffcsv. This is one of the tables where Regionalstatistik returns all cities at the beginning of the file (which I consider a bug, and which our package already cleans up), but apart from that the data should be there and usable. So where do you see problems with the raw data? What I can see from the header is that Regionalstatistik does not return the unit of the measurements. This happens in some cases and is also a bug in my opinion, since each statistic should have a defined unit.

Ah, I see now: it is not only the unit that is missing but the data itself, because the column contains only NaN values. Interesting. So from our point of view the package works as expected in this case: we can download and parse the ffcsv file and, as you said, the ffcsv itself is sadly the source of the problem because it does not contain the data. The data may be available in the csv format, since their support team told me that csv currently works better and ffcsv seems to have problems. In the end, I guess we have to wait for the Genesis 5.0 update to get working data again.
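A quick way to confirm this on any downloaded table is to list the columns that contain nothing but NaN. The sketch below is a minimal example using pandas, assuming `t.data` is a pandas DataFrame, as the printed result suggests. The helper name `empty_columns` is hypothetical and not part of pystatis, and the sample frame reproduces only part of the result from the issue.

```python
import pandas as pd

# Values copied from the result shown in the issue; NaN marks missing data.
nan = float("nan")
df = pd.DataFrame(
    {
        "Stichtag": ["2022-12-31"] * 3,
        "Amtlicher Gemeindeschlüssel (AGS)": [7314, 7316, 8222],
        "Wohngebaeude__MeasureUnitNotFound!": [nan, nan, nan],
        "Wohnflaeche_in_Wohngebaeuden__1000_qm": [7064.9, 2806.0, 12883.4],
        "Wohnungen_in_Wohn-_und_Nichtwohngebaeuden__MeasureUnitNotFound!": [nan, nan, nan],
        "Raeume_in_Wohnungen_mit_7_und_mehr_Raeumen__Anzahl": [49358, 36645, 67547],
    }
)


def empty_columns(frame):
    """Return the names of all columns whose values are entirely NaN.

    Hypothetical helper for illustration, not a pystatis function.
    """
    return [col for col in frame.columns if frame[col].isna().all()]


missing = empty_columns(df)
print(missing)
```

With a real download, the same check could be run as `empty_columns(t.data)` right after `t.get_data(...)`.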

pmayd commented 1 week ago

So the csv export seems to work better; we just don't support it. And looking at the raw data, I honestly don't have the slightest idea what I am seeing here:

GENESIS-Tabelle: 31231-02-01-4
Bestand an Wohngebäuden und Wohnungen in Wohn- und;;;;;;;;;;;;;;;;
Nichtwohngebäuden - Stichtag 31.12. - regionale Tiefe:;;;;;;;;;;;;;;;;
Kreise und krfr. Städte;;;;;;;;;;;;;;;;
Fortschreibung des Wohngebäude- und Wohnungsbestandes ;;;;;;;;;;;;;;;;
;;Wohngebäude;;;;;Wohnfläche in Wohngebäuden;Wohnungen in Wohn- und Nichtwohngebäuden;;;;;;;;Räume in Wohnungen mit 7 und mehr Räumen
;;Wohngebäude nach Anzahl der Wohnungen;;;;;;Größe der Wohnung;;;;;;;;
;;Insgesamt;Wohngebäude mit 1 Wohnung;Wohngebäude mit 2 Wohnungen;Wohngebäude mit 3 und mehr Wohnungen;Wohnheime;;Insgesamt;Wohnungen mit 1 Raum;Wohnungen mit 2 Räumen;Wohnungen mit 3 Räumen;Wohnungen mit 4 Räumen;Wohnungen mit 5 Räumen;Wohnungen mit 6 Räumen;Wohnungen mit 7 Räumen oder mehr;
;;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;1000 qm;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl
31.12.2022;;;;;;;;;;;;;;;;
DG;Deutschland;19479501;13010370;3180141;3265912;23078;3870328,0;43366919;1526729;4133067;9434528;10875610;7307548;4786309;5303128;43066033

It seems like there are definitely more values/measurements here than in the ffcsv, so yes, ffcsv is definitely broken. But the csv format is completely unreadable to me: it is not clear which columns and feature names the values belong to.
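One possible reading of the csv export quoted above is that the first three header rows form a hierarchy (statistic, classification, category), with each label written only in the first column of the span it covers, and that the fourth row holds the units. The sketch below tests that reading on the quoted rows using only the standard library. It is an assumption about the layout, not a documented format and not something pystatis implements; the function names and the `code`/`name` column labels are hypothetical. It also assumes every value is numeric with a decimal comma, which may not hold for tables with placeholder symbols.

```python
import csv
import io

# Header rows, date row and data row copied from the csv export quoted above.
# The title lines before the header are left out; skipping them in a real
# file is not covered here.
RAW = """\
;;Wohngebäude;;;;;Wohnfläche in Wohngebäuden;Wohnungen in Wohn- und Nichtwohngebäuden;;;;;;;;Räume in Wohnungen mit 7 und mehr Räumen
;;Wohngebäude nach Anzahl der Wohnungen;;;;;;Größe der Wohnung;;;;;;;;
;;Insgesamt;Wohngebäude mit 1 Wohnung;Wohngebäude mit 2 Wohnungen;Wohngebäude mit 3 und mehr Wohnungen;Wohnheime;;Insgesamt;Wohnungen mit 1 Raum;Wohnungen mit 2 Räumen;Wohnungen mit 3 Räumen;Wohnungen mit 4 Räumen;Wohnungen mit 5 Räumen;Wohnungen mit 6 Räumen;Wohnungen mit 7 Räumen oder mehr;
;;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;1000 qm;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl;Anzahl
31.12.2022;;;;;;;;;;;;;;;;
DG;Deutschland;19479501;13010370;3180141;3265912;23078;3870328,0;43366919;1526729;4133067;9434528;10875610;7307548;4786309;5303128;43066033
"""


def fill_spans(header_rows):
    """Copy a label into the empty cells to its right, but only while the
    labels in all rows above are unchanged (same parent span)."""
    filled = []
    for level, row in enumerate(header_rows):
        out = []
        for col, cell in enumerate(row):
            if not cell and col > 0:
                same_parent = all(
                    filled[above][col] == filled[above][col - 1]
                    for above in range(level)
                )
                if same_parent:
                    cell = out[col - 1]
            out.append(cell)
        filled.append(out)
    return filled


def parse_csv_export(raw, header_levels=3):
    """Return (column names, records) for the assumed layout."""
    rows = list(csv.reader(io.StringIO(raw), delimiter=";"))
    levels = fill_spans(rows[:header_levels])
    units = rows[header_levels]
    columns = ["code", "name"]  # hypothetical labels for the two key columns
    for col in range(2, len(units)):
        labels = [level[col] for level in levels if level[col]]
        columns.append(" / ".join(labels) + f" [{units[col]}]")
    records, date = [], None
    for row in rows[header_levels + 1:]:
        if row[0] and not any(row[1:]):  # a row holding only the date
            date = row[0]
            continue
        record = {"date": date, "code": row[0], "name": row[1]}
        for column, value in zip(columns[2:], row[2:]):
            record[column] = float(value.replace(",", "."))  # decimal comma
        records.append(record)
    return columns, records


columns, records = parse_csv_export(RAW)
print(columns[2])
print(records[0][columns[7]])
```

Filling a label only while the parent labels stay the same keeps "Wohngebäude nach Anzahl der Wohnungen" from leaking into the "Wohnfläche in Wohngebäuden" column, which has no sub-levels of its own.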