datenguide / datenguide-api

datenguide GraphQL API Server
MIT License

Metadata: Different kinds of missingness #171

Open dpprdan opened 4 years ago

dpprdan commented 4 years ago

Destatis reports different kinds of missingness on regionalstatistik.de (e.g. 0 apparently does not mean zero).

[screenshot]

For example, regionalstatistik.de reports "." for statistic 61511:BAU001 (Veräußerungsfälle von Bauland, i.e. sales of building land), region Flensburg (01001), year 2018.

[screenshot]

api.datengui.de and tabular.genesapi.org report this as 0.

# Read every column as character so that IDs such as "01001" keep their leading zero
r <- purrr::partial(read.csv, colClasses = "character")
r("https://tabular.genesapi.org/?data=61511:BAU001&time=2018&region=01001&labels=id")
#>   region_id year measure value statistic
#> 1     01001 2018  BAU001     0     61511

Would it be possible to report different kinds of missingness? And is this even desirable, given that one would have to switch from numeric to character values for example? If not, should there be at least a distinction between NULL and 0?
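One possible shape for such a result, sketched here as an assumption and not as something either API currently returns: keep `value` numeric and use `NA` for a missing figure, and carry the reason in a separate character column. The column name `quality` is hypothetical.

```r
# Hypothetical result shape: numeric value plus a character quality column.
res <- data.frame(
  region_id = "01001",
  year      = "2018",
  measure   = "BAU001",
  value     = NA_real_,  # NA instead of 0: the figure is not available
  quality   = ".",       # the symbol shown on regionalstatistik.de
  statistic = "61511",
  stringsAsFactors = FALSE
)

# The value column stays numeric, so arithmetic keeps working,
# and a true zero remains distinguishable from a missing value.
is.numeric(res$value)
is.na(res$value)
```

With this layout nobody has to switch from numeric to character values, and the kind of missingness is still available to those who need it.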

dpprdan commented 4 years ago

The Genesis SOAP API returns not only a value but also a quality indicator. The SOAP API query that corresponds to the example above is:

https://www.regionalstatistik.de/genesisws/services/ExportService_2010?method=DatenExport&kennung=USERNAME&passwort=PASSWORD&namen=61511KJ001&bereich=Alle&format=csv&werte=true&metadaten=false&zusatz=false&startjahr=2018&endjahr=2018&zeitscheiben=&inhalte=&regionalmerkmal=&regionalschluessel=01001&sachmerkmal=&sachschluessel=&sachmerkmal2=&sachschluessel2=&sachmerkmal3=&sachschluessel3=&stand=&sprache=en

which returns the following quaderDaten:

<quaderDaten>
* Der Benutzer USERNAME der Benutzergruppe USERNAME hat am 28.02.2020 um 16:37:34 diesen Export angestossen.
K;DQ;FACH-SCHL;GHH-ART;GHM-WERTE-JN;GENESIS-VBD;REGIOSTAT;EU-VBD;"mit Werten"
D;61511KJ001;;N;J;N;N
K;DQ-ERH;FACH-SCHL
D;61511
K;DQA;NAME;RHF-BSR;RHF-ACHSE
D;KREISE;1;1
K;DQZ;NAME;ZI-RHF-BSR;ZI-RHF-ACHSE
D;JAHR;2;2
K;DQI;NAME;ME-NAME;DST;TYP;NKM-STELLEN
D;BAU001;Anzahl;GANZ;FALL;0
D;BAU002;1000 qm;GANZ;FALL;0
D;BAU003;Tsd. EUR;GANZ;FALL;0
D;BAU004;EUR;FEST;FALL;2
K;QEI;FACH-SCHL;ZI-WERT;WERT;QUALITAET;GESPERRT;WERT-VERFAELSCHT
D;01001;2018;0;.;;0;0;.;;0;0;.;;0;0.00;.;;0.00
</quaderDaten>

Notice the "." entries on the last line, which are the quality indicators.

The potential categories of the quality indicator are available via the SOAP API as well (ZeichenKatalog). Most of these categories represent (different reasons for) missingness.

If the quality indicator is present, the value is 0 - and not NULL. This is why it is necessary (IMHO) for datengui.de to return the quality indicator as well (optionally on request).
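A minimal sketch of how the value/quality pairs in the QEI data line above could be pulled apart in base R. Several things here are assumptions for illustration: that each measure contributes four fields in the order given by the K;QEI header (WERT, QUALITAET, GESPERRT, WERT-VERFAELSCHT), that the measures appear in the order of the DQI block, and that "." marks a missing value. The real set of codes would have to come from the ZeichenKatalog.

```r
# The QEI data line from the quaderDaten above
qei_line <- "D;01001;2018;0;.;;0;0;.;;0;0;.;;0;0.00;.;;0.00"
fields <- strsplit(qei_line, ";", fixed = TRUE)[[1]]

# Assumption: measures come in the order listed in the DQI block
measures <- c("BAU001", "BAU002", "BAU003", "BAU004")

# Drop "D", region and year; the rest is four fields per measure
cells <- matrix(
  fields[-(1:3)], ncol = 4, byrow = TRUE,
  dimnames = list(measures, c("value", "quality", "locked", "distorted"))
)

out <- data.frame(
  region_id = fields[2],
  year      = fields[3],
  measure   = measures,
  value     = as.numeric(cells[, "value"]),
  quality   = unname(cells[, "quality"]),
  stringsAsFactors = FALSE
)

# Assumption: only these indicator codes mean "missing"
missing_codes <- c(".")
out$value[out$quality %in% missing_codes] <- NA

out
```

Keeping `quality` next to `value` means indicators that do not signal missingness can pass through without the value being discarded.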

To make things a bit more ... interesting, some missing values are indeed missing (deliberately) from the SOAP API, but not from the web interface. The following is the same query for another region (AGS 08316):

https://www.regionalstatistik.de/genesisws/services/ExportService_2010?method=DatenExport&kennung=USERNAME&passwort=PASSWORD&namen=61511KJ001&bereich=Alle&format=csv&werte=true&metadaten=false&zusatz=false&startjahr=2018&endjahr=2018&zeitscheiben=&inhalte=&regionalmerkmal=&regionalschluessel=08316&sachmerkmal=&sachschluessel=&sachmerkmal2=&sachschluessel2=&sachmerkmal3=&sachschluessel3=&stand=&sprache=en

which returns

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<DatenExportResponse soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<DatenExportReturn>
<quader xmlns:ns1="daten.methods.webservice.genesis" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" soapenc:arrayType="ns1:Quader[1]" xsi:type="soapenc:Array">
<quader>
<format/>
<name>61511KJ001</name>
<quaderDaten/>
<returnInfo>
<code>61</code>
<inhalt>There are no values available.</inhalt>
<typ>Fehler</typ>
</returnInfo>
<stand/>
<status>Aktualisierte Daten</status>
</quader>
</quader>
<quaderAuswahl>
<bereich>Alle</bereich>
<namen>61511KJ001</namen>
</quaderAuswahl>
<quaderOptionen>
<endjahr>2018</endjahr>
<format>csv</format>
<inhalte/>
<metadaten>false</metadaten>
<regionalMerkmal/>
<regionalSchluessel>08316</regionalSchluessel>
<sachMerkmal/>
<sachMerkmal2/>
<sachMerkmal3/>
<sachSchluessel/>
<sachSchluessel2/>
<sachSchluessel3/>
<sprache>en</sprache>
<stand/>
<startjahr>2018</startjahr>
<werte>true</werte>
<zeitscheiben>0</zeitscheiben>
<zusatz>false</zusatz>
</quaderOptionen>
<returnInfo>
<code>1</code>
<inhalt>
At least one object has reported a warning or error.
</inhalt>
<typ>Information</typ>
</returnInfo>
</DatenExportReturn>
</DatenExportResponse>
</soapenv:Body>
</soapenv:Envelope>

If one omits the 08316 Regionalschlüssel from the query above, it returns values for all available regions but nothing for 08316.

If one specifies a different startjahr, e.g. 2017, it returns data for 2017 but not for 2018.

According to the Team Regionaldatenbank Deutschland this is the intended behaviour.

The web interface ("Abruftabellen"), however, returns the following, i.e. it does not omit the data for 2018:

[screenshot]

The latter, i.e. the SOAP API not returning any values for the query above, probably does not have any real consequences for (api.)datengui.de. At the moment it leads to an Internal Server Error on tabular.genesapi.org though:

https://tabular.genesapi.org/?data=61511:BAU001&time=2018&region=08316

So from my point of view, both api.datengui.de and tabular.genesapi.org should return the quality indicator (on request). And tabular.genesapi.org should be able to handle empty responses instead of throwing an error.
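To illustrate the second point, here is a rough sketch of treating the "no values" case as an empty result rather than an error. It is written in R for consistency with the example above and says nothing about how tabular.genesapi.org is actually implemented. The function `quader_to_rows()` and its arguments are hypothetical; it assumes the caller has already extracted the quaderDaten text and the returnInfo code from the SOAP response.

```r
# Hypothetical helper: names and arguments are illustrative only.
quader_to_rows <- function(quader_daten, return_code) {
  empty <- data.frame(
    region_id = character(),
    year      = character(),
    stringsAsFactors = FALSE
  )

  # Code 61 ("There are no values available.") arrives with an empty
  # <quaderDaten/>, so return zero rows instead of failing.
  if (identical(return_code, "61") || !nzchar(quader_daten)) {
    return(empty)
  }

  lines <- strsplit(quader_daten, "\n", fixed = TRUE)[[1]]
  header_at <- grep("^K;QEI;", lines)
  if (length(header_at) == 0L) {
    return(empty)
  }

  # Data records are the "D;" lines that follow the K;QEI header
  rows  <- lines[seq_along(lines) > header_at[1] & startsWith(lines, "D;")]
  parts <- strsplit(rows, ";", fixed = TRUE)

  data.frame(
    region_id = vapply(parts, `[`, character(1), 2),
    year      = vapply(parts, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

# The 08316 response above: empty quaderDaten, returnInfo code 61
nrow(quader_to_rows("", "61"))
```

A zero-row result with the usual columns lets downstream code handle "no data for this region and year" in the same way as any other query.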

sjockers commented 4 years ago

Hi Daniel, thanks for the detailed description of this problem and sorry for the extremely slow reply. We will definitely support quality indicators and other "footnotes". We are currently in the process of figuring out how we will represent this in the API. We will let you know once we know more.