ulfgebhardt opened this issue 3 years ago
I am not convinced by this proposal, for the following reasons (experience mostly from my work on the BGBl scraper, parsing a table of contents tree):

- `url`: There are different URLs: web page, PDF, JSON.
- `data.id`: There are different IDs: TOC id, doc id, ...
- `meta` vs `data`: I don't see what value this extra layer of indirection provides.

Furthermore, different data needs to be handled differently anyway, and one has to thoroughly identify what each field means when working with the data. Properties with overly generic names often lead to false assumptions when interpreting the data.
Nevertheless, the output data structure should be documented in the README.md.
:zap: Refactor ticket
We should store the raw data (JSON) in a standard format. I propose the one generated by the https://github.com/bundestag/scapacra-bt scraper tool. All data objects have the following structure in common:
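As a minimal sketch of that common envelope: the proposal only fixes that every object has a `meta` part containing at least a `url` and a `data` part containing at least an `id`. Everything beyond those two fields below (the type names, the extra `name` field, the example values) is hypothetical and only for illustration, not taken from scapacra-bt itself.

```typescript
// Hypothetical sketch of the proposed common envelope.
// Only meta.url and data.id are specified by the proposal;
// all other names and values here are illustrative assumptions.
interface ScraperObject<T extends { id: string }> {
  meta: {
    url: string;              // where the object was scraped from
    [key: string]: unknown;   // scrapers may attach further metadata
  };
  data: T;                    // payload; its shape differs per scraper
}

// Example instance (values are made up for illustration):
const example: ScraperObject<{ id: string; name?: string }> = {
  meta: { url: "https://example.invalid/deputy/517818" },
  data: { id: "517818", name: "Example Deputy" },
};
```

The generic parameter `T` reflects the point that only the envelope is shared, while the shape of `data` stays scraper-specific.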
So basically the object is split into a `meta` and a `data` part. The `meta` part holds at least a `url` field, while `data` defines at least an `id` field.

Motive
We want to unify the structure and make it easy to understand and parse. Maybe you want to match laws with deputies and named polls or whatnot. Having a similar structure might help people do that.
Additional context
Example: https://github.com/bundestag/DeputyProfiles/blob/master/data/517818.json