bundestag / gesetze-tools

Scripts to maintain German law git repository
GNU Lesser General Public License v3.0

🚀 [Feature] Implement github workflow to publish data daily #16

Open ulfgebhardt opened 3 years ago

ulfgebhardt commented 3 years ago

🚀 Feature

Implement github workflow to publish data daily

Please help implement this: if you have the free time to do it, we would solve a three-year-old problem that pops up every election year. Pinging capable and potentially interested people out of the blue: @Muehe @JBBgameich <3

User Problem

We would then have plain-text data here on GitHub:

https://github.com/bundestag/gesetze/issues/55

Implementation

Use GitHub workflows. See these examples:

- https://github.com/Ocelot-Social-Community/Ocelot-Social/blob/master/.github/workflows/publish.yml
- https://github.com/gradido/gradido/blob/master/.github/workflows/publish.yml
- https://github.com/mattia-lerario/Mentor-Application-Bachelor-Project/blob/master/.github/workflows/test.yml#L23
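A minimal sketch of such a workflow, modeled on the linked examples. The cron schedule, Python version, and especially the script invocations are assumptions and would need to be checked against the actual CLIs:

```yaml
name: Publish data

on:
  schedule:
    - cron: "0 4 * * *"    # daily at 04:00 UTC (assumed schedule)
  workflow_dispatch:        # allow manual runs for debugging

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # Script names are from the readme; the exact arguments are placeholders.
      - run: python lawde.py loadall
      - run: python lawdown.py convert
      # Committing the result to bundestag/gesetze would additionally need
      # a checkout of that repo and a token with push rights.
```

Pushing to the data repo from the workflow would require a deploy key or a personal access token stored as a repository secret.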

Additional context

Also, it would be ideal if we could somehow create a "binary", or have the interpreter run over it. Unfortunately, I have never done anything serious with Python; I believe that means creating packages or something like that.

I have no objection at all to merging this. The repo is currently pretty dead.

However, with the upcoming Bundestag election I am getting more requests about https://github.com/bundestag/gesetze, and as far as I understand it, this repo is responsible for generating that content.

See bundestag/gesetze#59 and bundestag/gesetze#55.
Today I received a request from @Muehe regarding the repo.

I would think it great if we could collaboratively manage to write a script for the GitHub workflows so that, similar to the repos we crawl for the democracy project, we get regular updates:

https://github.com/bundestag/NamedPolls
https://github.com/bundestag/NamedPollDeputies
https://github.com/bundestag/ConferenceWeekDetails
https://github.com/bundestag/dip21-daten

Unfortunately, the people from https://github.com/bundestag/offenegesetze.de have not yet stepped in to take on this task.

So @JBBgameich: are you up for doing something like this? Should I merge this PR now?


darkdragon-001 commented 3 years ago

Who maintains bundestag/gesetze? Who has pull/merge rights?

There are lots of open pull requests which haven't been merged yet. One should first get the manual workflow running before trying to automate things.

jbruechert commented 3 years ago

Most of the pull requests are either jokes, drafts, or too large to review. Generating an up-to-date version from source is probably a better course of action.

ulfgebhardt commented 3 years ago

I have sort of started taking responsibility, since people come to me and ask about the repo, though I have nothing to do with it. My approach is to find people who want to do it. I have all the rights needed and can also grant those rights: I invite people to the org if they have a commit on a repo in the org or a featured fork. This should give you more rights, though I'm not sure whether merge rights are included.

So if you want to do the automatic push thing, we can certainly make that happen rights-wise.

darkdragon-001 commented 3 years ago

Does anyone have an idea how to efficiently determine which laws changed since the last run?

While this is easy for the scrapers (BGBl, BAnz, ...), since their results are ordered by date, it is not so easy for the laws. There is the Aktualitätendienst, whose entries can be mapped to the corresponding entries in the scraped data based on page number, but I don't see how that determines which laws (name or slug) actually changed. Does anyone have an idea?
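One possible approach, independent of the Aktualitätendienst: keep the previous run's scraped JSON around and diff the two snapshots keyed by slug. This is only a sketch; the snapshot layout and the `slug` key are assumptions about the scraped data, not taken from the actual lawde.py output:

```python
import json


def changed_laws(old_path, new_path):
    """Compare two scrape snapshots and return (changed, removed) slugs.

    Both files are assumed to hold a JSON list of law records that each
    carry a unique "slug" key -- this schema is an assumption.
    """
    with open(old_path) as f:
        old = {law["slug"]: law for law in json.load(f)}
    with open(new_path) as f:
        new = {law["slug"]: law for law in json.load(f)}
    # A law counts as changed if it is new or if any of its fields differ.
    changed = [slug for slug, law in new.items() if old.get(slug) != law]
    removed = [slug for slug in old if slug not in new]
    return changed, removed
```

This sidesteps the mapping problem entirely, at the cost of re-scraping everything on each run.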

darkdragon-001 commented 12 months ago

I am wondering if it makes sense to use https://github.com/actions/cache for storing the JSON data instead of committing it to some repo, since it is fully generated. @ulfgebhardt, do you have an opinion here?

ulfgebhardt commented 12 months ago

I believe it is worthwhile to store all data in a repo: that way we make changes to the laws transparent and searchable.

Why would we hide the actual content in some volatile cache? I do not really see the benefits. Furthermore, the actual content we provide is the scraped data; we should ensure maximum visibility and transparency.

But that's all just an opinion ;)

darkdragon-001 commented 12 months ago

I don't like the fact that tooling and data are mixed in this repository. Also, using and updating the cache just seems easier. I also don't see any added benefit in storing this data, since it is fully reproducible and verifiable by anyone. No strong objection, just my personal opinion.

ulfgebhardt commented 12 months ago

Tooling happens here: https://github.com/bundestag/gesetze-tools
Data happens here: https://github.com/bundestag/gesetze

The data is not reproducible, since the official websites do not provide a history, do they?

darkdragon-001 commented 12 months ago

I am talking about the intermediate JSON files stored in https://github.com/bundestag/gesetze-tools/tree/master/data. I agree that the final Markdown files should be committed via Git to the other repository.

ulfgebhardt commented 12 months ago

OK, then I misunderstood.

mk-pmb commented 12 months ago

Hi! Sorry for being late to the party.

I am wondering if it makes sense to use https://github.com/actions/cache for storing the json data

Don't cache, always publish. If the data helps our next automated run, it will usually also help humans with their next manually invoked run. For data where git can make meaningful, useful diffs, pushing it to a repo is a good idea. For everything else, let's instead make it part of a "release", i.e. a GitHub-hosted blob download.

I don't like the fact that tooling and data is mixed in this repository.

Yes, we should strictly separate both.

I had a quick look at gesetze-tools and see several Python scripts. I assume they need to run in a temporary clone of the gesetze repo, right? From the readme I see that lawde.py and lawdown.py have to run chained. Can the others run in parallel, each in their own gesetze clone (probably with the working directory set to the repo root?), or do some of them depend on another's results? Will some of them conflict when run in parallel against the same (shared) gesetze clone? Which files do I need to collect and publish from which of the tools?

mk-pmb commented 12 months ago

Edit: Moved to #36

mk-pmb commented 12 months ago

Also, it would be nice to have a small dummy version of the data repo, with all the important structures at the latest version but much faster to clone. Or can I just pick an ancient commit? My hope is to enable quick test runs for debugging that will probably produce wrong results but can give a preview of whether it would have worked with the real data repo.