bundestag / gesetze-tools

Scripts to maintain German law git repository
GNU Lesser General Public License v3.0

Draft: Update data on the CI #26

Open · jbruechert opened 3 years ago

ulfgebhardt commented 3 years ago
  1. The update & push to GitHub must become its own workflow, e.g. `.github/workflows/crawl.yml`, so we can separate things and schedule the job (a sketch of such a workflow follows after this list). Of course, what you do now is perfectly fine for testing. A scheduled workflow looks like this:
name: "Close stale issues"
on:
  schedule:
  - cron: "0 0 * * *"

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/stale@v1.1.0
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        stale-issue-message: 'Message to comment on stale issues. If none provided, will not mark issues stale'
        stale-pr-message: 'Message to comment on stale PRs. If none provided, will not mark PRs stale'


  2. If you need secrets in the repo to push to the correct git repo, tell me and I'll make it happen.

  3. Is it smart to keep data like banz.json in this repo?

  4. ....

  5. Profit! Good work, it's very awesome to see this year-old problem get solved <3
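
Picking up item 1, here is a minimal sketch of what a separate `.github/workflows/crawl.yml` could look like, modeled on the schedule template above. The scraper invocation and output path are placeholders (`banz_scraper.py` and `data/banz.json` are assumptions, not necessarily the repo's actual entry points), and pushing to a different repository would additionally need the secrets from item 2:

```yaml
# .github/workflows/crawl.yml -- sketch only; script name and paths are placeholders
name: "Crawl law data"
on:
  schedule:
    - cron: "0 3 * * *"   # once a day, off-peak

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - run: pip install -r requirements.txt
      # placeholder for the actual scraper invocation
      - run: python banz_scraper.py data/banz.json
      - name: Commit and push updated data
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add data/
          # only commit when the crawl actually changed something
          git diff --cached --quiet || git commit -m "Update crawled data"
          git push
```

With the default `GITHUB_TOKEN`, this can only push back to the same repository; pushing results into the separate gesetze repo would need a deploy key or personal access token stored as a secret.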

jbruechert commented 3 years ago

> Is it smart to keep data like banz.json in this repo?

So far we have been treating this data as input for our other tools. If someone has a use for it outside of this, we can move it to a new repo, I guess.

ulfgebhardt commented 3 years ago

Oh, I did not know that this tool processes the crawled data further. It's cool; I just thought it might be wise to separate code and data. If it's not, that's perfectly fine.

jbruechert commented 3 years ago

Before we can merge this, we need to fill in the years between 2013 and 2021 with a manual local run, by the way. The script currently only starts at the year of the latest commit in the gesetze repo, which is from this year, so it would get confused by the gap.

ulfgebhardt commented 3 years ago

I guess this is to increase efficiency? How long does a whole scrape take? My experience with bundestag.de is that they change old content regularly, and their unstable web interfaces cause faulty data to show up every now and then; this can be detected with regular crawls. So what speaks against crawling everything once a day?

jbruechert commented 3 years ago

The Bundesanzeiger scraper is really slow. As far as I understand the code, the actual laws are updated completely, but the index of changed laws is only extended, not refreshed. Even crawling just 2021 in the Bundesanzeiger takes about 5 minutes, IIRC, and everything would take multiple hours.

Croydon commented 3 years ago

What is the status of this? Is there any particular reason why the work in this PR was not continued?

jbruechert commented 3 years ago

The data for the years between 2013 and 2021 still needs to be added (for example by editing updatelawsgit.py locally). After that, I think this could be merged. It would be nice if someone else could take over that work; I'm not that active here right now.
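
If nobody gets around to a local run, one alternative sketch (not something the script supports today) would be to trigger the backfill through a manually dispatched workflow with an explicit start year. The `--start-year` flag is hypothetical and would first have to be added to updatelawsgit.py, since, per the comments above, the script currently derives its start year from the latest commit in the gesetze repo:

```yaml
# Sketch of a manual backfill trigger; --start-year is a hypothetical flag
name: "Backfill law data"
on:
  workflow_dispatch:
    inputs:
      start_year:
        description: "First year to (re)crawl"
        required: true
        default: "2013"

jobs:
  backfill:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - run: pip install -r requirements.txt
      # hypothetical flag; today the start year comes from the latest
      # commit in the gesetze repo, which is why a local edit is needed
      - run: python updatelawsgit.py --start-year "${{ github.event.inputs.start_year }}"
```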