Open jbruechert opened 3 years ago
Is it smart to keep data like banz.json in this repo?
So far we are treating this data as input data for our other tools. If someone has a use for it outside of this we can move it to a new repo I guess.
Oh, I did not know that this tool processes the crawled data further. It's cool, I just thought it might be wise for separation of code & data. If it's not, that's perfectly fine.
Before we can merge this we need to fill in the years between 2013 and 2021 with a manual run locally btw. The script currently only starts at the year of the latest commit in the gesetze repo, which is from this year, so it would get confused by that.
I guess this is to increase efficiency? How long does a whole scrape take? My experience with bundestag.de is that they change old stuff regularly, and their unstable web interfaces cause faulty data to show up every now and then - this can be detected with regular crawls. So what speaks against crawling everything once a day?
The Bundesanzeiger scraper is really slow. As far as I understand the code, the actual laws are updated completely, but the index of changed laws is only extended, not refreshed. Even crawling just 2021 in the Bundesanzeiger takes about 5 minutes IIRC, and everything would take multiple hours.
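To illustrate the extend-only behaviour described above: a minimal sketch of merging freshly crawled entries into an existing index like `banz.json` without touching entries that are already there. The function name and index layout are hypothetical, not taken from the actual scraper code.

```python
import json


def extend_index(index_path, new_entries):
    """Merge freshly crawled entries into the existing index file.

    Extend-only: keys already present in the index are never
    overwritten, so old years do not need to be re-crawled.
    (Illustrative sketch; not the real scraper's code.)
    """
    try:
        with open(index_path) as f:
            index = json.load(f)
    except FileNotFoundError:
        index = {}

    for key, entry in new_entries.items():
        # setdefault only adds keys that are not present yet
        index.setdefault(key, entry)

    with open(index_path, "w") as f:
        json.dump(index, f, indent=2, sort_keys=True)
    return index
```

The trade-off is exactly the one discussed here: this is fast, but a faulty old entry stays faulty until someone forces a full refresh.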
What is the status of this? Any particular reason why the work in this PR did not get continued?
The data from 2013 to 2021 needs to be added (for example by editing updatelawsgit.py locally). Afterwards I think this could be merged. It would be nice if someone else could take over that work, I'm not that active here right now.
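A hedged sketch of what that local edit could look like. The real updatelawsgit.py apparently derives its start year from the latest commit in the gesetze repo; the helper below just makes that start year overridable for a one-off backfill run. All names here are illustrative, not the script's actual API.

```python
from datetime import date


def years_to_crawl(latest_commit_year, backfill_from=None):
    """Return the years the scraper should process, in order.

    Normally starts at the year of the latest commit in the gesetze
    repo. Passing backfill_from=2013 forces a manual run that fills
    in the missing years up to the current year.
    (Illustrative sketch; the real script's logic may differ.)
    """
    start = backfill_from if backfill_from is not None else latest_commit_year
    return list(range(start, date.today().year + 1))
```

With something like this, the one-off local run would call `years_to_crawl(latest_commit_year, backfill_from=2013)` once, and the scheduled job would keep using the commit-derived start year.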
.github/workflows/crawl.yml
so we can separate stuff and schedule the job. Ofc what you do is perfectly fine for testing.
If you need secrets in the repo to push to the correct git repo, tell me and I'll make things happen.
Profit! Good work, very awesome to see this year-old problem get solved <3