Closed hyanwong closed 7 months ago
The code is written to be incremental by default. For example, for the Wikidata dump, it checks whether you already have a OneZoom_latest-all.json with a timestamp that matches latest-all.json.bz2, and if so it does not regenerate it (there is a force flag, -f, to override this).
Can you check under data\Wiki\wd_JSON to make sure that is indeed the case for you? For example:
-rw-r--r-- 1 david david 1545451693 Jun 15 00:26 OneZoom_latest-all.json
-rw-r--r-- 1 david david 82551086015 Jun 15 00:26 latest-all.json.bz2
I just ran it on mine where everything is already filtered, and it took just 45 seconds, with the following output:
INFO:root:Using cached file EOL/OneZoom_provider_ids.csv
INFO:root:Using cached file Wiki/wd_JSON/OneZoom_latest-all.json
INFO:root:Using cached file Wiki/wp_SQL/OneZoom_enwiki-latest-page.sql
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-04-views-ge-5-totals
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-05-views-ge-5-totals
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-06-views-ge-5-totals
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-07-views-ge-5-totals
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-08-views-ge-5-totals
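The incremental behaviour described above can be sketched roughly as follows. This is only an illustration of the general idea (skip the work when the filtered file already exists and carries the same timestamp as its source), not the project's actual code. The names `is_cache_fresh`, `filter_if_needed`, and `filter_func` are hypothetical, and comparing modification times at one-second resolution is an assumption.

```python
import logging
import os


def is_cache_fresh(source_path, filtered_path):
    """Return True if filtered_path exists and has the same mtime as source_path."""
    if not (os.path.exists(source_path) and os.path.exists(filtered_path)):
        return False
    # Compare whole seconds only (an assumption, to avoid sub-second differences).
    return int(os.path.getmtime(source_path)) == int(os.path.getmtime(filtered_path))


def filter_if_needed(source_path, filtered_path, filter_func, force=False):
    """Run filter_func(source_path, filtered_path) unless a fresh cached file exists.

    Returns True if the filtered file was (re)generated, False if the cache was used.
    `force` plays the role of a -f style override.
    """
    if not force and is_cache_fresh(source_path, filtered_path):
        logging.info("Using cached file %s", filtered_path)
        return False
    filter_func(source_path, filtered_path)
    # Stamp the output with the source's timestamps so that the next run sees a match.
    source_stat = os.stat(source_path)
    os.utime(filtered_path, (source_stat.st_atime, source_stat.st_mtime))
    return True
```

With a scheme like this, a rerun after a failure only repeats the steps whose output is missing or out of date, so a misnamed or missing directory further down the pipeline does not force the expensive filtering to run again.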
Ah, perfect. It may be worth noting this in the instructions?
(and yes, I do have both files)
> Ah, perfect. It may be worth noting this in the instructions?
Indeed, and I just did add a paragraph in that section before the command.
My error was that I accidentally named the SQL directory wd_SQL, not wp_SQL. I think it would be useful to commit a .gitignore in each of the wd and wp directories so that they are forced to exist on GH. What do you think, @davidebbo?
> > Ah, perfect. It may be worth noting this in the instructions?
> Indeed, and I just did add a paragraph in that section before the command.
Great, thanks. Sorry if I missed that.
> My error was that I accidentally named the SQL directory wd_SQL, not wp_SQL. I think it would be useful to commit a .gitignore in each of the wd and wp directories so that they are forced to exist on GH. What do you think, @davidebbo?
Yes, that's definitely helpful. Added via #34
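For anyone who wants to reproduce the placeholder idea locally, here is a minimal sketch. It assumes the common Git convention of a .gitignore that ignores everything in a directory except itself, so the otherwise empty directory can be committed. The directory list is taken from the paths mentioned in this thread. The helper name `ensure_placeholder_dirs` is hypothetical, and the actual contents of the change in #34 are not shown here and may differ.

```python
from pathlib import Path

# Subdirectories mentioned in this thread, relative to the data directory.
EXPECTED_DIRS = ["Wiki/wd_JSON", "Wiki/wp_SQL", "Wiki/wp_pagecounts"]

# Ignore every file in the directory except the .gitignore itself.
GITIGNORE_BODY = "*\n!.gitignore\n"


def ensure_placeholder_dirs(data_root, subdirs=EXPECTED_DIRS):
    """Create each expected directory and give it a placeholder .gitignore.

    Returns the list of .gitignore paths that were newly written.
    """
    created = []
    for sub in subdirs:
        directory = Path(data_root) / sub
        directory.mkdir(parents=True, exist_ok=True)
        ignore_file = directory / ".gitignore"
        if not ignore_file.exists():
            ignore_file.write_text(GITIGNORE_BODY)
            created.append(ignore_file)
    return created
```

Because the directories then exist in a fresh clone, a typo such as wd_SQL shows up as an unexpected extra directory rather than silently replacing the expected one.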
I followed the instructions for creating the filtered files, and got:
I think this is my fault, and I just need to add a missing file. But I don't want to have to re-run the wd_JSON filtering, which took a day on my machine. It's not clear to me whether or how I can rerun without repeating the steps that have already been done.