OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images
MIT License

Errors & restarting file generation #33

Closed by hyanwong 7 months ago

hyanwong commented 7 months ago

I followed the instructions for creating the filtered files, and got:

INFO:root:Finished generating file Wiki/wd_JSON/OneZoom_latest-all.json
INFO:root:Generating file Wiki/wp_SQL/OneZoom_enwiki-latest-page.sql
Traceback (most recent call last):
  File "/Users/yan/mambaforge/bin/generate_filtered_files", line 33, in <module>
    sys.exit(load_entry_point('oz-tree-build', 'console_scripts', 'generate_filtered_files')())
  File "/Users/yan/Documents/GitHub/tree-build/oz_tree_build/utilities/generate_filtered_files.py", line 519, in main
    process_args(args)
  File "/Users/yan/Documents/GitHub/tree-build/oz_tree_build/utilities/generate_filtered_files.py", line 459, in process_args
    generate_all_filtered_files(
  File "/Users/yan/Documents/GitHub/tree-build/oz_tree_build/utilities/generate_filtered_files.py", line 434, in generate_all_filtered_files
    generate_and_cache_filtered_file(
  File "/Users/yan/Documents/GitHub/tree-build/oz_tree_build/utilities/generate_filtered_files.py", line 73, in generate_and_cache_filtered_file
    processing_function(original_file, clade_filtered_file, context)
  File "/Users/yan/Documents/GitHub/tree-build/oz_tree_build/utilities/generate_filtered_files.py", line 328, in generate_filtered_wikipedia_sql_dump
    with open_file_based_on_extension(
  File "/Users/yan/Documents/GitHub/tree-build/oz_tree_build/utilities/file_utils.py", line 20, in open_file_based_on_extension
    return open(filename, mode, encoding="utf-8")
FileNotFoundError: [Errno 2] No such file or directory: 'Wiki/wp_SQL/OneZoom_enwiki-latest-page.sql'
[2]  + exit 1     generate_filtered_files OZTreeBuild/AllLife/AllLife_full_tree.phy     

I think this is my fault, and I just need to add a missing file. But I don't want to have to re-run the wd_JSON filtering, which took a day on my machine. It's not clear to me whether or how I can rerun without repeating the steps that have already been done.

davidebbo commented 7 months ago

The code is written to be incremental by default. For example, for the Wikidata dump, it checks whether you already have a OneZoom_latest-all.json with a timestamp that matches latest-all.json.bz2, and if so it does not regenerate it (there is a -f flag to force regeneration and override this).

Can you check under data\Wiki\wd_JSON to make sure that is indeed the case for you? e.g.

-rw-r--r-- 1 david david  1545451693 Jun 15 00:26 OneZoom_latest-all.json
-rw-r--r-- 1 david david 82551086015 Jun 15 00:26 latest-all.json.bz2
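As a rough illustration of the timestamp check described above, here is a minimal sketch. It is not the actual code in generate_filtered_files.py: the function names (`is_cache_fresh`, `mark_cache_fresh`, `generate_if_needed`) are hypothetical, and comparing modification times at whole-second resolution is an assumption made for this example.

```python
import os
from pathlib import Path


def is_cache_fresh(source: Path, filtered: Path) -> bool:
    """Return True if the filtered file exists and its mtime matches the source's.

    Hypothetical helper: compares at whole-second resolution (an assumption).
    """
    if not (source.exists() and filtered.exists()):
        return False
    return int(source.stat().st_mtime) == int(filtered.stat().st_mtime)


def mark_cache_fresh(source: Path, filtered: Path) -> None:
    """Copy the source's access and modification times onto the filtered file."""
    st = source.stat()
    os.utime(filtered, ns=(st.st_atime_ns, st.st_mtime_ns))


def generate_if_needed(source: Path, filtered: Path, generate, force: bool = False) -> bool:
    """Call generate(source, filtered) unless a fresh cached copy exists.

    Returns True if the file was (re)generated, False if the cache was used.
    `force` plays the role of the -f flag mentioned above.
    """
    if not force and is_cache_fresh(source, filtered):
        return False
    generate(source, filtered)
    mark_cache_fresh(source, filtered)
    return True
```

With a scheme like this, a rerun after a failure only redoes the steps whose output is missing or whose timestamp no longer matches the source dump, which is why an already-filtered setup finishes quickly.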

I just ran it on mine where everything is already filtered, and it took just 45 seconds, with the following output:

INFO:root:Using cached file EOL/OneZoom_provider_ids.csv
INFO:root:Using cached file Wiki/wd_JSON/OneZoom_latest-all.json
INFO:root:Using cached file Wiki/wp_SQL/OneZoom_enwiki-latest-page.sql
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-04-views-ge-5-totals
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-05-views-ge-5-totals
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-06-views-ge-5-totals
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-07-views-ge-5-totals
INFO:root:Using cached file Wiki/wp_pagecounts/OneZoom_pagecounts-2020-08-views-ge-5-totals

hyanwong commented 7 months ago

Ah, perfect. It may be worth noting this in the instructions?

hyanwong commented 7 months ago

(and yes, I do have both files)

davidebbo commented 7 months ago

> Ah, perfect. It may be worth noting this in the instructions?

Indeed, and I just did add a paragraph in that section before the command.

hyanwong commented 7 months ago

My error was that I accidentally named the SQL directory wd_SQL not wp_SQL. I think it would be useful to commit a .gitignore in each of the wd and wp directories so that they are forced to exist on GH. What do you think @davidebbo ?

hyanwong commented 7 months ago

> > Ah, perfect. It may be worth noting this in the instructions?
>
> Indeed, and I just did add a paragraph in that section before the command.

Great, thanks. Sorry if I missed that.

davidebbo commented 7 months ago

> My error was that I accidentally named the SQL directory wd_SQL not wp_SQL. I think it would be useful to commit a .gitignore in each of the wd and wp directories so that they are forced to exist on GH. What do you think @davidebbo ?

Yes, that's definitely helpful. Added via #34
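
For anyone wanting to reproduce the placeholder idea locally, a minimal sketch follows. It does not show what #34 actually contains: the helper name `ensure_placeholder_dirs`, the directory list (taken from the paths in the log output above), and the .gitignore contents are all assumptions for illustration.

```python
from pathlib import Path

# Directory names as they appear in the log output above; the list itself is an assumption.
EXPECTED_DIRS = ["Wiki/wd_JSON", "Wiki/wp_SQL", "Wiki/wp_pagecounts"]

# Assumed .gitignore body: ignore every file in the directory except the .gitignore itself,
# so git tracks the (otherwise empty) directory but not the large dump files.
GITIGNORE_BODY = "*\n!.gitignore\n"


def ensure_placeholder_dirs(root, dirs=EXPECTED_DIRS):
    """Create each expected directory under `root` and add a .gitignore if missing.

    Returns the list of .gitignore paths that were newly written.
    """
    created = []
    for rel in dirs:
        directory = Path(root) / rel
        directory.mkdir(parents=True, exist_ok=True)
        gitignore = directory / ".gitignore"
        if not gitignore.exists():
            gitignore.write_text(GITIGNORE_BODY, encoding="utf-8")
            created.append(gitignore)
    return created
```

Because the directories are then part of the checkout, a misnamed folder such as wd_SQL in place of wp_SQL is easier to spot before starting a long run.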