OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images
MIT License

Armadillos, ratites, and pill bugs: feedback #74

Open hyanwong opened 2 months ago

hyanwong commented 2 months ago

I tried experimenting with the automatic wiki harvester, using armadillos as a test case:

get_wiki_images clade data/Wiki/wd_JSON/OneZoom_latest-all.json 847764

Here are the pictures. They aren't quite as good as I had hoped, but that might reflect how unusual the taxon is. There are some better Wikimedia images (e.g. ), but it does appear that some hand curation might be needed for some of these non-European groups. I guess the main question is whether assigning all these images a value of 35000 will displace existing, better OneZoom images on the tree:

75070 111846 148752 203033 244043 649549 743510 752691 902876 968416 1042139 1052814 1761577 1764523
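One rough way to put a number on the displacement question is sketched below. It assumes that the highest-rated image for a taxon is the one shown, which is an assumption here rather than something confirmed in this thread. The function name `would_displace` and the ratings in `existing` are made up for illustration; only the 35000 figure comes from the comment above.

```python
# Rating that the harvester would assign to every wiki image (from the comment above).
WIKI_DEFAULT_RATING = 35000


def would_displace(existing_ratings, new_rating=WIKI_DEFAULT_RATING):
    """Return True if new_rating is strictly higher than every existing rating.

    Assumption: the highest-rated image wins, so a new image only takes over
    when it beats all of the images already held for that taxon.
    """
    return all(new_rating > rating for rating in existing_ratings)


# Hypothetical, made-up ratings per taxon (NOT real OneZoom data).
existing = {
    "taxon_a": [42000, 30000],  # has one image rated above 35000
    "taxon_b": [25000],         # only a lower-rated image
    "taxon_c": [],              # no existing image, so nothing to displace
}

# Taxa that already have images, all of which would lose to a wiki image.
displaced = [
    taxon for taxon, ratings in existing.items()
    if ratings and would_displace(ratings)
]
print(displaced)
```

Running something like this over the real ratings would show how many existing images sit below 35000, and so how much hand curation is at stake.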

hyanwong commented 2 months ago

Here's another test, using:

get_wiki_images clade OneZoom_latest-all.json Palaeognathae

For comparison, here's what we have for that clade on OneZoom:

Screenshot 2024-07-02 at 17 11 57

A lot of these images don't have the artist/author information in a format we can ingest easily, e.g.

WARNING:get_wiki_images.py:Artist not found for 'Crypturellus_duidae.JPG': using 'Unknown artist'
WARNING:get_wiki_images.py:Artist not found for 'Crypturellus_obsoletus.jpg': using 'Unknown artist'
WARNING:get_wiki_images.py:Artist not found for 'Crypturellus_strigulosus.jpg': using 'Unknown artist'
WARNING:get_wiki_images.py:Artist not found for 'Tinamus_solitarius.jpg': using 'Unknown artist'
WARNING:get_wiki_images.py:Artist not found for 'Tinamus_guttatus.JPG': using 'Unknown artist'
WARNING:get_wiki_images.py:Artist not found for 'Crypturellus_parvirostris.JPG': using 'Unknown artist'
WARNING:get_wiki_images.py:Artist not found for 'Crypturellus_noctivagus.JPG': using 'Unknown artist'
WARNING:get_wiki_images.py:Artist not found for 'Crypturellus_undulatus.JPG': using 'Unknown artist'
WARNING:get_wiki_images.py:Artist not found for 'Nothura_minor.jpg': using 'Unknown artist'

These usually have e.g. "Given to the wikipedia by the author, Renato Caniatti" or something similar written on the page. I assume that someone will figure out a way to make this a bit more machine readable, and we just have to wait until this is sorted.
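As a stopgap, a free-text fallback could catch the most common phrasing. The sketch below is not part of get_wiki_images.py: the function name `guess_artist_from_text` and the pattern are hypothetical, and it only handles wording of the form "by the author, <name>" like the example quoted above.

```python
import re

# Hypothetical pattern: captures whatever follows "by the author," up to the
# next full stop, semicolon or line break. Other phrasings are not handled.
_AUTHOR_PATTERN = re.compile(
    r"\bby the author,?\s+(?P<name>[^.;\n]+)", re.IGNORECASE
)


def guess_artist_from_text(description, default="Unknown artist"):
    """Best-effort extraction of an artist name from free-text page content.

    Falls back to `default` (the same placeholder that appears in the
    warnings above) when the pattern does not match.
    """
    match = _AUTHOR_PATTERN.search(description)
    if match is None:
        return default
    name = match.group("name").strip()
    return name or default


print(guess_artist_from_text(
    "Given to the wikipedia by the author, Renato Caniatti"
))
```

Because the capture stops at the first full stop, a name containing initials (e.g. "J. Smith") would be truncated, so anything found this way would probably still need a manual check.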

My impression is that the wiki images are, on average, of roughly the same quality as what we have (maybe very slightly better), but that the ratings on our existing images probably make our existing stock a bit more useful, because we can pick the ones we know to be high quality for percolating upwards in the tree.

1265542 1266282 1267803 1268546 1270229 1270445 1271939 3501588 11179834 1262693 1264146 1017426 1089078 1262031 933682 935204 971229 998385 870092 870099 916853 847190 849184 852723 860730 730091 733120 733485 734829 742093 790940 793068 793573 834414 843182 843185 843266 375790 388464 428441 510118 602080 609761 667778 17592 93208 192044 244197 248520 251765

hyanwong commented 2 months ago

Finally, here are pill bugs (get_wiki_images clade OneZoom_latest-all.json Armadillidiidae). OneZoom only has 2 images in this taxon, so the 13 images that we can get from Wikidata are a distinct improvement, and the pictures are all pretty good quality, I think:

1300629 1646723 1813667 1891394 1927292 2126857 2224976 2331928 2585227 2610354 2682942 2946078 3433338