Flickr-Foundation / flickypedia

A tool to copy CC-licensed images from Flickr to Wikimedia Commons
https://www.flickr.org/tools/flickypedia/
Apache License 2.0

What issues are we running into in Backfillr? #389

Open alexwlchan opened 6 months ago

alexwlchan commented 6 months ago

This is a tracking ticket to highlight files (with examples!) where the bot is getting "confused" and doesn't know how to update the SDC.

alexwlchan commented 6 months ago

Unable to find Flickr photographer

Example: https://commons.wikimedia.org/wiki/File:%22Air_Cav!%22.jpg

This file links to https://www.flickr.com/photos/35703177@N00/28006848539/, but the photo page is a 404 and the user page is a 410.

In this case the bot adds the P12120 (Flickr photo ID) and P7482 (source of file) statements, but it can't add any of the other Flickr metadata.

It might be useful to add the Flickr user ID in P170 (creator), but it's not essential.
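Even when the photo and user pages are gone, the URL on the file page still carries both IDs. A minimal sketch of pulling them out, assuming the URL has the usual `/photos/<user>/<photo_id>/` shape; `parse_flickr_photo_url` and `looks_like_user_id` are hypothetical helpers, not part of Flickypedia:

```python
import re
from urllib.parse import urlparse


def parse_flickr_photo_url(url: str) -> dict[str, str]:
    """Split a Flickr photo URL into its user and photo ID components."""
    parts = [p for p in urlparse(url).path.split("/") if p]

    # Expected shape: /photos/<user>/<photo_id>/
    if len(parts) < 3 or parts[0] != "photos" or not parts[2].isdigit():
        raise ValueError(f"Not a recognised Flickr photo URL: {url!r}")

    return {"user": parts[1], "photo_id": parts[2]}


def looks_like_user_id(user: str) -> bool:
    """True for values shaped like 35703177@N00, False for path aliases."""
    return re.fullmatch(r"\d+@N\d+", user) is not None
```

The user component is only usable as an ID when it has the `digits@Ndigits` shape; a path alias can't be resolved once the user page returns a 410.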

alexwlchan commented 6 months ago

"Date taken" comes from the EXIF, not the Flickr metadata

Example: https://commons.wikimedia.org/wiki/File:%22Aircraft_revetments_constructed_from_empty_fuel_drums_at_Chu_Lai_-_September_1965.%22_-_49716360457.jpg

The date on WMC is 11 January 2018, 05:27:39, which is the created date in the EXIF of the JPEG file, but that's not when the photo was actually taken – more likely when it was digitised.

The Flickr photo has the actual date: September 1965.

I don't know how widespread this is, but this is the sort of thing we should fix.
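One way to measure how widespread this is would be to compare the date on WMC with the Flickr date, looking only at the components Flickr claims to know. A rough sketch, where `dates_conflict` and the `"year"`/`"month"`/`"day"` precision labels are made up for illustration:

```python
from datetime import datetime

# Which datetime fields are meaningful at each (hypothetical) precision label.
FIELDS_BY_PRECISION = {
    "year": ("year",),
    "month": ("year", "month"),
    "day": ("year", "month", "day"),
}


def dates_conflict(
    commons_date: datetime, flickr_date: datetime, flickr_precision: str
) -> bool:
    """Return True if the two dates disagree at the precision Flickr reports."""
    fields = FIELDS_BY_PRECISION[flickr_precision]
    return any(
        getattr(commons_date, field) != getattr(flickr_date, field)
        for field in fields
    )
```

For the example above, `dates_conflict(datetime(2018, 1, 11, 5, 27, 39), datetime(1965, 9, 1), "month")` is true, so the file would be flagged for a closer look.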

alexwlchan commented 6 months ago

Videos are weird

Example: https://commons.wikimedia.org/wiki/File:Strawberries_time-lapse.ogv

Example: https://commons.wikimedia.org/wiki/File:Lascar_VIDEO_-_Riding_the_Budavari_Siklo_to_the_Castle_Hill_top_(4543574073).jpg

This throws an exception when we try to retrieve the image info:

Traceback (most recent call last):
  File "/Users/alexwlchan/repos/flickypedia/.venv/bin/flickypedia", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/alexwlchan/repos/flickypedia/.venv/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexwlchan/repos/flickypedia/.venv/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/alexwlchan/repos/flickypedia/.venv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexwlchan/repos/flickypedia/.venv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexwlchan/repos/flickypedia/.venv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexwlchan/repos/flickypedia/.venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexwlchan/repos/flickypedia/src/flickypedia/backfillr/cli.py", line 312, in update_single_file
    run_with(list_of_filenames=[filename])
  File "/Users/alexwlchan/repos/flickypedia/src/flickypedia/backfillr/cli.py", line 93, in run_with
    photo = flickr_api.get_single_photo(photo_id=photo_id)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexwlchan/repos/flickypedia/.venv/lib/python3.12/site-packages/flickr_photos_api/api.py", line 371, in get_single_photo
    "width": int(s.attrib["width"]),
             ^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''

The width attribute is an empty string for these files, so the int() call fails. This is a bug, because we don't even use the width value that's being extracted.

alexwlchan commented 6 months ago

"Date taken" has an incorrect level of precision

Example: https://commons.wikimedia.org/wiki/File:%22A_Welcome_Visitor_to_Camels%27_Paradise%22.jpg

The date on Flickr is circa 1922, but it's been mapped to WMC as 1 January 1922. This looks like a bug in the original migration tool – Flickr returns a 1 Jan timestamp in the "date taken" field and stores the granularity separately. This is something we should be able to fix automatically.

alexwlchan commented 6 months ago

This seems like a fairly obvious thing to do, but I've only added it now: I'm tracking which properties have an unknown action, and which files they appear in.
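The tracking itself can be very simple. A minimal sketch of the idea, where `record_unknown` and `summarise` are hypothetical names and not the actual Backfillr code:

```python
from collections import defaultdict

# property ID -> set of Commons filenames where the action was unknown
unknown_actions: defaultdict[str, set[str]] = defaultdict(set)


def record_unknown(property_id: str, filename: str) -> None:
    """Remember that the bot didn't know what to do with this property."""
    unknown_actions[property_id].add(filename)


def summarise() -> list[tuple[str, list[str]]]:
    """List properties by number of affected files, most affected first."""
    return sorted(
        ((prop, sorted(files)) for prop, files in unknown_actions.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )
```

Sorting by the number of affected files puts the most common sources of confusion at the top, which is a reasonable order to work through them.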