bibanon / tubeup

Use yt-dlp to download video and upload to the Internet Archive with metadata.
https://pypi.python.org/pypi/tubeup/
GNU General Public License v3.0

Proposal: Identify core/essential metadata and add upload safeties for missing MD #279

Open vxbinaca opened 1 year ago

vxbinaca commented 1 year ago

See title.

Twitch extractor currently does not add channel metadata, TikTok though broken also did the same. This proposal is aimed at Youtube.

brandongalbraith commented 1 year ago

@vxbinaca What Youtube metadata are we currently not uploading to IA that yt-dlp is able to extract?

vxbinaca commented 1 year ago

> @vxbinaca What Youtube metadata are we currently not uploading to IA that yt-dlp is able to extract?

If it's able to be extracted it's in JSON. I'm talking about minimum metadata for items. Harkening back to the recent uploader_ID deficiency in yt-dlp, there was a similar one 5 or so years ago where the extractor was briefly not setting the creator value.

What I'm saying is, we'd need creator, video URL, and the time to be present to make a valid item where creator isn't tubeup.py. There should be a safety to prevent item creation if the creator metadata isn't present and so on and so on.

Not all missing metadata needs to halt item creation, but it would be helpful to identify what is core metadata and insert safeties to prevent uploading if it's missing.
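A minimal sketch of such a safety, assuming the metadata arrives as the info dictionary that yt-dlp produces. The choice of `uploader`, `webpage_url`, and `upload_date` as the core set, and the names `CORE_FIELDS`, `MissingCoreMetadata`, `missing_core_fields`, and `check_core_metadata`, are illustrative assumptions rather than tubeup's current behavior:

```python
# Hypothetical core set: creator, video URL, and time, as proposed above.
CORE_FIELDS = ("uploader", "webpage_url", "upload_date")


class MissingCoreMetadata(Exception):
    """Raised when an info dict lacks metadata considered essential."""


def missing_core_fields(info, required=CORE_FIELDS):
    """Return the required keys that are absent, None, or blank in info."""
    missing = []
    for key in required:
        value = info.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


def check_core_metadata(info, required=CORE_FIELDS):
    """Halt before item creation if any core field is missing."""
    missing = missing_core_fields(info, required)
    if missing:
        raise MissingCoreMetadata(
            "refusing to create item, missing: " + ", ".join(missing)
        )
```

Calling `check_core_metadata(info)` just before the upload step would stop item creation while leaving the downloaded files on disk for a later retry.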

brandongalbraith commented 1 year ago

After thinking about this for a bit, I'm leaning toward erring on the conservative side from a cultural preservation perspective. As long as enough metadata exists to upload an item (i.e. a unique identifier from the service the content is being retrieved from), tubeup should continue on so the artifacts are preserved. If metadata can be derived in the future from other data sources (or programmatic analysis of the uploaded artifacts), so be it.
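One way to express this more conservative policy is to split fields into a small blocking set and a larger warn-only set. The sketch below assumes yt-dlp's `id` and `extractor` keys are enough to identify the content; the two field lists and the function name `upload_decision` are hypothetical:

```python
# Assumed split; which fields belong in which tier is the open question here.
BLOCKING = ("id", "extractor")
WARN_ONLY = ("uploader", "webpage_url", "upload_date")


def upload_decision(info):
    """Return (ok_to_upload, problems) for a yt-dlp style info dict."""
    blocking_missing = [k for k in BLOCKING if not info.get(k)]
    warn_missing = [k for k in WARN_ONLY if not info.get(k)]
    problems = [f"missing {k}" for k in blocking_missing + warn_missing]
    # Only the absence of a service-side identifier blocks the upload;
    # everything else is reported but the artifact is still preserved.
    return (not blocking_missing), problems
```

With this split, an item lacking a creator would still upload, and the returned problem list could be logged or shown so the gap can be filled in later.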

vxbinaca commented 1 year ago

Given the growing issues with live chat extraction (broken on Youtube live videos IIRC, and for sure on Twitch), should live chat be considered core metadata, like manual subtitles (auto subs suck)?

Are channel URLs core metadata? Twitch doesn't give them; it gives the video URL instead.

mrpapersonic commented 1 month ago

> should live chat be considered a core metadata like manual subtitles (auto subs suck)?

If we're considering live chat to be metadata that is in-scope for tubeup to handle, then yes. Though I would argue that live chat should be in the same realm of comments, as in it's not really our problem to deal with, since realistically we should only be handling the video itself and its surrounding metadata.

> Are channel URLs a core metadata?

Only on platforms where that URL cannot be changed at will by the user (see: youtube and channel IDs). In other cases where the user is able to change the URL/ID at will it's not very useful at all. In fact, now tubeup uses the stupid channel handles youtube added which is actually fairly annoying in itself.
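A rough sketch of preferring an ID-based URL over a handle, assuming the yt-dlp info dict keys `extractor_key`, `channel_id`, `channel_url`, and `uploader_url`. That `channel_id` is the permanent identifier on YouTube and that `"Youtube"` is the extractor key are assumptions here, and `stable_channel_url` is a hypothetical helper name:

```python
def stable_channel_url(info):
    """Prefer an ID-based channel URL over a user-changeable handle URL."""
    # Assumption: on YouTube, channel_id cannot be changed by the user.
    if info.get("extractor_key") == "Youtube" and info.get("channel_id"):
        return "https://www.youtube.com/channel/" + info["channel_id"]
    # Assumption: channel_url, when present, is preferable to uploader_url.
    return info.get("channel_url") or info.get("uploader_url")
```

For sites where every URL can be renamed at will, the function still returns whatever is available, so the caller has to decide how much weight to give it.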

p.s. sorry for being like, a year late x)

vxbinaca commented 1 month ago

No, it's fine. What about extractors like BiliBili that didn't provide channel URLs (but I think do now)?

Edit: With Youtube live chat, I believe that's extracted into JSON, but Twitch's is broken with yt-dlp. Do all sites have live chat? No. Do all of them have creator metadata? Yes, mostly.

A current example of this is the OnlyFans TV extractor which is all kinds of messed up right now and not routing metadata properly. Take any OFTV video and try to rip it.

mrpapersonic commented 1 month ago

> What of extractors that didn't provide (but I think do now) like BilliBilli like channel URLs?

I'm not sure really. It's likely best though to not consider channel URLs as particularly important. We should warn if a URL could not be found though so the user can manually fix it and send a report to yt-dlp to fix/expand the extractors.
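The warn-but-continue behavior could look something like the following, using Python's standard `logging` module. The logger name and the helper `channel_url_or_warn` are made up for illustration, and the info dict keys are assumed to follow yt-dlp's naming:

```python
import logging

logger = logging.getLogger("tubeup.sketch")  # hypothetical logger name


def channel_url_or_warn(info):
    """Return a channel URL if one exists, else log a warning and return None."""
    url = info.get("channel_url") or info.get("uploader_url")
    if not url:
        # The upload is not blocked; the user is only told what to fix and report.
        logger.warning(
            "No channel URL found for %s (extractor: %s); consider fixing the "
            "item by hand and reporting the gap to yt-dlp.",
            info.get("id", "unknown id"),
            info.get("extractor", "unknown extractor"),
        )
    return url
```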

> Take any OFTV video and try to rip it.

no offense but I wouldn't touch that website with a ten foot pole lmao

vxbinaca commented 4 weeks ago

OFTV is all non-nude content. It's public facing and it's exclusive: lots of cooking classes or game stream clips.

vxbinaca commented 2 weeks ago

@mrpapersonic so the issue is that we put a fail-safe in place years ago to prevent items with a blank creator from being rejected by IA. The creator tag ends up blank because some extractors (LBRY/Odysee, Instagram posts containing multiple videos, many others) misdirect the Creator tag or don't have it. The fail-safe was putting "tubeup.py" into the creator tag. The alternative was item creation failing and the video and metadata being stuck on the user's disk.

What I want to do is find the permutations of the creator tag and account for them, using a few other extractors as test cases to work out what's breaking, because I've been (rightly) telling people for years that the problem isn't us, it's the extractor not naming or accounting for metadata properly.

Anyway last bump on this for a while:

I want to define the minimum metadata elements we need to upload an item, so I can go to the yt-dlp people and ask them to modify their standard for accepting an extractor, and then I or a few of us can go and find which extractors are broken and flag them to meet those minimum standards. Either that, or we use else-if checks to try another metadata element we think will correctly tag an item. This way, there are fewer problems in the future like OFTV or the minor problems with the Odysee or Instagram extractors.
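The else-if idea could be sketched as an ordered fallback over candidate keys, ending in the existing "tubeup.py" fail-safe. The key names below follow yt-dlp's info dict, but their ordering and the names `CREATOR_KEYS`, `FALLBACK_CREATOR`, and `resolve_creator` are assumptions for illustration, not what tubeup does today:

```python
# Candidate keys, tried in order. The ordering is an illustrative assumption.
CREATOR_KEYS = ("uploader", "channel", "creator", "uploader_id", "channel_id")
FALLBACK_CREATOR = "tubeup.py"  # the fail-safe value described in this thread


def resolve_creator(info):
    """Return (creator, source_key); source_key is None if the fail-safe was used."""
    for key in CREATOR_KEYS:
        value = info.get(key)
        # Skip missing, None, non-string, and whitespace-only values.
        if isinstance(value, str) and value.strip():
            return value.strip(), key
    return FALLBACK_CREATOR, None
```

Returning the key that supplied the value makes it possible to log which extractors fall through to the fail-safe, which is the kind of evidence that could be taken to the yt-dlp maintainers.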