Broaden regex to handle other URLs

aakatov / anki-media-internalizer

This Anki addon finds http references, downloads files into the internal local storage and updates the references.

https://ankiweb.net/shared/info/221033553


Broaden regex to handle other URLs #3

Closed mikehardy closed 6 years ago

mikehardy commented 6 years ago

AnkiDroid had a user file an issue - ankidroid/Anki-Android/issues/4741 - where a URL like [sound:http://blah.blah.com/smoothly.mp3?type=1] didn't sync correctly, and the root cause was that AnkiDesktop HTML-encoded the URL (instead of URL-encoding it).

However it's possible to take these notes synced from AnkiDroid and run the media internalizer on them, but only if the regex matches them.

When I changed the regex to [sound:(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)[^>]] (apologies if Markdown mangles that, but you can also see it at the end of the related AnkiDesktop issue), media internalizer worked just fine.
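For illustration, here is a minimal Python sketch of matching remote URLs inside [sound:...] tags. The pattern and the names SOUND_URL_RE and find_sound_urls are assumptions made for this example; this is not the addon's actual regex, and it is deliberately simpler than the one quoted above.

```python
import re
from typing import List

# Hypothetical pattern (not the addon's real one): capture an http(s) URL
# inside a [sound:...] tag, stopping at whitespace or the closing bracket.
SOUND_URL_RE = re.compile(r"\[sound:(https?://[^\]\s]+)\]")


def find_sound_urls(field: str) -> List[str]:
    """Return every remote URL referenced by a [sound:...] tag in a field."""
    return SOUND_URL_RE.findall(field)


field = "front [sound:http://blah.blah.com/smoothly.mp3?type=1] back [sound:local.mp3]"
print(find_sound_urls(field))
```

Local references such as [sound:local.mp3] are skipped because the pattern requires an http:// or https:// prefix.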

It seems that media internalizer could be even more useful than it is if it handled more classes of URLs such as this sound: one?

aakatov commented 6 years ago

Ok, I will do it. Is "[sound]" the only tag that should be added? I looked through the discussions about this issue and it seems like my addon also needs to unescape HTML in the URL before processing it. I did a test and it looks like the URLs it gets from cards are HTML-escaped. Thanks for pointing this out.
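A rough sketch of the unescaping step, using only the Python standard library. The helper name unescape_url is hypothetical, and the example assumes the field stores the ampersand as an HTML entity, as described above.

```python
import html


def unescape_url(raw: str) -> str:
    """Turn an HTML-escaped URL from a note field back into a plain URL."""
    return html.unescape(raw)


escaped = "http://dict.youdao.com/dictvoice?audio=smoothly&amp;type=1"
print(unescape_url(escaped))
```

For this particular URL, passing an already unescaped copy through the helper leaves it unchanged, but that is not guaranteed for every query string, so real code may need to know which form it is holding.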

mikehardy commented 6 years ago

That's cool, great! I think just images and sound as media. The URLs may also be unescaped, as AnkiDroid does not post-process edits in the fields.

aakatov commented 6 years ago

There is a bug in Anki that stops me from implementing this. Anki doesn't play an audio file if it has "&" (and maybe other special symbols) in its filename. Reproduction steps:

Save http://dict.youdao.com/dictvoice?audio=smoothly&type=1 locally under the name "audio&.mp3"
Insert it in a note (press F3).
Save a note.
Go to the deck and study it or open this card in Browser.

Expected result: Anki plays the sound.
Actual result: Anki doesn't play the sound.

Media Internalizer relies on the Anki method MediaManager.writeData for saving media internally. writeData returns the filename under which the file was stored. For the URL http://dict.youdao.com/dictvoice?audio=smoothly&type=1 it creates a file "dictvoiceaudio=smoothly&type=1". The addon replaces "[sound:http://dict.youdao.com/dictvoice?audio=smoothly&type=1]" with "[sound:dictvoiceaudio=smoothly&type=1]", which cannot be played because of this bug.

mikehardy commented 6 years ago

Man, that's irritating. I see what you mean now: my own personal edit of the regex had internalized the file, but I didn't test enough. It has the same problem you describe (it doesn't play), and it's because of this same HTML-encoding issue (it's an '&amp;' in the raw HTML if you look).

The only thing I can think of is to have media-internalizer call a URL-encode function prior to asking Anki to writeData - it looks to me like URL-encoded URLs (at least this one) survive a secondary HTML-encoding without change, so this may work?
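A small sketch of why URL-encoding first might help: percent-encoded text contains none of the characters that HTML-encoding rewrites, so a second HTML-encoding pass leaves it untouched. Whether Anki then plays a file whose name contains "%" is an assumption that this sketch does not test.

```python
import html
from urllib.parse import quote

# Filename that writeData reportedly produced for the youdao URL.
name = "dictvoiceaudio=smoothly&type=1"

# Percent-encode everything outside the unreserved set, including "&" and "=".
encoded = quote(name, safe="")
print(encoded)

# HTML-encoding changes the raw name but not the percent-encoded one.
print(html.escape(name))
print(html.escape(encoded) == encoded)
```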

mikehardy commented 6 years ago

BTW - this is specifically NOT considered a bug in Anki. Because the "field" in a "note" is an internal portion of an HTML document, they actually are required to HTML-encode fields and things that will go in fields. That was what I gathered from the bug I linked as "the related AnkiDesktop issue" above. While it's irritating to me from the perspective of wanting it to work easily, I think their stance on the issue - that they must HTML-encode - is correct, thus the suggestion of pre-URL-encoding in order to generate filenames that can survive HTML-encoding.

aakatov commented 6 years ago

I did it. The addon now just strips off any query string from the filename. So, for http://dict.youdao.com/dictvoice?audio=smoothly&type=1 it would be just "dictvoice".
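One way to derive such a filename, as a sketch with the standard library only. The helper name filename_from_url is hypothetical and this is not necessarily how the addon does it.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path  # e.g. "/dictvoice", without "?audio=..."
    return posixpath.basename(path)


print(filename_from_url("http://dict.youdao.com/dictvoice?audio=smoothly&type=1"))
print(filename_from_url("http://blah.blah.com/smoothly.mp3?type=1"))
```

A side effect of dropping the query string is that different query strings on the same path map to the same name, so the caller still has to rely on whatever collision handling the storage layer provides.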

I still think that is a bug in Anki. "&" is not a reserved character in Windows and Unix filesystems, so it can be used in filenames. A user can place such a file onto a card by using the standard Anki attach media mechanism. But after that it's broken. Probably, Anki should HTML-decode a file path before playing it.
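The mismatch can be illustrated in a few lines of Python. This only models the encoding round trip under the assumption that the field stores the HTML-encoded name; it does not call any Anki code.

```python
import html

on_disk = "audio&.mp3"             # name of the file in the media folder
in_field = html.escape(on_disk)    # what an HTML-encoded field would hold

print(in_field)                    # differs from the on-disk name
print(in_field == on_disk)         # lookup by the raw field text fails
print(html.unescape(in_field) == on_disk)  # decoding first restores the name
```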

aakatov commented 6 years ago

I created pull request https://github.com/dae/anki/pull/218 that fixes this bug.