earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

filter_external_links truncates colon at the end of URL #333

Open harej opened 2 months ago

harej commented 2 months ago

Test case:

import mwparserfromhell

wikitext = """
<ref>{{cite news | first=109th Congress, 1st Session | last=U.S. Senate |  title= S. 1033, Secure America and Orderly Immigration Act | date=[[May 12]] [[2005]] |  url =http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: | work =Thomas |  accessdate = 2007-09-30 | }}</ref>
"""

parsed = mwparserfromhell.parse(wikitext)
print(parsed.filter_external_links())

What I get: ['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033']

What I should get: ['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:'], with the colon at the end.

harej commented 2 months ago

Yes, that's a valid URL, or at least it was nearly 20 years ago. https://web.archive.org/web/20080918055001/http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:

(You may need to copy that URL with the colon into the address bar manually)

lahwaacz commented 2 weeks ago

See the difference here:

import mwparserfromhell

wikitext1 = "http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:"
wikitext2 = "[http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: foo]"

parsed1 = mwparserfromhell.parse(wikitext1)
parsed2 = mwparserfromhell.parse(wikitext2)
print(parsed1.filter_external_links())
print(parsed2.filter_external_links())

Which gives:

['http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033']
['[http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033: foo]']

Note that this is consistent with how MediaWiki behaves :shrug:
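For illustration, here is a rough sketch of the trailing-punctuation rule as I understand it for bare (unbracketed) links. The character set and the parenthesis special case are my assumptions about MediaWiki's behavior, not code copied from MediaWiki or mwparserfromhell, and `split_free_link` is a made-up helper name:

```python
# Assumption: characters treated as trailing punctuation on a bare link.
TRAILING_PUNCTUATION = ",;.:!?"


def split_free_link(url):
    """Split a bare URL into (link, trailing punctuation left as plain text)."""
    trailing = TRAILING_PUNCTUATION
    # Assumption: a closing parenthesis only counts as trailing punctuation
    # when the URL contains no opening parenthesis.
    if "(" not in url:
        trailing += ")"
    link = url.rstrip(trailing)
    return link, url[len(link):]


print(split_free_link("http://thomas.loc.gov/cgi-bin/bdquery/z?d109:SN01033:"))
```

Under these assumptions the final colon ends up outside the link, which matches the first result above. A bracketed link has an explicit end delimiter, so no such guessing is needed there.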

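In your snippet, the URL is the value of a template parameter. mwparserfromhell does not expand templates, so it can't know that the url parameter ends up inside square brackets once the template is rendered. It therefore treats the URL as a bare link and leaves the trailing colon out.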