CXuesong / MwParserFromScratch

A basic .NET Library for parsing wikitext into AST.
Apache License 2.0

External link not parsed #13

Closed: RobSchoenaker closed this issue 4 years ago

RobSchoenaker commented 4 years ago

Example: [[Bestand:Bundesarchiv Bild 146III-373, Modell der Neugestaltung Berlins ("Germania").jpg|miniatuur|260px|right| Schaalmodel van de [[Welthauptstadt Germania]], 1939]]

This is a link on this particular page: https://nl.wikipedia.org/wiki/Albert_Speer

With the code:

var ast = LoadAndParse(fileName.Trim(' ', '\t', '"'));
var text = ast.ToPlainText(NodePlainTextOptions.RemoveRefTags);

I would expect the text to read: Schaalmodel van de Welthauptstadt Germania, 1939

I have been trying to get this sorted, but I am kind of lost in the code...

CXuesong commented 4 years ago

I think there is currently no parsing rule for images (the File: namespace). I'll try adding one and hopefully get it done before the end of the week.

RobSchoenaker commented 4 years ago

Would it be an idea to have an option for specifying these namespaces? Their names are language-dependent.

CXuesong commented 4 years ago

Well, that's a good point! I think I will add a new configuration option that lets you specify such namespace names.

Btw, there is a related discussion on earwig/mwparserfromhell#136.

RobSchoenaker commented 4 years ago

I read the discussion; it is indeed the same issue. I think it would make sense to have a static language class for these situations. I can provide the Dutch version based on my findings across all Wikipedia articles.

CXuesong commented 4 years ago

The updated ETA is before the end of next week 😂

CXuesong commented 4 years ago

Published v0.3.0-int.3.

See the following snippet for an example of how to use WikitextParserOptions to customize which namespace prefixes are treated as the File: namespace. The defaults are ["File", "Image"]. https://github.com/CXuesong/MwParserFromScratch/blob/f0dac824c8d91f58ffa18425262d153f323b36bd/UnitTestProject1/BasicParsingTests.cs#L158-L162
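For readers who cannot open the link, here is a minimal sketch of what such a configuration might look like for the Dutch Wikipedia. It is not copied from the linked test. It assumes that the option is exposed as a settable ImageNamespaceNames property on WikitextParserOptions, that the supplied list replaces the defaults rather than extending them, and that "Bestand" and "Afbeelding" are the Dutch names for the File: namespace. Please check the linked unit test for the authoritative usage.

```csharp
using System;
using MwParserFromScratch;
using MwParserFromScratch.Nodes;

// Assumption: the option is named ImageNamespaceNames (verify against the linked test).
var parser = new WikitextParser
{
    Options = new WikitextParserOptions
    {
        // Keep the defaults and add the (assumed) Dutch names for the File: namespace.
        ImageNamespaceNames = new[] { "File", "Image", "Bestand", "Afbeelding" },
    },
};

// A shortened, illustrative variant of the link from the original report.
var ast = parser.Parse(
    "[[Bestand:Example.jpg|miniatuur|Schaalmodel van de [[Welthauptstadt Germania]], 1939]]");

// Print the plain text produced by the parser for this input.
Console.WriteLine(ast.ToPlainText(NodePlainTextOptions.RemoveRefTags));
```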

Additionally, you may use CanonicalName, CustomName, and Aliases provided in WikiClientLibrary.Sites.NamespaceInfo to retrieve the valid live namespace names on a MW site, if you are using WikiClientLibrary.

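using WikiClientLibrary;
using WikiClientLibrary.Client;
using WikiClientLibrary.Sites;

// Locate the MediaWiki API endpoint of the Dutch Wikipedia.
var client = new WikiClient();
var endpointUrl = await WikiSite.SearchApiEndpointAsync(client, "nl.wikipedia.org");
var site = new WikiSite(client, endpointUrl);
// Wait until the site information (including its namespaces) has been loaded.
await site.Initialization;

// NamespaceInfo for the File: namespace; inspect CanonicalName, CustomName, and Aliases.
var fileNamespace = site.Namespaces[BuiltInNamespaces.File];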
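
To turn those three properties into a single list of prefixes, something like the following could work. This is only a sketch: NamespaceNames.Collect is a hypothetical helper, not part of WikiClientLibrary or MwParserFromScratch, and it assumes namespace names should be compared case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not part of WikiClientLibrary or MwParserFromScratch.
public static class NamespaceNames
{
    // Merges the canonical name, the site-specific name, and any aliases
    // into one list, skipping empty entries and case-insensitive duplicates.
    public static List<string> Collect(
        string canonicalName, string customName, IEnumerable<string> aliases)
    {
        return new[] { canonicalName, customName }
            .Concat(aliases ?? Enumerable.Empty<string>())
            .Where(name => !string.IsNullOrWhiteSpace(name))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

With the NamespaceInfo retrieved from site.Namespaces[BuiltInNamespaces.File], the call would be NamespaceNames.Collect(ns.CanonicalName, ns.CustomName, ns.Aliases), where ns is that NamespaceInfo. The resulting list can then be passed to the parser options.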


RobSchoenaker commented 4 years ago

This is perfect. I will complete this for the Dutch (NL) Wikipedia as I find the namespaces. It will take some time though :)