appledora / mwparserfromhtml

An unofficial mirror of our repo for the `mwparserfromhtml` package, a Python library for working with the HTML dumps. Since this is only a mirror, please do not open pull requests here.
https://pypi.org/project/mwparserfromhtml/
MIT License

Determine the markers used for transcluded elements #10

Closed appledora closed 2 years ago

appledora commented 2 years ago

Some transcluded links don't carry any marker that would let us identify them; they have the same format as a standard WikiLink: <a href="./Dictionary_of_National_Biography" rel="mw:WikiLink" title="Dictionary of National Biography"> Dictionary of National Biography </a>

Corresponding article : William Clark

In the same article, <a class="mw-disambig" href="./William_Clark_(disambiguation)" rel="mw:WikiLink" title="William Clark (disambiguation)"> William Clark (disambiguation) </a> is both a disambiguation link and a transclusion. The class attribute mw-disambig lets us identify the disambiguation, but not the transclusion.

On closer inspection, it seems we need to look at the context in which the link is placed, i.e.:

<div about="#mwt1" class="hatnote navigation-not-searchable" id="mwAw" role="note">
    For other people named William Clark, see
    <a class="mw-disambig" href="./William_Clark_(disambiguation)" rel="mw:WikiLink" title="William Clark (disambiguation)">
     William Clark (disambiguation)
    </a>
    .
</div>

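For the same element, if we consider its parent div tag, we see that it has role="note" and a class list that includes hatnote, plus about="#mwt1". The div is preceded by a style tag that carries the same about="#mwt1" together with typeof="mw:Extension/templatestyles mw:Transclusion" and the template details in data-mw, so this is likely where the transclusion itself is declared: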

<style about="#mwt1" data-mw='{"parts":[{"template":{"target":{"wt":"Other people","href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1033289096" id="mwAg" typeof="mw:Extension/templatestyles mw:Transclusion">
    .mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}
   </style> 
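
The observation above suggests one possible check: a link can be treated as transcluded when it, or one of its ancestors, shares an `about` id with an element whose `typeof` includes `mw:Transclusion`. Below is a minimal sketch of that idea using only the Python standard library. It is not the approach implemented in `mwparserfromhtml`; the names `TransclusionLinkFinder` and `find_transcluded_links` are hypothetical, and it assumes that the `about` grouping seen in this one example holds for other articles too.

```python
from html.parser import HTMLParser

# HTML void elements never get an end tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TransclusionLinkFinder(HTMLParser):
    """Hypothetical helper: records wikilinks and the `about` ids around them."""

    def __init__(self):
        super().__init__()
        self.stack = []                # (tag, about) for every open element
        self.transclusion_ids = set()  # about ids on typeof="... mw:Transclusion"
        self.links = []                # (href, about ids of the link + ancestors)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        about = attrs.get("about")
        if about and "mw:Transclusion" in (attrs.get("typeof") or "").split():
            self.transclusion_ids.add(about)
        if tag == "a" and "mw:WikiLink" in (attrs.get("rel") or "").split():
            abouts = {a for _, a in self.stack if a}
            if about:
                abouts.add(about)
            self.links.append((attrs.get("href"), abouts))
        if tag not in VOID_TAGS:
            self.stack.append((tag, about))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_transcluded_links(html):
    """Return (href, is_transcluded) for every mw:WikiLink anchor in `html`."""
    parser = TransclusionLinkFinder()
    parser.feed(html)
    parser.close()
    # Resolve after the full parse so the marker may appear before or after the link.
    return [(href, bool(abouts & parser.transclusion_ids))
            for href, abouts in parser.links]
```

Calling `find_transcluded_links` on the hatnote markup above (the style tag followed by the div) should flag `./William_Clark_(disambiguation)` as transcluded, while a link whose ancestors carry no matching `about` id would not be flagged. Matching on `about` rather than on the hatnote class keeps the check independent of any particular template.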

For reference, check thread : https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/8#note_9522

appledora commented 2 years ago

In GitLab by @geohci on Aug 24, 2022, 01:13

@appledora I think we can close this now that the depth-first search approach has been introduced? Or do you want to leave it open for now?

appledora commented 2 years ago

We can close this, I think. Right now we are not facing any trouble with transclusion detection.