earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

Doc about parsing wiki tables #93

Open lmorillas opened 9 years ago

lmorillas commented 9 years ago

The docs say that the new release can parse wiki tables, but this isn't documented. How can I parse a wiki table? Is there a special filter?

earwig commented 9 years ago

There's no special filter (right now) – tables are just parsed as a special kind of HTML tag, so you can use filter_tags. For example:

>>> import mwparserfromhell
>>> text = """{|
... |-
... | testing
... |}"""
>>> code = mwparserfromhell.parse(text)
>>> code.filter_tags(matches=lambda node: node.tag == "table")
['{|\n|-\n| testing\n|}']
>>> print(code.get_tree())
<
      table
>
      <
            tr
      >
            <
                  td
            >
                   testing\n
            </
                  td
            >
      </
            tr
      >
</
      table
>

It's a little clunky if you actually want to manipulate the tables... I'm not sure what proper methods would even look like. At any rate, I'm leaving this open as a reminder to document how these less-obvious features work.

shrikantp-vbt commented 7 years ago

I want to extract data from the wikitable at https://en.wikipedia.org/wiki/OHL_Classic_at_Mayakoba, but only from the rows with colspan = 10. That is, I want the current and all previous names of the tournament, e.g. 1) OHL Classic at Mayakoba 2) Mayakoba Golf Classic 3) Mayakoba Golf Classic at Riviera Maya-Cancun. Is that possible using filter_tags?

I also want to do some validation: I only want to look at the winners table, since there can be other tables on the page that I don't want to look at. Within that table, I only want to look at rows that span all the columns and get their text.

Let me know the approach using the code.filter_<>() methods. Or do you think it would be easier to use a Python regex on the whole page's markup?

suhassumukh commented 5 years ago

Have the wiki table manipulation methods been updated? Documented? Is it the same situation for lists? I was looking at methods that can access individual table cells or list elements.

earwig commented 5 years ago

The only methods we currently have for this are the normal HTML tag traversal methods. What you want to do should be possible with those, but it’s not ideal. I would like to add more tailored things in the future, but this hasn’t happened yet.

TheSandDoctor commented 5 years ago

I would also second a feature like this.

MeitarR commented 2 years ago

@earwig For manipulation, the right way will probably be to implement something like smart_list, but for tables.

ryandward commented 9 months ago

The filter_tags approach actually works fairly well, but it runs into problems with nested tables. smart_list is incredible, and I would love to see something implemented in the package to handle tables similarly. Currently I am recursing on my own; the code is becoming ugly, but I think it's manageable.

Thanks for the package @earwig.

ryandward commented 9 months ago

I'll leave this here in case anyone wants it. It saves having to clean up the HTML elements that get split up.

import html

import mwparserfromhell
from mwparserfromhell.nodes import Tag, Text, Wikilink


def wiki_link_to_html(node):
    # Use the link's display text if it has one, otherwise its title.
    text = str(node.text if node.text is not None else node.title)
    return f'<a href="#">{html.escape(text)}</a>'


def wiki_table_to_html(node):
    # Walk table -> rows -> cells by hand. Only plain text and wikilinks are
    # rendered; other nodes (templates, nested tables, ...) are dropped.
    result = ['<table>']
    for row in node.contents.nodes:
        if isinstance(row, Tag) and row.tag == 'tr':
            result.append('<tr>')
            for cell in row.contents.nodes:
                if isinstance(cell, Tag) and cell.tag in ['td', 'th']:
                    result.append(f'<{cell.tag}>')
                    for content in cell.contents.nodes:
                        if isinstance(content, Text):
                            # Escape &, < and > so the output is valid HTML.
                            result.append(html.escape(str(content)))
                        elif isinstance(content, Wikilink):
                            result.append(wiki_link_to_html(content))
                    result.append(f'</{cell.tag}>')
            result.append('</tr>')
    result.append('</table>')
    return ''.join(result)


wiki_text = """
{| class="eoTable2 sortable" style="text-align:center"
|-
! Spell !! Level !! Component A !! Component B !! Component C !! Trivial !! Mana Efficiency (Damage per Mana) Assumes 4 Targets & No Resists
|-
| [[Pillar of Fire]] || 16 || [[Rune of Nagafen]] || [[Rune of Proximity]] || || 22 || 3.6
|-
| [[Project Lightning]] || 16 || [[Rune of Fulguration]] || [[Rune of Periphery]] || || 21? || PBAoE
|}"""

wikicode = mwparserfromhell.parse(wiki_text)
# A string passed as `matches` is a regex tested against the node's full text,
# so compare the tag name explicitly to select tables.
tables = wikicode.filter_tags(matches=lambda n: n.tag == 'table')
html_text = wiki_table_to_html(tables[0])

print(html_text)