bencabrera / grawitas

Grawitas is a lightweight, fast parser for Wikipedia talk pages that takes the raw Wikipedia-syntax and outputs the structured content in various formats.
MIT License

How to pass the extracted talk file to grawitas_cli_core #12

Closed AiliAili closed 6 years ago

AiliAili commented 6 years ago

Hi, I am using grawitas_cli_core to extract comments from an already extracted talk page, but it always returns null. How can I make grawitas_cli_core accept already extracted talk pages?

I tried to parse this one:

{{Film|British-task-force=yes |class=start |Canadian-task-force=yes}}

== Blacklisted Links Found on the Main Page ==

Cyberbot II has detected that page contains external links that have either been globally or locally blacklisted. Links tend to be blacklisted because they have a history of being spammed, or are highly innappropriate for Wikipedia. This, however, doesn't necessarily mean it's spam, or not a good link. If the link is a good link, you may wish to request whitelisting by going to the [[MediaWiki talk:Spam-whitelist|request page for whitelisting]]. If you feel the link being caught by the blacklist is a false positive, or no longer needed on the blacklist, you may request the regex be removed or altered at the [[MediaWiki talk:Spam-blacklist|blacklist request page]]. If the link is blacklisted globally and you feel the above applies you may request to whitelist it using the before mentioned request page, or request it's removal, or alteration, at the [[meta:Talk:Spam Blacklist|request page on meta]]. When requesting whitelisting, be sure to supply the link to be whitelisted and wrap the link in nowiki tags. The whitelisting process can take its time so once a request has been filled out, you may set the invisible parameter on the tag to true. Please be aware that the bot will replace removed tags, and will remove misplaced tags regularly.

'''Below is a list of links that were found on the main page:'''

http://www.theclassicalshop.net/mp3samples/CH/CHAN241-1201T01D02.wma :''Triggered by \btheclassicalshop.net\b on the local blacklist''

If you would like me to provide more information on the talk page, contact [[User:Cyberpower678]] and ask him to program me with more info.

From your friendly hard working bot.—[[User:Cyberbot II|cyberbot II]] [[User talk:Cyberbot II|Notify]]Online 18:39, 8 December 2013 (UTC)

It always returns null.

bencabrera commented 6 years ago

Hi,

sorry for the delayed response. I'm not sure I understand everything you say. What do you mean by "make grawitas_cli_core [...] accept already extracted talk pages"? Do you mean how to parse a talk page that you already have as a file somewhere? grawitas_cli_core is indeed the correct program for that.

If you are having a problem with a specific talk page, maybe first try another one to verify that the program works in principle. If you upload the talk page file that is not working to this discussion (instead of copying it into the text, which can mess up the formatting), I can also try to see why it is not parsed correctly.

Ben

AiliAili commented 6 years ago

Thanks for your reply. I found that the parser always returns null for the text above. I think one reason is that the talk page does not contain any comment markers (such as a leading : for indentation, or a signature), so the parser returns null. It works well for talk pages that contain an obvious comment format.
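
The hypothesis above can be checked on a file before handing it to the parser. This is a minimal Python sketch, assuming that comments are marked either by a leading `:` or by a signature timestamp in the usual English Wikipedia form. The regular expressions and the function names `find_comment_markers` and `looks_like_discussion` are hypothetical. They are not taken from Grawitas, which may detect comments differently.

```python
import re

# Assumed patterns for illustration only; not Grawitas's actual grammar.
# Signature timestamp such as "18:39, 8 December 2013 (UTC)".
TIMESTAMP = re.compile(r"\d{1,2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\)")
# Links to a user page or user talk page, e.g. "[[User:Name|...".
USER_LINK = re.compile(r"\[\[User(?: talk)?:[^\]|]+")
# Lines that start with one or more colons (reply indentation).
INDENTED = re.compile(r"^:+", re.MULTILINE)


def find_comment_markers(wikitext: str) -> dict:
    """Count the comment markers that appear in raw talk page wikitext."""
    return {
        "timestamps": len(TIMESTAMP.findall(wikitext)),
        "user_links": len(USER_LINK.findall(wikitext)),
        "indented_lines": len(INDENTED.findall(wikitext)),
    }


def looks_like_discussion(wikitext: str) -> bool:
    """Return True if at least one timestamp or indented line is present."""
    markers = find_comment_markers(wikitext)
    return markers["timestamps"] > 0 or markers["indented_lines"] > 0


if __name__ == "__main__":
    sample = "== Topic ==\nFirst remark. [[User:A|A]] 18:39, 8 December 2013 (UTC)\n:Reply."
    print(find_comment_markers(sample))
    print(looks_like_discussion(sample))
```

Under these assumed patterns, a line ending in `18:39, 8 December 2013 (UTC)` counts as carrying a signature timestamp, so running the check on the exact input file may help narrow down whether missing markers are really the cause.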