Frenzie / myopera-backup

A Python script to grab posts and as much relevant metadata as possible from MyOpera.
GNU General Public License v2.0

Extracting data #5

Open Frenzie opened 11 years ago

Frenzie commented 11 years ago

Once all the data has been collected, it needs to be processed more thoroughly into useful bits and pieces. Earlier versions of the script already contained a few snippets that might be interesting or helpful for that.

From v0.1

# Decode HTML entities
# Thanks to http://stackoverflow.com/a/2087433
# HTMLParser.unescape() was removed in Python 3.9; html.unescape() (3.4+) replaces it.
import html
post_text = html.unescape(post_text)

From v0.2.1

import re

comments_regex = r'''
<div class="fpost.*?" id=".+?">
<a name="comment[0-9]+"></a><p class="posted">(?:<span class="unread">unread</span>)?<a href="findpost\.pl\?id=([0-9]+)" title="permanent link to post"> (.+?)</a>(?: <b>\((edited)\)</b>)?</p>
<div class="pad">
<div class="poster">
(?:<img src=".+?" width="72" height="29" alt="(.+?)" title=".+?" class="right">)?<a href=".+?"><img src=".+?" alt="" class="forumavatar"></a><p><b><a href=".+?"(?: title=".+?")?>(.+?)</a></b></p>
<p>.*?</p>
<p class="userposts">Posts: <a href=".+?">[0-9]+</a></p>
</div>
<div class="thepost">((?:\n)?.+?(?:<div class="forumpoll">.+?</div>)?)(?:<div class="sig">(.+?)(?:\n)?</div>)?(?:\n)?</div>'''

# re.DOTALL makes dot also match newlines
comments = re.findall(comments_regex, page, re.DOTALL)

###############
# Loop over the individual comments. Each match is a tuple holding the
# seven capture groups of comments_regex, in order.
for comment in comments:
    comment_id, timestamp, edited, user_status, user, post_text, signature = comment
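The seven values could also be given names up front and have their entities decoded in one go. This is only a minimal sketch, assuming Python 3.4+ for html.unescape; the Comment class, the to_comment helper, and the sample tuple (including its timestamp format) are hypothetical and not part of the script.

```python
import html
from typing import NamedTuple


class Comment(NamedTuple):
    """One match from comments_regex; field order follows its capture groups."""
    comment_id: str
    timestamp: str
    edited: str
    user_status: str
    user: str
    post_text: str
    signature: str


def to_comment(groups):
    """Label a 7-tuple from re.findall and decode HTML entities in the text fields."""
    comment = Comment(*groups)
    return comment._replace(
        post_text=html.unescape(comment.post_text),
        signature=html.unescape(comment.signature),
    )


# Hypothetical match, shaped like one element of `comments`.
# re.findall returns '' for optional groups that did not take part in the match.
sample = ("12345", "2013-11-02 14:03", "", "", "Frenzie", "Tom &amp; Jerry &lt;3", "")
parsed = to_comment(sample)
print(parsed.user, parsed.post_text)
```

Named fields make it harder to mix up post_text and signature than bare indices do.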

Frenzie commented 11 years ago

The MyOpera Enhancements UserJS comes with a treeToBBCode() function, which turns a DOM tree back into BBCode.

/*
* treeToBBCode() - parses the tree into bbcode
*/
function treeToBBCode(node){
    var bb = [];
    if( typeof node.item == 'function' ){
        for(var k=0,n;n=node[k++];)
            bb.push(treeToBBCode(n));
        return bb.join('');
    }

    if( node.getAttribute && node.getAttribute('userjsishidden')=='true' ){
        return;
    }

    switch(node.nodeType){
    case Node.ELEMENT_NODE:
        var nname = node.nodeName.toLowerCase();
        var def = treeToBBCode.defaults[nname];
        if( def ){
            //generic behavior
            bb.push(def.before||'');
            bb.push(treeToBBCode(node.childNodes));
            bb.push(def.after||'');
        }
        else{
            //special cases
            switch(nname){
            case 'a':
                if( node.href.indexOf("mailto:")==0 ){
                    bb.push('[EMAIL='+node.href.substring(7)+']');
                    bb.push(treeToBBCode(node.childNodes));
                    bb.push('[/EMAIL]');
                }
                else if( node.className.indexOf("attach")>=0 ){
                    bb.push('[ATTACH='+node.href+']');
                    bb.push(treeToBBCode(node.childNodes));
                    bb.push('[/ATTACH]');
                }
                else{
                    bb.push('[URL='+node.href+']');
                    bb.push(treeToBBCode(node.childNodes));
                    bb.push('[/URL]');
                }
                break;
            case 'img':
                var smileyCode = getSmileyCode(node);
                bb.push( smileyCode ? ' '+smileyCode+' ' : '[IMG='+node.src+']');
                break;
            case 'ol':
                var type = node.className.indexOf("alpha")>=0 ? 'a' : '1';
                // intentional fall-through: 'ol' shares the list handling below
            case 'ul':
                bb.push('[LIST'+(type?'='+type:'')+']');
                var lis = node.selectNodes('li');
                for(var k=0,li;li=lis[k++];)
                    bb.push('\n  [*] '+trim(treeToBBCode(li)));
                bb.push('[/LIST]');
                break;
            case 'span':
                //check for css properties
                var props=[
                    {name:'textDecoration',forceValue:'underline',before:'[U]',after:'[/U]'},
                    {name:'color',before:'[COLOR=@value]',after:'[/COLOR]'},
                    {name:'fontFamily',before:'[FONT=@value]',after:'[/FONT]'},
                    {name:'fontSize',before:'[SIZE=@value]',after:'[/SIZE]',values:{
                        'xx-small':1,
                        'x-small':2,
                        'small':3,
                        'medium':4,
                        'large':5,
                        'x-large':6,
                        'xx-large':7
                    }}
                ];
                var start='', end='';
                for(var k=0,p;p=props[k++];){
                    var value = trim(node.style[p.name]||'',' "');
                    if( ( p.forceValue && value==p.forceValue ) || ( !p.forceValue && value ) ){
                        start += p.before.replace('@value',(p.values ? p.values[value]:null) || value);
                        end += p.after;
                    };
                };
                //check for class attribute
                props=[
                    {name:'alignleft',before:'[ALIGN=left]',after:'[/ALIGN]'},
                    {name:'aligncenter',before:'[ALIGN=center]',after:'[/ALIGN]'},
                    {name:'alignright',before:'[ALIGN=right]',after:'[/ALIGN]'},
                    {name:'alignjustify',before:'[ALIGN=justify]',after:'[/ALIGN]'}
                ];
                for(var k=0,p;p=props[k++];){
                    if( node.className.indexOf(p.name)>=0 ){
                        start += p.before;
                        end += p.after;
                    };
                };
                bb.push(start);
                bb.push(treeToBBCode(node.childNodes));
                bb.push(end);
                break;
            case 'p':
                var ns = node.nextElementSibling||node.nextSibling;
                //detect quote
                if( node.className.indexOf("cite") >= 0 &&
                    ns &&
                    ns.nodeName.toLowerCase()=='blockquote' &&
                    ns.className.indexOf("bbquote") >= 0 ){
                    //TODO: user quote - this will break when the forums get localized !
                    ns.__userNameQuoted = node.textContent.replace(/.*originally\s+posted\s+by\s+/i,'').replace(/\s*\:$/,'');
                }
                else{
                    bb.push(treeToBBCode(node.childNodes));
                }
                break;
            case 'blockquote':
                if( node.className.indexOf("bbquote") >= 0 ){
                    bb.push('[QUOTE'+(node.__userNameQuoted?'='+node.__userNameQuoted:'')+']');
                    bb.push(treeToBBCode(node.childNodes));
                    bb.push('[/QUOTE]');
                }
                else{
                    bb.push(treeToBBCode(node.childNodes));
                }
                break;
            default:
                bb.push(treeToBBCode(node.childNodes));
                break;
            };
        }
        break;
    case Node.DOCUMENT_NODE:// 9
    case Node.DOCUMENT_FRAGMENT_NODE:// 11
        bb.push(treeToBBCode(node.childNodes));
        break;
    case Node.TEXT_NODE://3
    case Node.CDATA_SECTION_NODE:// 4
        var text = node.nodeValue;
        if (!node.selectSingleNode('ancestor::pre'))
            text = text.replace(/\n[ \t]+/g,'\n');
        bb.push(text);
        break;
    }
    return bb.join('');
};
treeToBBCode.defaults = {
    strong:{before:'[B]',after:'[/B]'},
    b:{before:'[B]',after:'[/B]'},
    i:{before:'[I]',after:'[/I]'},
    em:{before:'[I]',after:'[/I]'},
    s:{before:'[S]',after:'[/S]'},
    sup:{before:'[SUP]',after:'[/SUP]'},
    sub:{before:'[SUB]',after:'[/SUB]'},
    pre:{before:'[CODE]',after:'[/CODE]'},
    br:{before:'\n',after:''}
};
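
For the Python side, a rough counterpart might look like the following. It is only a sketch under stated assumptions: it covers the simple tag pairs from treeToBBCode.defaults, line breaks, and plain links, and leaves out the mailto, attachment, smiley, list, span, and quote handling. BBCodeConverter and html_to_bbcode are hypothetical names; only html.parser.HTMLParser comes from the standard library.

```python
from html.parser import HTMLParser

# Simple tag-to-BBCode pairs, copied from treeToBBCode.defaults.
SIMPLE_TAGS = {
    "strong": ("[B]", "[/B]"),
    "b": ("[B]", "[/B]"),
    "i": ("[I]", "[/I]"),
    "em": ("[I]", "[/I]"),
    "s": ("[S]", "[/S]"),
    "sup": ("[SUP]", "[/SUP]"),
    "sub": ("[SUB]", "[/SUB]"),
    "pre": ("[CODE]", "[/CODE]"),
}


class BBCodeConverter(HTMLParser):
    """Hypothetical converter: simple tags, <br> and plain links only."""

    def __init__(self):
        # convert_charrefs=True hands already-decoded text to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            self.parts.append("[URL=" + href + "]")
        elif tag in SIMPLE_TAGS:
            self.parts.append(SIMPLE_TAGS[tag][0])

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("[/URL]")
        elif tag in SIMPLE_TAGS:
            self.parts.append(SIMPLE_TAGS[tag][1])

    def handle_data(self, data):
        self.parts.append(data)


def html_to_bbcode(fragment):
    """Convert an HTML fragment to BBCode, ignoring unsupported tags."""
    converter = BBCodeConverter()
    converter.feed(fragment)
    converter.close()  # flush any text still buffered by the parser
    return "".join(converter.parts)


print(html_to_bbcode('<b>bold</b> and <a href="http://example.com/">a link</a><br>next'))
```

Unknown tags are dropped while their text content is kept, which mirrors the default branch of treeToBBCode().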

Frenzie commented 11 years ago

Reading specific lines without bothering with the whole file: http://stackoverflow.com/a/2081880 (not terribly relevant for these small files, but still)
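
One way to do that with the standard library is itertools.islice, which stops iterating once the last wanted line has been read. This is a sketch and not necessarily the approach from the linked answer; read_lines and the demo file are made up for illustration.

```python
import itertools
import os
import tempfile


def read_lines(path, start, stop):
    """Return lines start..stop-1 (0-based), stopping once line stop-1 is read."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, start, stop)]


# Hypothetical demo file with one field per line.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "post.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("title\nauthor\ndate\nbody\n")
    picked = read_lines(demo_path, 1, 3)

print(picked)  # ['author', 'date']
```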

Apparently better than os.walk for directory traversal: https://github.com/benhoyt/scandir
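
Since Python 3.5 the same functionality ships in the standard library as os.scandir, and os.walk is built on it, so the separate package is mainly needed on older versions. A minimal sketch of a recursive traversal, assuming Python 3.6+ for the context-manager form; iter_files and the demo directory layout are hypothetical.

```python
import os
import tempfile


def iter_files(root):
    """Yield the paths of all regular files below root, recursing with os.scandir."""
    with os.scandir(root) as entries:
        for entry in entries:
            # DirEntry caches the file type, so these checks usually need no extra stat call.
            if entry.is_dir(follow_symlinks=False):
                yield from iter_files(entry.path)
            elif entry.is_file():
                yield entry.path


# Hypothetical layout: one file at the top level, one in a subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for name in ("a.html", os.path.join("sub", "b.html")):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("")
    found = sorted(os.path.relpath(path, tmp) for path in iter_files(tmp))

print(found)
```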