CordySmith / PySmaz

A Python port of SMAZ small text string compression library

how to generate own DECODE = [.. ] table #2

Open · thestick613 opened this issue 9 years ago

thestick613 commented 9 years ago

I'm trying to use smaz to compress some URLs, but most of the words aren't even in English. My data also doesn't contain sequences like "s o" or " ha". I could probably replace some of those entries with ".html" or ".jpg" and other strings from the URLs, but the smarter idea seems to be generating a table from the data itself. Any ideas on that? Should I just extract the most common 2-3 byte sequences?
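(A minimal sketch of that counting idea, assuming a line-per-URL corpus in a hypothetical file `urls.txt`; the byte-savings weighting is an illustrative choice, not part of PySmaz:)

```python
from collections import Counter

def top_ngrams(lines, min_n=2, max_n=3, count=100):
    """Count every 2- and 3-character window in the corpus and rank
    sequences by the bytes they would save as one-byte tokens."""
    counts = Counter()
    for line in lines:
        for n in range(min_n, max_n + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    # Replacing an n-byte sequence with one token byte saves n - 1 bytes per hit.
    return sorted(counts, key=lambda s: counts[s] * (len(s) - 1), reverse=True)[:count]

with open('urls.txt') as f:  # hypothetical corpus, one URL per line
    print(top_ngrams(f.read().splitlines()))
```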

CordySmith commented 9 years ago

Hi,

Short answer: it's tricky. Basically you need something like a Huffman tree that tells you the probability of particular character sequences in a given sample corpus (https://raw.githubusercontent.com/CordySmith/PySmaz/master/tests/data/final-url-en.txt). Big sequences that occur very frequently are the best ones to spend bits on. Pick your sequences, and then hand-tune the table (you could use something like a genetic algorithm). Be very wary of over-fitting your test corpus; keep a secondary corpus as a quality check. I've actually got a couple of URL dictionaries I've been experimenting with (see the bottom of this mail).

One thing you have to be careful of is that the naive compressor will match the first token it sees, i.e. if 'appy', 'h', and 'ha' are in the dictionary and it sees 'happy', it will naturally choose 'ha' and then encode the rest as a raw string. I have an experimental version of the compressor that uses a 'largest fragment first' strategy, but it's not really ready for prime time, and careful dictionary selection will largely eliminate such problems. I'll try and push a few performance tweaks into the repository.

Cheers, Max.

```python
URLDECODER = [
    'http://', 'https://', 'http://www.', 'https://www.', 'http', 'ftp', '://', 'www.',

    # Common numbers, in particular dates
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
    '00', '01', '02', '03', '04', '05', '06', '07', '08', '09',
    '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
    '20', '21', '22', '23', '24', '25', '26', '27', '28', '29',
    '30', '31',
    '99', '199', '200', '201', '202', '203', '000', '0000',  # Years and big numbers
    '=0', '=1', '=2', '=3', '=4', '=5', '=6', '.1', '.2',  # IP addresses and article=012

    # Common extensions
    '.asp', '.aspx', '.php', '.jsp', '.shtml', 'cgi', 'cgi-bin/', '.cfm', 'flv', '.pl',
    '0.html', '1.html', '2.html', '3.html', 's.html',
    '.html', '.htm', '.jpg', '.png', '.gif', '.css', '.pdf', '.doc', '.xml', 'txt',

    # Common url parts
    'news', 'review', 'web', 'blog', 'blogs', 'search', 'content', 'wiki', 'index',
    'default', 'view', 'topic', 'profile', 'ocument', 'archive', 'article', 'articles', 'thread', 'print', 'history',
    'net', 'detail', 'glance', 'pages', 'find', 'usiness', 'eature', 'faq', 'mail', 'upload',
    'tip', 'issue', 'table', 'source', 'time', 'erson', 'day', 'resource', 'ublications', 'exec',
    'story', 'world',

    # Common word bits
    'oo', 'the', 'th', 'and', 'to', 'in', 'is', 'for', 'um', 'it',
    'er', 'an', 'ing', 'ion', 'ation', 'ati', 'sto', 'or', 'es', 'ar',
    'on', 'ti', 're', 'ne', 'en', 'te', 'st', 'le', 'ic', 'nt',
    'g/', 'as', '/a',

    # Usable values in urls
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'v', 'w', 'x', 'y', 'z', '=', '?', '@', '~',
    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
    'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
    'U', 'V', 'W', 'X', 'Y', 'Z', '!', '$', '&', "'",
    '(', ')', '*', '+', ',', '-', '.', '/', ':', ';',
    '//~',

    # Common url bits
    's.com/', 'e.com/', 't.com/', '.com', '.com/', '.com/article',
    '.net', '.org', '.co.uk/', '.de', '.edu', '.au', 'gov',
]

URL_DECODER2 = [
    'http://', 'https://', 'http', 'ftp', '://', 'www.', 'http://www.',

    # Common numbers, in particular dates
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
    '00', '01', '02', '03', '04', '05', '06', '07', '08', '09',
    '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
    '20', '21', '22', '23', '24', '25', '26', '27', '28', '29',
    '30', '31',
    '99', '199', '200', '201', '202', '203', '000', '0000',  # Years and big numbers
    '=0', '=1', '=2', '=3', '=4', '=5', '=6', '.1', '.2',  # IP addresses and article=012

    # Common extensions
    '.asp', '.aspx', '.php', '.jsp', '.shtml', 'cgi', 'cgi-bin/', '.cfm', 'flv', '.pl',
    # 's.html',
    '.html', '.htm', '.jpg', '.png', '.gif', '.css', 'pdf', 'doc', 'xml', 'txt',

    # Common url parts
    'news', 'review', 'web', 'blog', 'blogs', 'search', 'content', 'wiki', 'index',
    'default', 'view', 'topic', 'profile', 'document', 'archive', 'article', 'thread', 'print', 'history',
    'net', 'detail', 'glance', 'pages', 'find', 'business', 'feature', 'faq', 'mail', 'upload',
    'tip', 'issue', 'table', 'source', 'time', 'person', 'day', 'resource', 'publications', 'exec',
    'story', 'world', 'gallery', 'info', 'mag', 'show',

    # Common word bits
    'oo', 'th', 'to', 'in', 'is', 'um', 'it', 'er', 'an',
    'or', 'es', 'ar', 'me', 'us', 'id', 'on', 'as', 'al',
    'ra', 'uk', 'ac', 'nu', 'ca', 'tv', 'ti', 're', 'ne',
    'en', 'te', 'st', 'le', 'ic', 'nt',
    'ing', 'ion', 'ation', 'ati', 'sto', 'ent', 'ter', 'and', 'the', 'tio', 'for',

    # Either never get hit, or induce token starvation
    # '/a', 'g/',
    # '0.html', '1.html', '2.html',

    # Usable values in urls
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'v', 'w', 'x', 'y', 'z', '=', '?', '@', '~',
    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
    'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
    'U', 'V', 'W', 'X', 'Y', 'Z', '!', '$', '&', "'",
    '(', ')', '*', '+', ',', '-', '.', '/', ':', ';',

    # Common url bits
    '.com/', '.net', '.org', '.co.uk/', '.com/article',
    'de', 'edu', 'au', 'gov', 'mil',
]
```
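(To make the matching pitfall above concrete, here is a toy sketch, not PySmaz's actual encoder, contrasting greedy longest-match-at-the-current-position with the 'largest fragment first' idea Max mentions; function names and the token-list output are illustrative assumptions:)

```python
def greedy_encode(text, table):
    """Take the longest dictionary token at the current position,
    falling back to a raw character when nothing matches."""
    out, i = [], 0
    while i < len(text):
        match = max((t for t in table if text.startswith(t, i)),
                    key=len, default=None)
        if match is None:
            out.append(text[i])  # raw character
            i += 1
        else:
            out.append(match)
            i += len(match)
    return out

def largest_fragment_encode(text, table):
    """'Largest fragment first': take the longest token found anywhere
    in the text, then recurse on the pieces to either side of it."""
    for t in sorted(table, key=len, reverse=True):
        pos = text.find(t)
        if pos >= 0:
            return (largest_fragment_encode(text[:pos], table)
                    + [t]
                    + largest_fragment_encode(text[pos + len(t):], table))
    return list(text)  # nothing matched: raw characters

table = ['appy', 'h', 'ha']
print(greedy_encode('happy', table))            # ['ha', 'p', 'p', 'y'] - 'ppy' goes raw
print(largest_fragment_encode('happy', table))  # ['h', 'appy']
```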


thestick613 commented 8 years ago

Thank you for your response. I've experimented with building my own table and with your URL_DECODER, but the improvement margin was minimal. I've accumulated enough data in the last 6 months to fine-tune a new, better table. I just need to make a Huffman tree out of it, sort it by length and then alphanumerically, and it should be OK. I'll try to make a PR with the table generator.
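(A minimal sketch of that ordering, an assumption about the intended generator rather than code from PySmaz: longest entries first so big fragments are tried before their substrings, with alphanumeric tie-breaking for a stable table:)

```python
# 'entries' stands in for the fragments picked out of the Huffman tree.
entries = ['th', '.html', 'the', 'a']
table = sorted(entries, key=lambda s: (-len(s), s))
print(table)  # ['.html', 'the', 'th', 'a']
```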