try to improve import speed.

walkinrain2008 commented 5 years ago

Importing pyjsparser is very slow on my computer. Test code:

print("begin")
ticks = []
import time
tick_s = time.perf_counter()
import pyjsparser.pyjsparserdata
ticks.append(time.perf_counter() - tick_s)
print("import time =",ticks)

The result is:

begin
import time = [1.543109415]

I found that the slow code is:

    for c in map(unichr, range(sys.maxunicode + 1)):
        U_CATEGORIES[unicodedata.category(c)].append(c)

I found two things: (1) the variable U_CATEGORIES is very large, and (2) U_CATEGORIES is only used to assign values to other variables. I changed the code in pyjsparserdata.py at line 224 to:

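import pickle
import os

# Cache file stored next to this module
fn = os.path.dirname(os.path.abspath(__file__))
fn = os.path.join(fn, "u_dict.bin")
try:
    # Fast path: load the precomputed sets from the cache file
    with open(fn, "rb") as f:
        UNICODE_LETTER,UNICODE_COMBINING_MARK,UNICODE_DIGIT,UNICODE_CONNECTOR_PUNCTUATION,IDENTIFIER_START,IDENTIFIER_PART = pickle.load(f)
except FileNotFoundError:
    # Slow path (first run): build the sets, then write the cache file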
    for c in map(unichr, range(sys.maxunicode + 1)):
        U_CATEGORIES[unicodedata.category(c)].append(c)
    UNICODE_LETTER = set(U_CATEGORIES['Lu']+U_CATEGORIES['Ll']+
                         U_CATEGORIES['Lt']+U_CATEGORIES['Lm']+
                         U_CATEGORIES['Lo']+U_CATEGORIES['Nl'])
    UNICODE_COMBINING_MARK = set(U_CATEGORIES['Mn']+U_CATEGORIES['Mc'])
    UNICODE_DIGIT = set(U_CATEGORIES['Nd'])
    UNICODE_CONNECTOR_PUNCTUATION = set(U_CATEGORIES['Pc'])
    IDENTIFIER_START = UNICODE_LETTER.union(set(('$','_', '\\'))) # and some fucking unicode escape sequence
    IDENTIFIER_PART = IDENTIFIER_START.union(UNICODE_COMBINING_MARK).union(UNICODE_DIGIT)\
        .union(UNICODE_CONNECTOR_PUNCTUATION).union(set((ZWJ, ZWNJ)))
    saveVar=[UNICODE_LETTER,UNICODE_COMBINING_MARK,UNICODE_DIGIT,UNICODE_CONNECTOR_PUNCTUATION,IDENTIFIER_START,IDENTIFIER_PART]
    with open(fn,"wb") as f:
        pickle.dump(saveVar, f)

Test result, first run:

begin
import time = [1.836442004]

Second run:

begin
import time = [0.240046614]

I hope this code is useful for your module. My English is poor, please forgive me!

walkinrain2008 commented 5 years ago

I searched for these variables in the *.py files and found that only IDENTIFIER_START and IDENTIFIER_PART are actually used. When saving and loading only IDENTIFIER_START and IDENTIFIER_PART, the second-run result is:

begin
import time = [0.170071518]

I found that IDENTIFIER_START and IDENTIFIER_PART are only used in functions like:

def isIdentifierStart(ch):
    return (ch if isinstance(ch, unicode) else unichr(ch)) in IDENTIFIER_START

Would it be possible to use another approach? For example:

c_LETTER = ['Lu','Ll','Lt','Lm','Lo','Nl']
c_PART = ['Lu','Ll','Lt','Lm','Lo','Nl','Mn','Mc','Nd','Pc']
def isIdentifierStart(ch):
    c = ch if isinstance(ch, unicode) else unichr(ch)
    # Look up the category on demand instead of precomputing a set of all characters
    return (unicodedata.category(c) in c_LETTER) or (c in ('$','_', '\\'))

I hope this code is useful for you.
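
For reference, the category-lookup idea above can be written as a self-contained Python 3 sketch. This is only an illustration, not the code that ended up in pyjsparser: the names START_CATEGORIES, PART_CATEGORIES, _as_char, is_identifier_start and is_identifier_part are hypothetical, and it assumes Python 3 (str/chr instead of unicode/unichr). The category lists and the extra characters ('$', '_', '\\', ZWJ, ZWNJ) are taken from the code quoted earlier in this thread.

```python
import unicodedata

# Hypothetical names, for illustration only.
# Categories allowed at the start of an identifier (same list as c_LETTER).
START_CATEGORIES = frozenset(['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl'])
# Categories allowed after the first character (same list as c_PART).
PART_CATEGORIES = START_CATEGORIES | frozenset(['Mn', 'Mc', 'Nd', 'Pc'])

ZWNJ = '\u200c'  # zero-width non-joiner
ZWJ = '\u200d'   # zero-width joiner


def _as_char(ch):
    # Accept either a one-character string or an integer code point.
    return ch if isinstance(ch, str) else chr(ch)


def is_identifier_start(ch):
    c = _as_char(ch)
    # One category lookup per call; no table of all characters is built.
    return c in ('$', '_', '\\') or unicodedata.category(c) in START_CATEGORIES


def is_identifier_part(ch):
    c = _as_char(ch)
    return (c in ('$', '_', '\\', ZWJ, ZWNJ)
            or unicodedata.category(c) in PART_CATEGORIES)
```

With this shape nothing is computed at import time, so there is no cache file to maintain. The trade-off is a unicodedata.category call on every check instead of a set membership test.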

andreymal commented 5 years ago

U_CATEGORIES also uses a lot of memory (~100MB)

This is a big problem for me
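
A rough way to check the memory figure is to rebuild the per-category lists and add up the object sizes. The sketch below assumes CPython 3 (sys.getsizeof is implementation-specific) and uses a hypothetical helper name, estimate_category_table_bytes; it only approximates what the module allocates, and the number varies between Python versions.

```python
import sys
import unicodedata
from collections import defaultdict


def estimate_category_table_bytes():
    """Build per-category character lists and estimate their size in bytes.

    Hypothetical helper for illustration; mirrors the U_CATEGORIES loop
    quoted earlier in this thread.
    """
    categories = defaultdict(list)
    for code_point in range(sys.maxunicode + 1):
        c = chr(code_point)
        categories[unicodedata.category(c)].append(c)

    # Size of every one-character string plus the lists that hold them.
    string_bytes = sum(sys.getsizeof(c)
                       for chars in categories.values() for c in chars)
    list_bytes = sum(sys.getsizeof(chars) for chars in categories.values())
    return len(categories), string_bytes + list_bytes


if __name__ == "__main__":
    n_categories, total = estimate_category_table_bytes()
    print("categories:", n_categories)
    print("approx. size: %.1f MB" % (total / 1e6))
```

Most of the cost comes from keeping one string object alive for every code point, which is why looking categories up on demand avoids it.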

PiotrDabkowski commented 5 years ago

I agree the current implementation is not ideal. I will try to fix it when I have some time.

PiotrDabkowski commented 5 years ago

Fixed in 3467b65

PiotrDabkowski / pyjsparser

try to improve import speed. #23