IlyaGusev / rnnmorph

Morphological analyzer for Russian and English languages based on neural networks and dictionary-lookup systems.
Apache License 2.0
152 stars 24 forks

Implementation in a loop clogs up memory #6

Open molokanov50 opened 1 year ago

molokanov50 commented 1 year ago

I need to determine the grammatical case of terms in the texts of a large dataset. I found that memory usage grows by 0.3 to 0.7 MB on virtually every call of forms = predictor.predict(terms). Consider a simple example:

import re

def findCase(termNumber, text):  # find the grammatical case of the term at the given index in the text
    terms = text.split()
    forms = predictor.predict(terms)
    myTag = forms[termNumber].tag
    parts = re.split('\\|', myTag)
    for part in parts:
        subparts = re.split('=', part)
        if len(subparts) < 2:
            continue
        if subparts[0] == 'Case':
            return subparts[1].upper()
    return 'UNDEF'
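For reference, the tag-parsing step can be exercised in isolation, independently of the predictor. This sketch assumes `predict` returns tags in the pipe-delimited `Key=Value` format shown above (`extract_case` is a hypothetical helper mirroring the loop in `findCase`):

```python
def extract_case(tag):
    # Parse a pipe-delimited tag string such as "Case=Nom|Gender=Masc|Number=Sing"
    # and return the value of the "Case" key, upper-cased.
    for part in tag.split('|'):
        subparts = part.split('=')
        if len(subparts) < 2:
            continue
        if subparts[0] == 'Case':
            return subparts[1].upper()
    return 'UNDEF'

print(extract_case('Case=Nom|Gender=Masc|Number=Sing'))  # NOM
print(extract_case('Degree=Pos'))  # UNDEF
```

This confirms the parsing itself allocates nothing persistent; the memory growth must come from the `predict` call.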

Then, given a collection of texts, I can run:

myDict = {}
for i, text in enumerate(texts):
    myDict[i] = findCase(0, text)

I have 12500 texts with an average length of about 700 characters each. Running the whole dataset cost me an extra 1.5 GB of memory because of the predictor.predict(terms) calls. It looks as if the local variable forms stays in memory after the function returns; or does the RNNMorphPredictor model perhaps accumulate internal state across these calls? How can I free this memory?
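To pin down where the growth comes from, one can track net allocations across repeated calls with `tracemalloc`. The `DummyPredictor` below is a hypothetical stand-in that deliberately retains results, mimicking the observed behavior; in practice you would substitute the real `RNNMorphPredictor` instance:

```python
import gc
import tracemalloc

class DummyPredictor:
    """Hypothetical stand-in that leaks by retaining every result it returns."""
    def __init__(self):
        self._retained = []

    def predict(self, terms):
        result = [('Case=Nom', t) for t in terms]
        self._retained.append(result)  # unintended retention, simulating the leak
        return result

predictor = DummyPredictor()
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

for _ in range(1000):
    forms = predictor.predict(['слово'] * 10)

gc.collect()  # discard anything that is merely unreferenced garbage
after, _ = tracemalloc.get_traced_memory()
net_kib = (after - before) / 1024
print(f"net growth after 1000 calls: {net_kib:.1f} KiB")
tracemalloc.stop()
```

If the growth survives `gc.collect()`, the memory is still reachable (cached inside the predictor or the underlying model session) rather than a garbage-collection delay.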

molokanov50 commented 1 year ago

Update: there is no noticeable difference depending on the length of each individual text. I reduced the input text length to 10 tokens, roughly 80 characters, and memory usage stayed the same: 1.5 GB per 12500 texts. This makes my question even more pressing.