cebel / pyuniprot

Python package to query and analyse UniProt
Apache License 2.0
22 stars 8 forks

Memory leak in update function #5

Closed cebel closed 7 years ago

cebel commented 7 years ago

```python
import pyuniprot
pyuniprot.update(taxids=[9606, 10090, 10116])  # human, mouse, rat
```

After ~42k entries in the database, the update process consumes ~12 GB of memory. I assume the problem lies in lxml; I found several articles describing similar problems with big XML files. BUT here I avoid loading the whole document into the iterparser (tested it: it directly starts with 5 GB of memory consumption and then increases constantly). If this can't be solved, it doesn't seem feasible to load the whole of UniProt.
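For context, the mitigation those articles usually recommend (a hedged sketch, not the pyuniprot code itself) is to clear each element after processing and delete its already-processed siblings, so the partially built tree behind `iterparse` does not keep growing:

```python
# Sketch of the commonly recommended lxml iterparse memory mitigation:
# clear each <entry> after use and drop processed siblings so the
# in-memory tree stays small. The XML here is illustrative only.
import io
from lxml import etree

xml = b"<uniprot>" + b"".join(
    b"<entry accession='P%05d'/>" % i for i in range(1000)
) + b"</uniprot>"

accessions = []
for event, elem in etree.iterparse(io.BytesIO(xml), tag="entry"):
    accessions.append(elem.get("accession"))
    elem.clear()                        # free the element's own content
    while elem.getprevious() is not None:
        del elem.getparent()[0]         # drop already-processed siblings

print(len(accessions))
```

If even this pattern still leaks, the problem is likely elsewhere (e.g. references kept by the caller, or state inside the parser itself), which matches the symptoms described above.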

cebel commented 7 years ago

Found an article recommending `collect_ids=False`, which I already set:

```python
parser = etree.XMLParser(collect_ids=False)
entries = etree.fromstringlist(entries_xml, parser)
```

But even this doesn't help.
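For reference, a self-contained version of that attempt (the `entries_xml` contents here are illustrative; `fromstringlist` expects a list of string fragments that together form one document):

```python
from lxml import etree

# Illustrative fragments forming a single XML document.
entries_xml = ["<entries>",
               "<entry accession='P12345'/>",
               "<entry accession='Q67890'/>",
               "</entries>"]

# collect_ids=False tells lxml not to build its internal hash table of
# XML IDs, which is one known source of extra memory during parsing.
parser = etree.XMLParser(collect_ids=False)
entries = etree.fromstringlist(entries_xml, parser)

print([e.get("accession") for e in entries])
```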

cebel commented 7 years ago

I found a workaround: reload lxml's etree module for every chunk of XML entries. Stupid, but it works, and memory usage is dramatically reduced.