goodmami commented 4 years ago

This issue pertains to the formats that wordnet data starts in and how we might store it. See #1 for thoughts on internal representations.

Some concerns:

- Non-English wordnets may target the Princeton WordNet 3.0's synsets as a backbone, or perhaps some other version/source (3.1, or ILI?)
- Data may be distributed in ~~WNDB format,~~ LMF, or perhaps other formats (sqlite database, RDF store); what do we support?
  - According to Francis, WNDB is deprecated and should not be supported
- Once the source files are parsed, do we store our own caches or perhaps a relational database to optimize future operations?

goodmami commented 4 years ago

There may be some YAML format in the works; apparently it's 1/3 the filesize of the LMF XML. https://github.com/globalwordnet/english-wordnet/issues/31

goodmami commented 4 years ago

Compressed size isn't so different. The English Wordnet 2020 release is 11.1MB compressed WNDB and 13.6MB compressed XML, but uncompressed is 36.6MB vs 103.9MB. I think the YAML data is similar in size to the WNDB when uncompressed.

The downside of the XML is that it probably takes longer to load, since the parser has to read 103.9MB of XML vs 36.6MB of simple text files. If true, this would lend support to us creating some optimized local storage, such as an SQL database or even just pickle. That is, load the LMF file that was distributed, then cache the loaded data in a way that's faster to load the next time.
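A minimal sketch of that load-then-cache idea, assuming a pickle file stored next to nothing else in a cache directory and invalidated by the source file's modification time. The names `load_with_cache`, `parse`, and `cache_dir` are hypothetical, and whether the cached load is actually faster would need to be measured.

```python
import pickle
from pathlib import Path


def load_with_cache(source, parse, cache_dir):
    """Return parse(source), reusing a pickle cache while the source is unchanged.

    `parse` is any callable that takes a path and returns picklable data
    (hypothetical here; e.g. an LMF reader).
    """
    source = Path(source)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache = cache_dir / (source.name + ".pickle")

    # Reuse the cache only if it is at least as new as the source file.
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        with cache.open("rb") as f:
            return pickle.load(f)

    data = parse(source)  # slow path: parse the distributed file
    with cache.open("wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

The same shape would work with a different cache format; only the two `pickle` calls would change.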

alvations commented 4 years ago

Parsing XML can be incremental. For LMF, we can use a similar approach to how we parse the TMX files (with ~50M translations), e.g. https://www.gala-global.org/tmx-14b

@fcbond @goodmami Is there any proper documentation of the LMF definition that we can start with?

goodmami commented 4 years ago

> Parsing XML can be incremental.

#5 uses xml.etree.ElementTree.iterparse(), which is incremental, and I clear the root node after each LexicalEntry or Synset so the document isn't stored entirely in memory. I think the in-memory size of the loaded LMF file is just the Python objects containing the data, not the ElementTree objects.
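For illustration, incremental parsing along these lines might look like the sketch below. It is not the code from the pull request: the function name `iter_lmf` and the sample document are made up, and instead of clearing the root node it detaches each finished element from its enclosing Lexicon, which is a swapped-in variant with the same goal of not keeping the whole tree in memory.

```python
import io
import xml.etree.ElementTree as ET


def iter_lmf(source):
    """Yield (tag, attributes) for each LexicalEntry and Synset in an LMF file."""
    lexicon = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "Lexicon":
                lexicon = elem  # remember the parent of entries and synsets
        elif elem.tag in ("LexicalEntry", "Synset"):
            # Copy out plain Python data so no Element objects are retained.
            yield elem.tag, dict(elem.attrib)
            if lexicon is not None:
                # Detach the finished subtree so it can be garbage collected.
                lexicon.remove(elem)


# Tiny made-up document, just to show the call.
SAMPLE = b"""<LexicalResource>
  <Lexicon id="demo">
    <LexicalEntry id="e1"><Lemma writtenForm="dog" partOfSpeech="n"/></LexicalEntry>
    <Synset id="s1" ili="i1"/>
  </Lexicon>
</LexicalResource>"""

for tag, attrs in iter_lmf(io.BytesIO(SAMPLE)):
    print(tag, attrs["id"])
```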

For documentation I was using http://globalwordnet.github.io/schemas/#xml, WN-LMF-1.0.dtd, and the english-wordnet-2020.xml release as an example.

goodmami commented 4 years ago

Btw when I tried to pickle the result of parsing the LMF file, it was not much faster to dump or load than just reading the XML.
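A rough way to repeat that comparison on other data is to time the pickle round trip and set it against the time taken to parse the XML. This is only a sketch: `time_call` and `time_pickle_roundtrip` are hypothetical helper names, and single-run wall-clock timings like these are noisy.

```python
import pickle
import time


def time_call(func, *args):
    """Return (result, elapsed seconds) for one call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def time_pickle_roundtrip(data):
    """Time pickle.dumps and pickle.loads for already-parsed data."""
    blob, dump_s = time_call(pickle.dumps, data, pickle.HIGHEST_PROTOCOL)
    loaded, load_s = time_call(pickle.loads, blob)
    return {
        "dump_s": dump_s,
        "load_s": load_s,
        "bytes": len(blob),
        "equal": loaded == data,  # sanity check on the round trip
    }
```

Timing the XML parse with the same `time_call` helper gives the number to compare against.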

goodmami commented 4 years ago

I still think that a sqlite3 database might make a good back end, and that we could store all the information in an LMF file but only present to the user the simple stuff. That way we could keep open the possibility to dump LMF files, or power users could access the db directly. The average user could be completely ignorant of how the data is actually stored.
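As a sketch of what such a back end could look like, the example below uses the standard-library `sqlite3` module with a deliberately tiny schema. The table names, column names, and helper functions are all hypothetical; a real schema would need to cover everything in an LMF file.

```python
import sqlite3

# Hypothetical minimal schema: one row per synset, one row per lemma sense.
SCHEMA = """
CREATE TABLE IF NOT EXISTS synsets (
    id  TEXT PRIMARY KEY,
    ili TEXT,
    pos TEXT
);
CREATE TABLE IF NOT EXISTS senses (
    lemma     TEXT NOT NULL,
    synset_id TEXT NOT NULL REFERENCES synsets(id)
);
CREATE INDEX IF NOT EXISTS senses_lemma ON senses (lemma);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_synset(conn, synset_id, ili, pos, lemmas):
    """Insert one synset and its lemmas in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO synsets VALUES (?, ?, ?)", (synset_id, ili, pos)
        )
        conn.executemany(
            "INSERT INTO senses VALUES (?, ?)",
            [(lemma, synset_id) for lemma in lemmas],
        )


def synsets_for(conn, lemma):
    """Return the ids of all synsets that contain the given lemma."""
    rows = conn.execute(
        "SELECT s.id FROM synsets AS s "
        "JOIN senses AS n ON n.synset_id = s.id "
        "WHERE n.lemma = ? ORDER BY s.id",
        (lemma,),
    )
    return [row[0] for row in rows]
```

A user-facing API could wrap calls like `synsets_for(conn, "dog")` so that the storage stays hidden, while power users query the database file directly.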

But if that's off the table, shelve might be enough. E.g.,

```python
import shelve

# shelve keys must be strings; this assumes synset.ili is a str
with shelve.open(wn_data_dir) as db:
    for synset in synsets:
        db[synset.ili] = synset
```

alvations / gown

How to load and store data #2
