Open goodmami opened 4 years ago
There may be some YAML format in the works; apparently it's 1/3 the filesize of the LMF XML. https://github.com/globalwordnet/english-wordnet/issues/31
Compressed size isn't so different. The English Wordnet 2020 release is 11.1MB compressed WNDB and 13.6MB compressed XML, but uncompressed is 36.6MB vs 103.9MB. I think the YAML data is similar in size to the WNDB when uncompressed.
The downside of the XML is that it probably takes longer to load, because we have to parse 103.9MB of XML vs 36.6MB of simple text files. If true, this would support creating some optimized local storage, such as an SQL database or even just pickle. That is, load the distributed LMF file once, then cache the loaded data in a form that's faster to load the next time.
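A rough sketch of that load-then-cache idea, using pickle only because it is the simplest option mentioned. The names `load_cached`, `parse`, and the `.cache.pkl` suffix are made up for illustration, and whether the cache is actually faster than re-parsing would need to be measured.

```python
import os
import pickle

def load_cached(source_path, parse, cache_path=None):
    """Return parse(source_path), reusing a pickle cache when it is fresh.

    `parse` is any callable that turns the source file into a picklable
    object (hypothetical here; e.g. an LMF or WNDB reader).
    """
    cache_path = cache_path or source_path + ".cache.pkl"
    # Reuse the cache only if it is at least as new as the source file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Otherwise parse the distributed file and write the cache for next time.
    data = parse(source_path)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

The first call pays the full parsing cost; later calls skip the parser until the source file changes. Note that pickle files should only be loaded if they come from a trusted location.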
Parsing XML can be incremental. For LMF, we can use a similar approach to how we parse the TMX files (with ~50M translations), e.g. https://www.gala-global.org/tmx-14b
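As a minimal sketch of what incremental parsing could look like with the standard library's `xml.etree.ElementTree.iterparse`: the element and attribute names (`Synset`, `LexicalEntry`, `id`, `ili`, `partOfSpeech`) are assumed from the WN-LMF schema, and `iter_synsets` and the tiny sample document are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_synsets(source):
    """Yield (id, ili, pos) for each Synset element, one at a time.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Synset":
            yield elem.get("id"), elem.get("ili"), elem.get("partOfSpeech")
            # Drop the children and attributes we no longer need. The
            # emptied element stays attached to its parent, but it is small.
            elem.clear()
        elif elem.tag == "LexicalEntry":
            elem.clear()

# Tiny made-up document in the assumed WN-LMF shape.
SAMPLE = b"""<LexicalResource>
  <Lexicon id="ex" label="Example" language="en">
    <LexicalEntry id="ex-dog-n">
      <Lemma writtenForm="dog" partOfSpeech="n"/>
      <Sense id="ex-dog-n-1" synset="ex-1-n"/>
    </LexicalEntry>
    <Synset id="ex-1-n" ili="i1" partOfSpeech="n"/>
    <Synset id="ex-2-n" ili="i2" partOfSpeech="n"/>
  </Lexicon>
</LexicalResource>"""

synsets = list(iter_synsets(io.BytesIO(SAMPLE)))
```

Because each element is cleared once it has been handled, memory use stays far below what building the whole 100MB tree would need.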
@fcbond @goodmami Is there any proper documentation of the LMF definition that we can start with?
> Parsing XML can be incremental.
For documentation I was using http://globalwordnet.github.io/schemas/#xml, WN-LMF-1.0.dtd, and the english-wordnet-2020.xml release as an example.
Btw when I tried to pickle the result of parsing the LMF file, it was not much faster to dump or load than just reading the XML.
I still think that a sqlite3 database might make a good back end, and that we could store all the information in an LMF file but only present to the user the simple stuff. That way we could keep open the possibility to dump LMF files, or power users could access the db directly. The average user could be completely ignorant of how the data is actually stored.
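To make the sqlite3 idea a bit more concrete, here is a minimal sketch. The two-table schema and the function names `build_db` and `synsets_for` are assumptions for illustration only; a real back end would need more tables to hold everything in an LMF file.

```python
import sqlite3

# Hypothetical, deliberately small schema: one row per synset,
# one row per (lemma, synset) pairing.
SCHEMA = """
CREATE TABLE IF NOT EXISTS synsets (
    id  TEXT PRIMARY KEY,
    ili TEXT,
    pos TEXT
);
CREATE TABLE IF NOT EXISTS senses (
    lemma     TEXT NOT NULL,
    synset_id TEXT NOT NULL REFERENCES synsets(id)
);
CREATE INDEX IF NOT EXISTS senses_lemma ON senses (lemma);
"""

def build_db(path, synsets, senses):
    """Create the tables and bulk-insert (id, ili, pos) and (lemma, synset_id) rows."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT OR REPLACE INTO synsets VALUES (?, ?, ?)", synsets)
        conn.executemany("INSERT INTO senses VALUES (?, ?)", senses)
    return conn

def synsets_for(conn, lemma):
    """The 'simple stuff' a typical user sees: synsets for a lemma."""
    rows = conn.execute(
        "SELECT s.id, s.ili, s.pos FROM synsets s "
        "JOIN senses e ON e.synset_id = s.id "
        "WHERE e.lemma = ? ORDER BY s.id",
        (lemma,),
    )
    return rows.fetchall()

conn = build_db(
    ":memory:",
    [("ex-1-n", "i1", "n"), ("ex-2-n", "i2", "n")],
    [("dog", "ex-1-n"), ("hound", "ex-1-n"), ("dog", "ex-2-n")],
)
dog_synsets = synsets_for(conn, "dog")
```

The average user would only ever call something like `synsets_for`, while power users could open the database file directly and run their own queries.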
But if that's off the table, shelve might be enough. E.g.,
import shelve

with shelve.open(wn_data_dir) as db:
    for synset in synsets:
        db[synset.ili] = synset  # keys must be str; values must be picklable
This issue pertains to the formats that wordnet data starts in and how we might store it. See #1 for thoughts on internal representations.
Some concerns:
WNDB format, LMF, or perhaps other formats (sqlite database, RDF store); what do we support?