Add LMF reader - Githubissues

goodmami commented 4 years ago

This branch builds on the "scaffold" branch (I think if you merge this you don't need to merge the other).

I add wn.lmf for reading LMF-XML files. It has one publicly accessible function:

>>> from wn import lmf
>>> lmf.load('path/to/wordnet.xml')

It loads the wordnet data into basic data structures that mirror the DTD for the format. We can use this to ingest the data into our internal representation. It would also be almost trivial to write a function to dump LMF data from the same data structures. The code uses typing.NamedTuple extensively to model the DTD, and it currently fully type-checks. It also passes flake8 tests with "maximum-line-length" set to 99.

It loads the English WordNet 2020 release (~100MB of XML) in about 11 seconds on my laptop and uses about 400MB of RAM to do so.

It does not yet have any unit tests, but here is a good minimal test case.

alvations commented 4 years ago

I think we can be less verbose in the LMF objects, e.g. these seems to be a common set of attributes for the objects:

status: Optional[str] = None
note: Optional[str] = None
confidence: Optional[float] = None
dcmeta: Optional[DublinCoreMetadata] = None

Perhaps we can do some sort of Mixins:

class MetaData:
    status: Optional[str] = None
    note: Optional[str] = None
    confidence: Optional[float] = None
    dcmeta: Optional[DublinCoreMetadata] = None

class Definition(NamedTuple, MetaData):
    text: str
    language: Optional[str] = None
    source_sense: Optional[str] = None

class IliProperties:
    ili: str
    ili_definition: Optional[str] = None

class Synset(NamedTuple, MetaData, IliProperties):
    id: str
    pos: Optional[str] = None  # Literal[*_pos] if Python 3.8+
    definitions: Tuple[Definition, ...] = ()
    relations: Tuple[SynsetRelation, ...] = ()
    examples: Tuple[Example, ...] = ()

goodmami commented 4 years ago

I don't think [typing.NamedTuple]() can be subclassed that way:

>>> from typing import NamedTuple
>>> class A(NamedTuple):
...     a: int
... 
>>> class B(A):
...     b: str
... 
>>> A(1)
A(a=1)
>>> B(1, 'b')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __new__() takes 2 positional arguments but 3 were given

But we could include the Metadata as a member of the other classes, similar to how the DublinCoreMetadata is done. I was trying to keep the classes tightly modeled to the LMF description, but it seems like that may be a bad idea if there are potential structural changes from the current 1.0 format (e.g., https://github.com/globalwordnet/schemas/issues/17).

goodmami commented 4 years ago

Thinking about the next step, which is to load the data parsed from LMF into an internal structure or storage, I was thinking about a relational database backend. If we go in this direction, it would be nice to have a clear interface "barrier" so users are not expected to construct sql queries, but instead they request synsets or lemmas and we perform the search and construct these objects. Also I think we can use a much simpler schema than OMW uses, since we're not concerned with constructing new wordnets.

We might also consider an ORM. SQLAlchemy is the standard package for this, but it is much more than we need. Pony ORM looks pretty nice and I hear it's very fast. With an ORM, the modeling of LMF in lmf.py would become largely redundant with the database models. For example:

class Synset(DbEntity):
    id = PrimaryKey(str)
    ili = Required(str)
    pos = Optional(str)
    definitions = Set(Definition)
    ili_definition = Optional(ILIDefinition)
    relations = Set(SynsetRelation)
    examples = Set(Example)
    metadata = Optional(Metadata)

alvations commented 4 years ago

I still think any SQL or DB is an overkill =) Even neural stuff with billions of parameters can be handle (with lots of hack).

alvations commented 4 years ago

Actually on NamedTuple? Why should WordNet objects be subclassed on namedtuples? Shouldn't these objects get functions as well?

A general underspecified object migh work, e.g.

class MetaData:
    status: Optional[str] = None
    note: Optional[str] = None
    confidence: Optional[float] = None
    dcmeta: Optional[DublinCoreMetadata] = None

class Definition(MetaData):
    text: str
    language: Optional[str] = None
    source_sense: Optional[str] = None

Maybe class Definition(Generic) or NewType?

Note that there's some assumptions made here; OOP-wise, it looks like inheritance but if we think about it, in the example above, it's merely that Definition here shares all attributes that MetaData has.

alvations commented 4 years ago

Lets merge this first and then iterate from there?

Currently, I think it's best to use as many native types/objects as possible. E.g. maybe definition doesn't need to be its own class but it looks like everything has some sort of meta-data attached to them.

From a user perspective, I think these metadata are not necessary. I.e. if a user intends to fetch a synset's definition, just a str type without all the meta is okay.

What if we do something a little different, WordNet objects are kept as native as possible and all the meta data goes into some big json attribute?

I guess we're approaching the problem from different perspectives then. I'm looking at (i) how a user would use the API to access WN values/information vs (ii) you're thinking more of how a WN dev would want to use to edit/read the WN through the API.

fcbond commented 4 years ago

Some of the meta data may be useful to users --- for example for confidence, users may want to only use entries with a confidence > alpha. But I agree most of it is not so important if you just want the semantic graph.

On Wed, Jun 10, 2020 at 9:53 PM alvations notifications@github.com wrote:

Lets merge this first and then iterate from there?

Currently, I think it's best to use as many native types/objects as possible. E.g. maybe definition doesn't need to be its own class but it looks like everything has some sort of meta-data attached to them.

From a user perspective, I think these metadata are not necessary. I.e. if a user intends to fetch a synset's definition, just a str type without all the meta is okay.

What if we do something a little different, WordNet objects are kept as native as possible and all the meta data goes into some big json attribute?

I guess we're approaching the problem from different perspectives then. I'm looking at (i) how a user would use the API to access WN values vs (ii) you're thinking more of how a WN dev would want to use to edit/read the WN through the API.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/alvations/gown/pull/5#issuecomment-642025988, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIPZRW7QE5NK5TYLD7MAVLRV6F4XANCNFSM4NZHYWGA .

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami commented 4 years ago

I still think any SQL or DB is an overkill =) Even neural stuff with billions of parameters can be handle (with lots of hack).

That's a totally different category, though. Here we have a data resource and we're providing an interface for querying it. I agree that DBs are a bit of a headache but it's actually an appropriate use here. It's not the only way to achieve what we want, but I'd like to have faster startup and queries than reading the XML gives us. I'm open to alternatives.

Actually on NamedTuple?

Actually what? Sorry I don't quite follow. And note that typing.NamedTuple is different from collection.namedtuple. It's just a convenient way to build typed data structures. Also see my next response.

Why should WordNet objects be subclassed on namedtuples? Shouldn't these objects get functions as well?

You can put methods on NamedTuples. You can't subclass them to add more members, a property it shares with regular namedtuple. But I think you missed the point of all these structures, which is to model, type-check, and validate the LMF data. The WordNet objects would be constructed using these, but these are not the final product. wn.lmf is just a parser for the LMF files. If we want something simpler but less "correct" (in a broad sense) we could just use regular tuples and dicts.

Currently, I think it's best to use as many native types/objects as possible. E.g. maybe definition doesn't need to be its own class but it looks like everything has some sort of meta-data attached to them.

Agreed about native types. NamedTuple is just one step removed; it's basically a regular tuple but the type information helps resolve hard-to-find bugs. I think it's useful for the LMF parser to read the full definition objects (and others), but for the WordNet interface it's probably fine to drop a lot of the extra stuff (as Francis said).

I guess we're approaching the problem from different perspectives then. I'm looking at (i) how a user would use the API to access WN values/information vs (ii) you're thinking more of how a WN dev would want to use to edit/read the WN through the API.

No, actually I'm looking at it from the user point of view, too, but this PR is just the LMF reader. I don't intend for the objects it produces to be what the user sees.

alvations commented 4 years ago

Ah, got it on the NamedTuple, it's just for the type-hints. I guess that's because as a reader component, these should be strongly typed to avoid reading errors.

Am I getting the components, right?

LMF Parser: Reads the XML file into some python structure
Interface object: Abstracts away from the nitty-gritties and create a simpler python object for users to access

I was thinking more of just stripping everything and start light with what's needed and then add in what we want to read from LMF later on. But both ways works.

Merge this?

goodmami commented 4 years ago

Yes, let's start an issue or project for API design and overall architecture so we're on the same page.

I'll merge now, then we can change it again later.

alvations / gown

Add LMF reader #5

Lets merge this first and then iterate from there?