clulab / eidos

Machine reading system for World Modelers
Apache License 2.0

Ontology file format #322

Closed kwalcock closed 5 years ago

kwalcock commented 6 years ago

This is extending PR #311. I decided to make it an issue, even though it may be difficult to find.

kwalcock commented 6 years ago

Regarding WDI and the duplicate paths...

- WDI: # old WDI ontology
  - 2005_PPP_conversion_factor:
    - description:
      - "GDP (LCU per international $)"
  - 2005_PPP_conversion_factor:
    - description:
      - "private consumption (LCU per international $)"

In these entries the paths are often duplicated, but the descriptions differ. At one point (IIRC) the code threw these into a set based on path and then extracted them, which lost some of the descriptions. I believe that the grounding is intended to be based on those potentially lost descriptions. As we make the paths unique, should the descriptions somehow be combined so that they can all be used, something like this (ignore the syntax, which is going to change):

- WDI: # less old WDI ontology
  - 2005_PPP_conversion_factor:
    - description(s):
      - "GDP (LCU per international $)"
      - "private consumption (LCU per international $)"
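
A minimal sketch of the combining step, assuming the ontology has already been flattened into (path, description) pairs. The `Entry` case class and the `merge` helper are hypothetical names for illustration and are not taken from the Eidos code:

```scala
object MergeDescriptions {
  // Hypothetical flattened entry: one path paired with one description.
  final case class Entry(path: String, description: String)

  // Group entries by path, keeping first-seen path order and every
  // distinct description, so that no description is lost.
  def merge(entries: Seq[Entry]): Seq[(String, Seq[String])] = {
    val paths = entries.map(_.path).distinct
    val grouped = entries.groupBy(_.path)
    paths.map { path => path -> grouped(path).map(_.description).distinct }
  }
}
```

Unlike a set keyed on path alone, this keeps one node per path while carrying all of its descriptions along.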

kwalcock commented 6 years ago

Regarding the IDs... I'm not sure where they came from or what they are intended to do. Are they supposed to be unique within an ontology, between our ontologies, between ontologies built by anyone? At some point we started adding a top level to the path (-UN:, -FAO:, -WDI:). This in combination with the remaining unique path gives us uniqueness across at least our ontologies. This ID is so far never output anywhere. Is the ID supposed to just be shorter than the entire path or a simple way to search in the file for an entry (difficult in a tree)? Can we just call them UN1, UN2, UN3, etc.? If we edit an ontology, our IDs probably won't be consistent across different versions of it. An insertion or deletion would mess things up. UN200 would be "foo" in version 1 of the ontology and may be "bar" in version 2. Maybe that doesn't matter. I wonder whether the ID will be used whatsoever. Any advice?
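
To illustrate the versioning concern, here is a small sketch that contrasts positional IDs (UN1, UN2, ...) with IDs derived from the path itself. Both helpers and the hash-based scheme are assumptions made for illustration; nothing here reflects how Eidos assigns IDs:

```scala
object NodeIds {
  // Positional IDs (UN1, UN2, ...): they depend on where a path sits in the list,
  // so an insertion or deletion shifts every later ID.
  def sequential(prefix: String, paths: Seq[String]): Map[String, String] =
    paths.zipWithIndex.map { case (path, i) => path -> s"$prefix${i + 1}" }.toMap

  // Path-derived IDs: they depend only on the path text. String.hashCode is
  // deterministic, but collisions are possible, so this is only a sketch.
  def pathBased(prefix: String, paths: Seq[String]): Map[String, String] =
    paths.map(path => path -> s"$prefix-${Integer.toHexString(path.hashCode)}").toMap
}
```

With positional IDs, inserting one node changes the ID of every node after it; with path-derived IDs, a node keeps its ID for as long as its path is unchanged.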

kwalcock commented 6 years ago

And just for completeness, we have at least "examples" and "descriptions". These will not be generalized into "texts" or "values" or something, because "descriptions" is our trigger to perform filtering on POS. Rather than indicate this at every single node, assuming all nodes are the same in the ontology, this could be indicated at some higher level. In that case, the branch for field ("texts", "values", etc.) could disappear. Then probably someone will have an entry with both examples and descriptions and mess it up. So, keep it? BTW, our code does not account very well for the words "examples" and "descriptions" being potential parts of the path. Should we be escaping in some way?

- UN: # old UN ontology
  - events:
    - nature_impact:
      - positive_nature_impact:
        - examples:
          - conservation
          - sustainable use
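
One way to make the collision visible is to check each path for segments that match a field name. This is a hypothetical validation sketch; the `ReservedNames` object is not part of Eidos, and the reserved set is just the two field names mentioned above:

```scala
object ReservedNames {
  // Field names that are treated specially, per the discussion above.
  val reserved: Set[String] = Set("examples", "descriptions")

  // Return the path segments that would be mistaken for field names.
  def collisions(path: Seq[String]): Seq[String] = path.filter(reserved.contains)
}
```

A non-empty result flags a path that would need escaping or renaming before it could be stored unambiguously.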

MihaiSurdeanu commented 6 years ago

kwalcock commented 6 years ago

Do you mean YAML format instead? So far, all input to Eidos is YAML and output is JSON.

MihaiSurdeanu commented 6 years ago

Sorry, I meant YAML!

kwalcock commented 6 years ago

The old "toy" ontology gets a top level and "examples" so that it doesn't have to use "others" that was seen in the code. Maybe there's an ID. [This has been updated based on comments below.]

- events: # old "toy" ontology
  - crisis:
    - "crisis"
    - "emergency"
  - natural:
    - weather:
      - precipitation:
        - "precipitation"
        - "rain"

- TOY: # new "toy" ontology
  - events:
    - OntologyNode:
      name: crisis
      examples:
        - "crisis"
        - "emergency"
    - natural:
      - weather:
        - OntologyNode:
          name: precipitation
          examples:
            - "precipitation"
            - "rain"

kwalcock commented 6 years ago

Not much is changed for UN except a "-" if I understand the syntax correctly. [This has been updated based on comments below.]

- UN: # old UN ontology
  - events:
    - nature_impact:
      - positive_nature_impact:
        - examples:
          - conservation
          - sustainable use
          - afforestation
          - agroforestry
      - negative_nature_impact:
        - examples:
          - pollution
          - deforestation
          - soil erosion
          - desertification

- UN: # new UN ontology
  - events:
    - nature_impact:
      - OntologyNode:
        name: positive_nature_impact
        examples:
          - conservation
          - sustainable use
          - afforestation
          - agroforestry
      - OntologyNode:
        name: negative_nature_impact
        examples:
          - pollution
          - deforestation
          - soil erosion
          - desertification

kwalcock commented 6 years ago

FAO only ever had one description, but since WDI can have multiple, it is a list here. FAO has not used _ instead of spaces for node labels, unlike the other files. We should pick one way or the other. We could make use of this and call our things _examples and _descriptions. [This has been updated based on comments below.]

- FAO: # old FAO ontology
  - events:
    - Agriculture orientation index:
      - Gross Fixed Capital Formation (Agriculture, Forestry and Fishing):
        - description:
          - "Agriculture orientation index Gross Fixed Capital Formation (Agriculture, Forestry and Fishing)"
      - DFA Commitment to Agriculture, Forestry and Fishing:
        - description:
          - "Agriculture orientation index DFA Commitment to Agriculture, Forestry and Fishing"

- FAO: # new FAO ontology
  - events:
    - Agriculture orientation index:
      - OntologyNode:
        name: Gross Fixed Capital Formation (Agriculture, Forestry and Fishing)
        descriptions: 
          - "Agriculture orientation index Gross Fixed Capital Formation (Agriculture, Forestry and Fishing)"
      - OntologyNode:
        name: DFA Commitment to Agriculture, Forestry and Fishing
        descriptions:
          - "Agriculture orientation index DFA Commitment to Agriculture, Forestry and Fishing"

kwalcock commented 6 years ago

Duplicate descriptions are combined for WDI. [This has been updated based on comments below.]

- WDI: # old WDI ontology
  - 2005_PPP_conversion_factor:
    - description:
      - "GDP (LCU per international $)"
  - 2005_PPP_conversion_factor:
    - description:
      - "private consumption (LCU per international $)"
  - Access_to_clean_fuels_and_technologies_for_cooking__(%_of_population):
    - description:
      - "Access to clean fuels and technologies for cooking is the proportion of total population primarily using clean cooking fuels and technologies for cooking. Under WHO guidelines"

- WDI: # new WDI ontology
  - OntologyNode:
    name: 2005_PPP_conversion_factor
    descriptions:
      - "GDP (LCU per international $)"
      - "private consumption (LCU per international $)"
  - OntologyNode:
    name: Access_to_clean_fuels_and_technologies_for_cooking__(%_of_population)
    descriptions:
      - "Access to clean fuels and technologies for cooking is the proportion of total population primarily using clean cooking fuels and technologies for cooking. Under WHO guidelines"

MihaiSurdeanu commented 6 years ago

What I had in mind was something closer to Marco's format. For example, the FAO one would look like:

What do you think? Note the explicit OntologyNode. Also, fields not present are simply not listed in the YAML.

kwalcock commented 6 years ago

Oh, I thought OntologyNode was a general description of what is "positive_nature_impact:" in one place and "negative_nature_impact:" in the next place. I'm very good at misunderstanding. "name:" is new and is the last (terminal) part of the path. I like explicit. IDs are apparently gone. I was assuming that we would generate them because they are probably not in the source data.

MihaiSurdeanu commented 6 years ago

I like explicit too.

"id" is gone for now. But might be populated in the future, somehow...

kwalcock commented 6 years ago

@ZhengTang1120, are we inserting the _ in these things or is it in the source data?

nature_impact
positive_nature_impact
Access_to_clean_fuels_and_technologies_for_cooking__(%_of_population)

MihaiSurdeanu commented 6 years ago

These are in the source data. I believe these follow the GSN naming convention.

kwalcock commented 6 years ago

Should we be GSNifying (Game Show Network, I assume) the FAO data which does not appear to follow that convention? Thanks.

MihaiSurdeanu commented 6 years ago

We should. This was assigned to Ajay and Zheng. But it’s not the highest priority. At least for the next week or so.

BeckySharp commented 6 years ago

We should generate the GSN format. I don't think either ontology is currently in the GSN format; it’s much more specific than simply underscores.

ZhengTang1120 commented 6 years ago

@MihaiSurdeanu @kwalcock Actually, '_' was added by us to replace the whitespace in the names.
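
A sketch of that kind of substitution, for reference. The `underscore` helper is hypothetical and collapses each run of whitespace into a single underscore, which may differ from what the actual conversion script does (the double underscore in the WDI name above suggests it replaces characters one at a time):

```scala
object NodeNames {
  // Trim the name, then replace each run of whitespace with a single underscore.
  def underscore(name: String): String = name.trim.replaceAll("\\s+", "_")
}
```

Applied to the FAO labels, this would bring them in line with the underscore style of the UN and WDI files, though it is not the GSN format itself.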

MihaiSurdeanu commented 6 years ago

Ok. Then the GSN formatting still needs to happen. Nevertheless, let’s go with these names for now.

ZhengTang1120 commented 6 years ago

I updated the ontologies; please let me know if there are bugs in them.

I checked the WDI ontology again and found that there are commas in the name field. So the duplicates we got are due to a bug in my code. Sorry about that.

kwalcock commented 6 years ago

If there is just a single description, then it should probably look like this:

description: "GDP (LCU per international $)"
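
If both the single-value and the list form are allowed, the reader has to accept either shape. Below is a minimal sketch, assuming the YAML loader hands back either a `String` or a Java/Scala collection of strings; the `normalize` helper is a hypothetical name and not Eidos code:

```scala
object Descriptions {
  // Accept a missing value, a single string, or a collection of strings,
  // and always return a sequence of descriptions.
  def normalize(value: Any): Seq[String] = value match {
    case null                           => Seq.empty
    case single: String                 => Seq(single)
    case many: java.util.Collection[_]  => many.toArray().toSeq.map(_.toString)
    case many: Iterable[_]              => many.map(_.toString).toSeq
    case other                          => Seq(other.toString)
  }
}
```

Downstream code can then treat every node as having a list of descriptions, regardless of how the file spelled it.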

I'm adding a couple more comments to your commit.

kwalcock commented 6 years ago

At some point it was decided that "toy" was not a professional enough name for an ontology. If we need to keep this one at all, can we think of a better name? Can we use "ex" as in "example" or meaning "former"?

BeckySharp commented 6 years ago

@kwalcock -- there's a part of me that likes the idea of keeping the older one around, but I don't think that it's a good idea (perhaps just something akin to misplaced nostalgia?). At the very least it adds confusion (just how many ontologies are UA ppl grounding to?) so we should either move or remove it.... @ZhengTang1120 can you do whatever @kwalcock tells you to do with this file (ontology.yml) -- unless @MihaiSurdeanu wants to weigh in here

MihaiSurdeanu commented 6 years ago

I agree with @bsharpataz: we should keep just the latest version of each ontology. We can always go back in time on github to find the older ones.

kwalcock commented 6 years ago

It sounds like the toy ontology should go. I will also remove the code and configuration settings related to it. @ZhengTang1120 doesn't need to update it then, either.

MihaiSurdeanu commented 6 years ago

What we discussed this morning probably fits here. @ZhengTang1120, can you please make sure that duplicates are handled correctly in the WDI ontology? For example, there are 4 nodes for Adjusted_net_savings:

https://github.com/clulab/eidos/blob/master/src/main/resources/org/clulab/wm/eidos/ontologies/wdi_ontology.yml#L86

even though each has a different description. This may be a bug: if the original names are indeed different, please fix, and list them as different nodes. If the names are the same in the original files, please merge them into a single node, and include all descriptions in the merged node.

@kwalcock should be in the loop on this one.

kwalcock commented 6 years ago

It looks like the duplicates mentioned just above have been fixed in the zheng branch, which is where we've been working. I think a comma in the input complicated something, but it looks better now. I should have realized this possibility earlier in the conversation yesterday.

kwalcock commented 6 years ago

For the FAO and WDI ontologies which have descriptions, we are filtering 1272 and 1586 sentences respectively through the processor's annotator. FAO's short sentences are taking ~1 minute, but WDI is taking around 6.5 minutes. The result is the very same each time and it seems unnecessary to do this every time the program is called (with the applicable settings which are normally off). I wonder whether this filtering should happen during ontology construction instead. We only need the good words in the file. If necessary, there could be an optional field for the pre-processed descriptions. If it isn't there, then they can be filtered, but if they are there, they are used as is.
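
The optional-field idea could be sketched as follows. The `Node` shape, the `filtered` field name, and the `filter` function parameter are all assumptions for illustration; the real filtering goes through the processor's annotator, which is not reproduced here:

```scala
object FilteredDescriptions {
  // Hypothetical node shape: `filtered` is the optional pre-processed field.
  final case class Node(
    name: String,
    descriptions: Seq[String],
    filtered: Option[Seq[String]] = None
  )

  // Use the pre-processed words as is when present; otherwise run the
  // (expensive) filter over the raw descriptions.
  def words(node: Node, filter: String => Seq[String]): Seq[String] =
    node.filtered.getOrElse(node.descriptions.flatMap(filter))
}
```

Because `getOrElse` takes its argument by name, the filter is never invoked for nodes that already carry pre-processed words.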

BeckySharp commented 6 years ago

If we add preprocessed deps (perhaps optionally), I’d like to suggest that we do fairly minimal editing before or after and just store the whole Document. Alternatively, you can try using a mkPartialAnnotation method. We sometimes use those to speed things up when we don’t need everything (like coreference resolution, etc). If you want to go this route I can point out an example, or you can grep for mkPartialAnnotation in the code base.

MihaiSurdeanu commented 6 years ago

I agree filtering should happen offline. And if this happens offline, we don't need to store annotations, no? Am I missing something?

kwalcock commented 6 years ago

I will check out mkPartialAnnotation. It may make a more extensive change less urgent.

kwalcock commented 6 years ago

@bsharpataz, I don't see a mkPartialAnnotation anywhere in eidos or processors. Can I be grepping wrong?

BeckySharp commented 6 years ago

Nah, I wasn’t sure where all they were. I think one may still exist in the meanteacher branch in an app called... DumpPaths. They exist in the sia repo, and probably in my subproject (qaj) of the research repo.

But they basically follow processors README/docs: https://github.com/clulab/processors/blob/master/README.md#using-individual-annotators

I’m pretty sure we only need up through parse(), nothing past like discourse or semantic roles.

BeckySharp commented 6 years ago

Here's one:

def mkPartialAnnotation(text: String): Document = {
  val doc = procFast.mkDocument(text)
  procFast.tagPartsOfSpeech(doc)
  procFast.lemmatize(doc)
  procFast.parse(doc)
  doc.clear()
  doc
}
