Regarding WDI and the duplicate paths...
- WDI: # old WDI ontology
  - 2005_PPP_conversion_factor:
    - description:
      - "GDP (LCU per international $)"
  - 2005_PPP_conversion_factor:
    - description:
      - "private consumption (LCU per international $)"
On these the paths are often duplicated, but the descriptions are different. At one point (IIRC) the code threw these into a set based on path and then extracted them. This resulted in a loss of some of the descriptions. I believe that the grounding is intended to be based on those potentially lost descriptions. As we make the paths unique, should the descriptions somehow be combined so that they can all be used, something like this (ignore the syntax which is going to change):
- WDI: # less old WDI ontology
  - 2005_PPP_conversion_factor:
    - description(s):
      - "GDP (LCU per international $)"
      - "private consumption (LCU per international $)"
Regarding the IDs... I'm not sure where they came from or what they are intended to do. Are they supposed to be unique within an ontology, between our ontologies, or between ontologies built by anyone? At some point we started adding a top level to the path (- UN:, - FAO:, - WDI:). This, in combination with the remaining unique path, gives us uniqueness across at least our ontologies. The ID is so far never output anywhere. Is the ID supposed to just be shorter than the entire path, or a simple way to search the file for an entry (which is difficult in a tree)? Can we just call them UN1, UN2, UN3, etc.? If we edit an ontology, our IDs probably won't be consistent across different versions of it: an insertion or deletion would mess things up, so UN200 might be "foo" in version 1 of the ontology and "bar" in version 2. Maybe that doesn't matter. I wonder whether the ID will be used at all. Any advice?
And just for completeness, we have at least "examples" and "descriptions". These will not be generalized into "texts" or "values" or something, because "descriptions" is our trigger to perform filtering on POS. Rather than indicating this at every single node, assuming all nodes in the ontology are the same, it could be indicated at some higher level. In that case, the branch for the field ("texts", "values", etc.) could disappear. Then someone will probably have an entry with both examples and descriptions and mess it up. So, keep it? BTW, our code does not account very well for the words "examples" and "descriptions" being potential parts of the path. Should we be escaping them in some way?
- UN: # old UN ontology
  - events:
    - nature_impact:
      - positive_nature_impact:
        - examples:
          - conservation
          - sustainable use
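To make the POS filtering on "descriptions" mentioned above concrete, it amounts to something roughly like this (a sketch only; the processor setup and the exact tag filter in our code may differ):

import org.clulab.processors.fastnlp.FastNLPProcessor

object DescriptionFilter {
  // Sketch only: keep the nouns, verbs, and adjectives from one description string.
  val proc = new FastNLPProcessor()

  def contentWords(description: String): Seq[String] = {
    val doc = proc.annotate(description)
    for {
      sentence <- doc.sentences.toSeq
      tags <- sentence.tags.toSeq            // POS tags are an Option[Array[String]]
      (word, tag) <- sentence.words.zip(tags)
      if tag.startsWith("NN") || tag.startsWith("VB") || tag.startsWith("JJ")
    } yield word
  }
}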
Do you mean YAML format instead? So far, all input to Eidos is YAML and output is JSON.
Sorry, I meant YAML!
The old "toy" ontology gets a top level and "examples" so that it doesn't have to use "others" that was seen in the code. Maybe there's an ID. [This has been updated based on comments below.]
- events: # old "toy" ontology
  - crisis:
    - "crisis"
    - "emergency"
  - natural:
    - weather:
      - precipitation:
        - "precipitation"
        - "rain"
- TOY: # new "toy" ontology
  - events:
    - OntologyNode:
      name: crisis
      examples:
        - "crisis"
        - "emergency"
    - natural:
      - weather:
        - OntologyNode:
          name: precipitation
          examples:
            - "precipitation"
            - "rain"
Not much is changed for UN except a "-" if I understand the syntax correctly. [This has been updated based on comments below.]
- UN: # old UN ontology
  - events:
    - nature_impact:
      - positive_nature_impact:
        - examples:
          - conservation
          - sustainable use
          - afforestation
          - agroforestry
      - negative_nature_impact:
        - examples:
          - pollution
          - deforestation
          - soil erosion
          - desertification
- UN: # new UN ontology
  - events:
    - nature_impact:
      - OntologyNode:
        name: positive_nature_impact
        examples:
          - conservation
          - sustainable use
          - afforestation
          - agroforestry
      - OntologyNode:
        name: negative_nature_impact
        examples:
          - pollution
          - deforestation
          - soil erosion
          - desertification
FAO only ever had one description, but since WDI can have multiple, it is a list here. Unlike the other files, FAO has not used _ instead of spaces for node labels. We should pick one way or the other. We could make use of this and call our fields _examples and _descriptions. [This has been updated based on comments below.]
- FAO: # old FAO ontology
  - events:
    - Agriculture orientation index:
      - Gross Fixed Capital Formation (Agriculture, Forestry and Fishing):
        - description:
          - "Agriculture orientation index Gross Fixed Capital Formation (Agriculture, Forestry and Fishing)"
      - DFA Commitment to Agriculture, Forestry and Fishing:
        - description:
          - "Agriculture orientation index DFA Commitment to Agriculture, Forestry and Fishing"
- FAO: # new FAO ontology
  - events:
    - Agriculture orientation index:
      - OntologyNode:
        name: Gross Fixed Capital Formation (Agriculture, Forestry and Fishing)
        descriptions:
          - "Agriculture orientation index Gross Fixed Capital Formation (Agriculture, Forestry and Fishing)"
      - OntologyNode:
        name: DFA Commitment to Agriculture, Forestry and Fishing
        descriptions:
          - "Agriculture orientation index DFA Commitment to Agriculture, Forestry and Fishing"
Duplicate descriptions are combined for WDI. [This has been updated based on comments below.]
- WDI: # old WDI ontology
  - 2005_PPP_conversion_factor:
    - description:
      - "GDP (LCU per international $)"
  - 2005_PPP_conversion_factor:
    - description:
      - "private consumption (LCU per international $)"
  - Access_to_clean_fuels_and_technologies_for_cooking__(%_of_population):
    - description:
      - "Access to clean fuels and technologies for cooking is the proportion of total population primarily using clean cooking fuels and technologies for cooking. Under WHO guidelines"
- WDI: # new WDI ontology
  - OntologyNode:
    name: 2005_PPP_conversion_factor
    descriptions:
      - "GDP (LCU per international $)"
      - "private consumption (LCU per international $)"
  - OntologyNode:
    name: Access_to_clean_fuels_and_technologies_for_cooking__(%_of_population)
    descriptions:
      - "Access to clean fuels and technologies for cooking is the proportion of total population primarily using clean cooking fuels and technologies for cooking. Under WHO guidelines"
What I had in mind was something closer to Marco's format. For example, the FAO one would look like:
What do you think? Note the explicit OntologyNode. Also, fields not present are simply not listed in the YAML.
Oh, I thought OntologyNode was a general description of what is "positive_nature_impact:" in one place and "negative_nature_impact:" in the next. I'm very good at misunderstanding. name: is new and is the last part (terminal) of the path. I like explicit. IDs are apparently gone; I was assuming that we would generate them, because they are probably not in the source data.
I like explicit too.
"id" is gone for now. But might be populated in the future, somehow...
@ZhengTang1120, are we inserting the _ in these things or is it in the source data?
nature_impact
positive_nature_impact
Access_to_clean_fuels_and_technologies_for_cooking__(%_of_population)
These are in the source data. I believe they follow the GSN naming convention.
Should we be GSNifying (Game Show Network, I assume) the FAO data which does not appear to follow that convention? Thanks.
We should. This was assigned to Ajay and Zheng. But it’s not the highest priority. At least for the next week or so.
We should generate the GSN format. I don't think either ontology is currently in the GSN format; it's much more specific than simply underscores.
@MihaiSurdeanu @kwalcock Actually, the '_' was added by us to replace the whitespace in the names.
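In other words, roughly (a sketch of the idea, not the actual code):

// Sketch of the idea, not the actual code: replace each space with an underscore
// (consistent with the double underscore seen in the cooking indicator name above).
def underscoreName(name: String): String =
  name.replace(' ', '_')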
Ok. Then the GSN formatting still needs to happen. Nevertheless, let’s go with these names for now.
I updated the ontologies; please let me know if there are bugs in them.
I checked the WDI ontology again. I found that there are commas in the name field, so the duplicates we got are due to a bug in my code. Sorry about that.
If there is just a single description, then it should probably look like:
description: "GDP (LCU per international $)"
I'm adding a couple more comments to your commit.
At some point it was decided that "toy" was not a professional enough name for an ontology. If we need to keep this one at all, can we think of a better name? Can we use "ex", as in "example" or meaning "former"?
@kwalcock -- there's a part of me that likes the idea of keeping the older one around, but I don't think that it's a good idea (perhaps just something akin to misplaced nostalgia?). At the very least it adds confusion (just how many ontologies are UA ppl grounding to?), so we should either move or remove it.... @ZhengTang1120, can you do whatever @kwalcock tells you to do with this file (ontology.yml) -- unless @MihaiSurdeanu wants to weigh in here?
I agree with @bsharpataz: we should keep just the latest version of each ontology. We can always go back in time on github to find the older ones.
It sounds like the toy ontology should go. I will also remove the code and configuration settings related to it. @ZhengTang1120 doesn't need to update it then, either.
What we discussed this morning probably fits here. @ZhengTang1120, can you please make sure that duplicates are handled correctly in the WDI ontology? For example, there are 4 nodes named Adjusted_net_savings, even though each has a different description. This may be a bug: if the original names are indeed different, please fix this and list them as different nodes. If the names are the same in the original files, please merge them into a single node and include all descriptions in the merged node.
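Roughly, the merge should behave like this (a sketch with placeholder types, not the actual loading code):

// Sketch of the intended merge: group nodes by name and keep all of their descriptions.
case class OntologyNode(name: String, descriptions: Seq[String])

def mergeDuplicates(nodes: Seq[OntologyNode]): Seq[OntologyNode] =
  nodes.groupBy(_.name).values.toSeq.map { dups =>
    OntologyNode(dups.head.name, dups.flatMap(_.descriptions).distinct)
  }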
@kwalcock should be in the loop on this one.
It looks like the duplicates mentioned just above have been fixed in the zheng branch, which is where we've been working. I think a comma in the input complicated something, but it looks better now. I should have realized this possibility earlier in the conversation yesterday.
For the FAO and WDI ontologies, which have descriptions, we are filtering 1272 and 1586 sentences respectively through the processor's annotator. FAO's short sentences take ~1 minute, but WDI takes around 6.5 minutes. The result is the same every time, and it seems unnecessary to do this each time the program is called (with the applicable settings, which are normally off). I wonder whether this filtering should happen during ontology construction instead. We only need the good words in the file. If necessary, there could be an optional field for the pre-processed descriptions: if it isn't there, the descriptions can be filtered, but if it is, they are used as is.
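Something like this is what I have in mind (a sketch only; the field name is hypothetical):

// Sketch only: prefer precomputed (already filtered) descriptions when the file provides them,
// otherwise filter the raw descriptions while the ontology is being built.
// "filteredDescriptions" is a hypothetical field name, not something in the current files.
case class Entry(descriptions: Seq[String], filteredDescriptions: Option[Seq[String]])

def descriptionsToGround(entry: Entry, filter: String => String): Seq[String] =
  entry.filteredDescriptions.getOrElse(entry.descriptions.map(filter))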
If we add preprocessed deps (perhaps optionally), I'd like to suggest that we do fairly minimal editing before or after and just store the whole Document. Alternatively, you can try using a mkPartialAnnotation method. We sometimes use those to speed things up when we don't need everything (like coreference resolution, etc.). If you want to go this route, I can point out an example, or you can grep for mkPartialAnnotation in the code base.
I agree filtering should happen offline. And if this happens offline, we don't need to store annotations, no? Am I missing something?
I will check out mkPartialAnnotation. It may make a more extensive change less urgent.
@bsharpataz, I don't see a mkPartialAnnotation anywhere in eidos or processors. Can I be grepping wrong?
Nah, I wasn't sure where they all were. I think one may still exist in the meanteacher branch, in an app called DumpPaths. They exist in the sia repo, and probably in my subproject (qaj) of the research repo.
But they basically follow processors README/docs: https://github.com/clulab/processors/blob/master/README.md#using-individual-annotators
I’m pretty sure we only need up through parse(), nothing past like discourse or semantic roles.
here's one (forgive the weird indentation):

def mkPartialAnnotation(text: String): Document = {
  val doc = procFast.mkDocument(text)
  procFast.tagPartsOfSpeech(doc)
  procFast.lemmatize(doc)
  procFast.parse(doc)
  doc.clear()
  doc
}
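For the descriptions, usage would presumably be something like this (a sketch, not code from the repo):

// Usage sketch: partially annotate one description and pull out its POS tags.
val doc = mkPartialAnnotation("GDP (LCU per international $)")
val tags = doc.sentences.flatMap(_.tags.getOrElse(Array.empty[String]))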
This is extending PR #311. I decided to make it an issue, even though it may be difficult to find.