freme-project / e-Entity

Apache License 2.0

support for external taxonomies #46

Closed m1ci closed 8 years ago

m1ci commented 8 years ago

This issue summarizes the efforts towards support for domain-specific scenarios in e-Entity. Below is the initial email conversation provided by @koidl on Sep 4th 2015:

Wripl connect is a tool for financial analysts to identify key trends and developments in large open web content spaces.

The tool offers topic related 'visual cards' (See attachment). Each card representing a topic has specific terms related to it. Example:

Investment Funds -> Closed-End Funds -> Exchange Traded Funds -> Hedge Funds -> Mutual Funds [...]

Problem: A mismatch between the vocabulary used by the domain expert (in this case a financial analyst) and the vocabulary used in FREME NER or e-Terminology.

The two main challenges:

1) Filtering: only allow terms that appear in both the domain expert's vocabulary and the FREME vocabulary (both NER and terminology).

2) Matching: match the FREME NER and e-Terminology vocabulary to the domain expert's vocabulary (for example, assign all entities that relate to 'Investment Fund' to the label 'Investment Fund').
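The difference between the two challenges can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the email: the entity records (dicts with `text` and `uri`), the `DOMAIN_VOCAB` mapping, and the function names are all hypothetical, and the real FREME NER output format may look different.

```python
from typing import Dict, List, Optional, Set

# Hypothetical domain vocabulary: expert label -> Wikipedia pages counted under it.
DOMAIN_VOCAB: Dict[str, Set[str]] = {
    "Investment Fund": {
        "https://en.wikipedia.org/wiki/Hedge_fund",
        "https://en.wikipedia.org/wiki/Mutual_fund",
        "https://en.wikipedia.org/wiki/Exchange-traded_fund",
    },
}


def match_label(uri: str, vocab: Dict[str, Set[str]]) -> Optional[str]:
    """Return the expert label whose page set contains the URI, if any."""
    for label, pages in vocab.items():
        if uri in pages:
            return label
    return None


def filter_entities(entities: List[dict], vocab: Dict[str, Set[str]]) -> List[dict]:
    """Challenge 1: keep only entities that are covered by the expert vocabulary."""
    return [e for e in entities if match_label(e["uri"], vocab) is not None]


def match_entities(entities: List[dict], vocab: Dict[str, Set[str]]) -> List[dict]:
    """Challenge 2: keep every entity and attach the expert label (or None)."""
    return [{**e, "domain_label": match_label(e["uri"], vocab)} for e in entities]
```

Filtering drops everything outside the vocabulary, which is why it can produce the very small result sets mentioned in the email, while matching keeps all entities and only annotates them.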

Solution: Currently the main idea...

1) Filtering should be simple but might yield a very small result set, limiting the interesting (edge-case) content findings.

2) WRIPL provides a list of terms (that are based on feedback with the domain expert) and for each term a wikipedia page link.

This list is provided to NER and e-Terminology which then use the wikipedia pages to train the matching mechanism.

Timeline and road map: To be discussed! We need to develop this road map by mid September to ensure all resources (including wripl) can be moved in the same direction.

Keep in mind that this is a real scenario. We have a financial domain expert within wripl and we are actively reaching out to the industry.

Also once this works for the financial domain we can test the application of the same approach to other domains (e.g. Healthcare which IBM Watson is apparently focusing on).

Multilinguality is also part of this, but let's get English working first.

m1ci commented 8 years ago

one more comment from @koidl

Looks like 'http://www.wandinc.com/' is going to give wripl a 90-day evaluation license, which would allow us to pull the labels and add wikipages (and some more if useful) to each label for training.

What do you think? Will that help?

koidl commented 8 years ago

I have some movement back from WAND, who have custom-built taxonomies. They are happy to give us access to the finance taxonomy for 30 days and for evaluation purposes only.

Before executing this however I think the following needs to be tested/addressed:

1) Create a flat list of labels (no more than 50) with assigned wikipedia links (e.g. Hedge Fund, European Central Bank etc.).

2) How do we deal with synonyms, e.g. ECB = European Central Bank?

3) Once this works, extend the WAND taxonomy with links (a lot of work but worth it). It comes as SKOS.

"The WAND Finance and Investment Taxonomy covers all the major topics and concepts of the financial industry. 1526 categories and 1006 synonyms cover topics relating to asset types, financial intermediaries, financial crimes, regulations, benchmarks, analysis tools, and markets and exchanges."

4) It would be interesting to find out whether we could re-engineer the taxonomy just by using Wikipedia. It would have to be done in a way that, if it works for finance, it also works for healthcare etc. (no worries, this is not the focus now, it just seems very exciting)
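One possible way to handle the synonym question in point 2 is to store synonyms next to each label in the flat list and resolve every surface form to the same entry. The sketch below is an assumption, not an agreed design: `LabelEntry`, `build_lookup` and `resolve` are hypothetical names, and the two entries are just the examples mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LabelEntry:
    label: str                      # preferred label used by the domain expert
    wikipedia: str                  # wikipedia page assigned to the label
    synonyms: List[str] = field(default_factory=list)


# Hypothetical flat list, limited to the examples from the comment.
ENTRIES = [
    LabelEntry(
        "European Central Bank",
        "https://en.wikipedia.org/wiki/European_Central_Bank",
        synonyms=["ECB"],
    ),
    LabelEntry("Hedge Fund", "https://en.wikipedia.org/wiki/Hedge_fund"),
]


def build_lookup(entries: List[LabelEntry]) -> Dict[str, LabelEntry]:
    """Index every label and synonym (case-insensitively) to its entry."""
    lookup: Dict[str, LabelEntry] = {}
    for entry in entries:
        for surface in [entry.label, *entry.synonyms]:
            lookup[surface.casefold()] = entry
    return lookup


def resolve(surface: str, lookup: Dict[str, LabelEntry]) -> Optional[LabelEntry]:
    """Map a surface form such as 'ECB' to its entry, or None if unknown."""
    return lookup.get(surface.strip().casefold())
```

Case-insensitive lookup is convenient for labels but risky for acronyms, since a short form like "ECB" can be ambiguous across domains, so a real implementation would probably need some context check.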

m1ci commented 8 years ago

nice, quick question (I'll get back later to the other points): what are the taxonomy entries? terms, entities, a mixture of both? Do the taxonomy entries have a corresponding Wikipedia article (if they are entities)? or maybe they can be found in the list of Wikipedia categories? or in the DBpedia ontology? or... somewhere else...

koidl commented 8 years ago

@m1ci I think it's just labels in SKOS, therefore a hierarchy. They mention synonyms but I'm not sure how they are encoded.

What do you think, shall I execute the eval license so we get access to it?

I am almost sure these are all hand made and therefore a mixture of terms and entities. Also there will be no related pages.

We can be smart here though. Either I point a crawler at wikipedia to pull out matches on the labels or I just do it by hand which is fine for now.

Let me know what you think. I am ready on this side to get it from them and start working on it if you think we should go for it now.

koidl commented 8 years ago

It might be good to devise a battle plan here. My idea at the moment is the following:

  1. wripl provides a list of labels with associated wikipedia pages
  2. wripl provides data from the financial domain
  3. FREME uses both to return the association between labels and content (1 and 2)
  4. FREME includes references to the entities (in case the labels can be identified as entities and not categories).
  5. FREME includes a confidence value to the matching of 1 and 2
  6. Testing....
  7. wripl executes the WAND 3 month evaluation license for the finance domain taxonomy
  8. wripl adds wikipages to each label in the taxonomy
  9. wripl provides content from the financial domain (same as 2.)
  10. FREME executes 4,5
  11. Testing
  12. Approach is added to the FREME API to enable user created taxonomy matching
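
To make steps 3 to 5 more concrete, here is a rough sketch of what a returned association could look like. Everything in it is an assumption for discussion: the `Association` record, the `entity_uri` field and especially the confidence score (the share of a label's mentions among all label mentions in the text) are placeholders, not how FREME computes anything.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Association:
    label: str
    wikipedia: str
    entity_uri: Optional[str]  # step 4: only set when the label is an entity
    confidence: float          # step 5: placeholder score in [0, 1]


def count_mentions(label: str, text: str) -> int:
    """Count case-insensitive whole-word mentions of a label in the text."""
    pattern = r"\b" + re.escape(label) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def associate(labels: Dict[str, dict], text: str) -> List[Association]:
    """Step 3: link labels (from 1) to content (from 2), best match first."""
    counts = {label: count_mentions(label, text) for label in labels}
    total = sum(counts.values())
    results = []
    for label, info in labels.items():
        if counts[label] == 0:
            continue
        results.append(
            Association(
                label=label,
                wikipedia=info["wikipedia"],
                entity_uri=info.get("entity_uri"),
                confidence=counts[label] / total,
            )
        )
    return sorted(results, key=lambda a: a.confidence, reverse=True)
```

Exact string matching like this misses plurals and paraphrases, so it should be read as a description of the output shape rather than of the matching itself.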

Does this make sense? What timeline are we potentially looking at here?

koidl commented 8 years ago

In relation to the test data of the chemical domain I was thinking we could do a simple test.

We use the content we were testing the categories on; however, this time we use a list to match it to. The list I was thinking about is https://en.wikipedia.org/wiki/List_of_CAS_numbers_by_chemical_compound

It is a bit like a flat list (not exactly a taxonomy) but it would get us started especially because each chemical compound has a page.

The result would then be something like.

Input text: Each description page in the test corpus

Output: A list of chemical substances that are associated (mentioned/identified) with the input text and based on the CAS list above.

Would that make sense?
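
As a rough sketch of that input/output contract, assuming a tiny hand-copied excerpt of the CAS list and a hypothetical `find_substances` helper (a real test would load the full list and go through FREME NER rather than plain string matching):

```python
import re
from typing import Dict, List, Tuple

# Small hand-picked excerpt; the full list lives on the Wikipedia page above.
CAS_LIST: Dict[str, str] = {
    "acetone": "67-64-1",
    "ethanol": "64-17-5",
    "benzene": "71-43-2",
}


def find_substances(text: str, cas_list: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (name, CAS number) pairs mentioned in the text by name or number."""
    found = []
    for name, cas in cas_list.items():
        name_hit = re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE)
        # Guard against matching inside a longer digit sequence.
        cas_hit = re.search(r"(?<!\d)" + re.escape(cas) + r"(?!\d)", text)
        if name_hit or cas_hit:
            found.append((name, cas))
    return found
```

Called on one description page at a time, this returns the substances from the list that the page mentions, in list order.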

It's more or less the same as what we will be doing with the finance domain or any other domain (e.g. by using the WAND taxonomy http://www.wandinc.com/wand-finance-and-investment-taxonomy.aspx)

Instead of me manually linking the labels of this taxonomy with wikipedia pages, we can start testing this approach with the existing CAS list.

It's basically points 1 and 2 in the list of my previous post. I will also check with TILDE what they think about this, just to keep things aligned.

koidl commented 8 years ago

@m1ci does my comment make sense? Would this be a good way to start working towards domain-specific FREME NER services? We can also do this with the finance content: I would draft a list of labels with wikipedia links, which is more or less the same as above. If we use the chemical domain first, we can test with finance afterwards. Keeping in mind that I will only get a 3 month eval license for WAND's finance taxonomy, this order might make more sense.

jnehring commented 8 years ago

This issue is outdated. We already have a feature for domain-specific NER now.