Closed m1ci closed 9 years ago
@m1ci regarding 2) - Since we will be adding scoring and it'll be part of the NIF output, I guess it makes sense to have the addition of topics on the freme-ner side. But using SPARQL (when we integrate the feature into the API) will be slow and add complexity. We can use something really simple like a map of categories here, or do a fast key-value read from our sqlite3 database.
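A minimal sketch of the key-value option mentioned above, assuming a hypothetical `entity_category` table in sqlite3. The table name, column names, and helper functions are illustrative and are not the actual freme-ner schema. It uses an in-memory database, with the W3C categories quoted just below as sample rows:

```python
import sqlite3

W3C = "http://dbpedia.org/resource/World_Wide_Web_Consortium"

# Sample rows taken from the W3C example in this thread.
SAMPLE_ROWS = [
    (W3C, "dbc:Consortia"),
    (W3C, "dbc:Web_development"),
    (W3C, "dbc:Organizations_established_in_1994"),
]

def build_category_store(rows):
    """Create an in-memory entity -> category table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity_category (entity TEXT NOT NULL, category TEXT NOT NULL)"
    )
    # The index turns the per-entity lookup into a cheap key-value style read.
    conn.execute("CREATE INDEX idx_entity ON entity_category (entity)")
    conn.executemany("INSERT INTO entity_category VALUES (?, ?)", rows)
    conn.commit()
    return conn

def categories_for(conn, entity_uri):
    """Return the categories stored for one entity URI, sorted by name."""
    cursor = conn.execute(
        "SELECT category FROM entity_category WHERE entity = ? ORDER BY category",
        (entity_uri,),
    )
    return [row[0] for row in cursor]

conn = build_category_store(SAMPLE_ROWS)
print(categories_for(conn, W3C))
```

An unknown entity simply yields an empty list, so the caller does not need a separate existence check.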
We will need the returned topics also as a plain text label. In the example below we have the underscores... let me know how best to do this? [...]
http://dbpedia.org/resource/World_Wide_Web_Consortium dcterms:subject dbc:Consortia , dbc:Web_development , dbc:Organizations_established_in_1994 ,
[...]
@koidl thanks for this reminder, yes we will include the topic URI and its label. E.g.,
<http://dbpedia.org/resource/World_Wide_Web_Consortium> dcterms:subject dbc:Consortia
dbc:Consortia rdfs:label "Consortia"
Excerpts from the email conversation:
Mail from Milan, Sep 8th 2015
Hi Kevin,
Attached are the results from processing a few documents.
Can you please have a look whether the output containing enrichments is OK, i.e. readable enough for you and your customers? If it is OK, I'm going to process all the data you sent to us.
The output is the same as we agreed in Turin: doc_id and doc_text, followed by the entity mentions in that document. For each entity we provide:
* surface form, entity link, entity type, list of entity topics.
Each entity is included in the output only once per document.
Response from Kevin, Sep 8th 2015
Hey Milan
Thanks for this - looks great but not sure if I get it yet....
First line is the text:
"This research deliverable presents an overview of the U.S. heart health ingredients market. The investment highlights depicts the market definitions and the key takeaways. In the overview of the U.S. heart health ingredients, [...]"
This is followed by four values in each row:
A)United States
B)http://dbpedia.org/resource/United_States
C)Location
D)"1776 establishments in the United States,United States,Former British colonies,English-speaking countries and territories,Member states of the United Nations,Member states of NATO,G20 nations,Superpowers,States and territories established in 1776,Liberal democracies,Former confederations,Federal constitutional republics,G7 nations,Republics,G8 nations"
So
A = Entity Label (surface form)
B = Entity Link
C = Entity Type
D = A List of Topics for the Entity Found (In this case "United States")
That's a lot of labels and topics now.... it's blowing up more instead of narrowing, but that is fine for now. Just need to make sure I got it right?
The strategy to narrow this list would be:
1. Provide a list of 'customer specific labels (aka domain specific taxonomy)' and simply check if the customer labels match with either a surface form or topic. The ones that don't match end up in a list of unmatched labels. This may turn out to be far too narrow, however, therefore 2)
2. Extend the customer/domain specific taxonomy with wikipedia links for each label so that we can train the allocation of surface forms and topics even if they don't match the label 100% as they would have to in 1)
Let me know if this makes sense
Response from Milan on Kevin's email, Sep 8th 2015:
Yes, you got it right.
= domain specific taxonomy
We will need to map the domain specific taxonomy to DBpedia Wikipedia categories/topics. Let's first focus on filtering out "non-important" topics.
= narrowing down the list of topics
We can simply compute the "informativeness" of each topic. The assumption is that more frequent topics are less informative than the rare topics. Then, we can use this information to filter out non-informative topics associated with entities in your documents. We can compute these "informativeness" values directly from DBpedia.
What do you think?
... Kevin, the categories are also structured in a form of taxonomy. The taxonomy is also part of DBpedia and we can re-use it. See the "skos:broader" info for the Consortia topic.
Response from Kevin, Sep 8th 2015:
= domain specific taxonomy We will need to map the domain specific taxonomy to DBpedia Wikipedia categories/topics. Let's first focus on filtering out "non-important" topics.
Agreed.
= narrowing down the list of topics We can simply compute the "informativeness" of each topic. The assumption is that more frequent topics are less informative than the rare topics. Then, we can use this information to filter out non-informative topics associated with entities in your documents. We can compute these "informativeness" values directly from DBpedia.
I would need to see examples. Yes we really need to narrow. This will be nice for the general purpose play which we decided on solving first. I am a little bit worried though that we can't narrow the entities, which seem to be too many too. But let's take it step by step, maybe via the categories we can narrow the entities? Again the openCalais Social Tags (which are based on Wikipedia categories apparently) somehow manage to do this really nicely (not more than 3-5 terms which mostly make sense). Looking forward to seeing if our strategy will work.
For the domain specific the narrowing (I think but could be wrong) will need smart mapping (not only strong matching but actual learning) due to many categories being too broad or maybe not even existing.
What do you think?
Exciting stuff - lets do it!
... the conversation will continue in this thread.
I have an idea on how to narrow down the list of topics. Actually we are not interested in the categories of individual entities, but in the topics of the whole text. So we aggregate the list of entity categories over the whole document and create a table like this (it contains sample data from an imaginary document about a finance company from New York):
Category | Number of occurrences | Percentage of occurrences (number of occurrences divided by total number of category occurrences) |
---|---|---|
Finance | 20 | 46.5 |
Business in New York | 17 | 39.5 |
Companies established in 1961 | 3 | 7 |
English-speaking countries and territories | 2 | 4.7 |
Mercury Prize-winning albums | 1 | 2.3 |
Now we can filter for only the most important categories, e.g. categories whose share is above 20%. Or we just take the top 3 topics. In that way we extract topics for the whole document and filter out some noise.
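The aggregation described above can be sketched roughly as follows. This is only an illustration, assuming each spotted entity comes with a plain list of category labels; the function name `document_topics` and its parameters are made up for the example:

```python
from collections import Counter

def document_topics(entity_categories, min_share=0.20, top_k=None):
    """Aggregate per-entity categories into document-level topics.

    entity_categories: one list of category labels per spotted entity.
    Returns (category, count, share) tuples, most frequent first.
    """
    counts = Counter(cat for cats in entity_categories for cat in cats)
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = [(cat, n, n / total) for cat, n in counts.most_common()]
    if top_k is not None:
        # "Just take the top 3 topics" variant.
        return ranked[:top_k]
    # Threshold variant: keep categories whose share exceeds min_share.
    return [row for row in ranked if row[2] > min_share]

# Rebuild the sample table: each inner list stands for one entity's categories.
sample = (
    [["Finance"]] * 20
    + [["Business in New York"]] * 17
    + [["Companies established in 1961"]] * 3
    + [["English-speaking countries and territories"]] * 2
    + [["Mercury Prize-winning albums"]] * 1
)

for category, count, share in document_topics(sample):
    print(f"{category}: {count} ({share:.1%})")
```

With the 20% threshold only `Finance` and `Business in New York` survive; passing `top_k=3` additionally keeps `Companies established in 1961`.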
Looks nice
Whats the status though - just need to know how, when, where :-)
Are we starting a separate thread for the 'domain specific' challenge?
Looks like 'http://www.wandinc.com/' is going to give wripl a 90 days evaluation license which would allow us to pull the labels and add wikipages (and some more if useful) to each label for training.
What do you think? Will that help?
@jnehring: thanks for the idea. We were actually thinking of such ranking of the categories.
@koidl:
Whats the status though - just need to know how, when, where :-)
We have already created counts datasets for the DBpedia categories. We computed the number of entities assigned with the DBpedia categories. Next, we will do the following:
BTW, I'm not sure whether we should do the filtering of categories on the e-Entity side or whether this processing should be done on the consumer side (Wripl or any other). Please keep in mind that e-Entity is an entity recognition service. It does entity recognition, and possibly ranking of the entities and entity-related information. As long as the developments directly concern the process of entity recognition (spotting, linking, classification, entity ranking), we can provide support. Additional processing on top of this data, including how the data is aggregated and consumed, is out of the scope of e-Entity.
Are we starting a separate thread for the 'domain specific' challenge?
Yes, I've created one https://github.com/freme-project/e-Entity/issues/46
Lets keep this issue only for discussion on the assignment of topics to entities in general.
There is one more issue on "scoring mechanism for topics" https://github.com/freme-project/e-Entity/issues/45
@koidl can you please have a look at the output below and check whether it is OK? After the section with the list of entities I added a section with the top-10 most informative categories. The list of all assigned categories was ranked according to their informativeness: less probable categories are considered to be more specific, and consequently more informative, than the more common ones.
1 "This study covers the state of the North American positive displacement pumps market, examining drivers and restraints for growth, pricing, distribution, technology, demand, and end-user trends. Market growth for regional and market segments is forecasted. In addition, an in-depth analysis of the competitive situation including market participant's market shares is performed. The base year is 2012 with forecasts running through 2019. The market is further divided into three sub-segments including rotary positive displacement pumps, reciprocating positive displacement pumps, and peristaltic positive displacement pumps. A detailed analysis of each of the sub-segments is included.Key Questions This Study Will Answer- Is the market growing? - How long will it continue to grow and at what rate? - Are the existing competitors structured correctly to meet customer needs? - Is this an industry or a market? - Will these companies/products/services continue to exist, or will they be acquired by other companies? - Will the products/services become features in other markets? - How will the structure of the market change over time? Is it ripe for acquisitions? - Are the products/services offered today meeting customer needs, or is additional development needed?"

Surface form | Entity link | Entity type | Categories |
---|---|---|---|
North American | http://dbpedia.org/resource/North_America | N/A type | World Digital Library related,Regions of the Americas,Continents,North America |
Will | http://dbpedia.org/resource/Will_and_testament | Person | Wills and trusts,Inheritance,Death customs,Common law |
Questions This Study Will Answer | N/A link | N/A type | N/A categories |

Category | Informativeness |
---|---|
North America | 17.82127572776396 |
Regions of the Americas | 15.94680660984782 |
Continents | 15.406238228485117 |
Inheritance | 12.535873508901712 |
Wills and trusts | 12.266686876086323 |
Common law | 11.971610000848393 |
Death customs | 11.624878514960459 |
World Digital Library related | 8.236939532692272 |
Milan looks good for this example. Did you test it with something more specific? For example the Wikipedia page of Michael Jackson or the about page of Trinity College Dublin. I just need to see if we are now running the risk of being too high level.
No, I didn't. Let me try, and I will post the results.
For the first 4 paragraphs from https://en.wikipedia.org/wiki/Michael_Jackson I get the following top-10 topics with the highest informativeness scores:
Topic | Informativeness |
---|---|
Marshals | 15.94680660984782 |
Music videos directed by Bob Giraldi | 15.94680660984782 |
Humanities occupations | 15.94680660984782 |
Recipients of Thiri Thudhamma Thingaha | 15.94680660984782 |
Michael Jackson concert tours | 15.821275727763961 |
People acquitted of sex crimes | 15.821275727763961 |
Grand Collars of the Order of Saint James of the Sword | 15.705798510344026 |
Grand Cordons of the Order of Independence (Tunisia) | 15.705798510344026 |
African-American male dancers | 15.598883306427513 |
Magazines established in 1917 | 15.598883306427513 |
Its much better however there are a good few things in there that seem strange e.g. 'Magazines established in 1917' or even 'Marshals'
Here the list that comes in from OpenCalais for the same content:
It seems immediately more accurate and more useful. We dont need to be as good as OpenCalais (although I always silently hoped we would be better) however if 'Recipients of Thiri Thudhamma Thingaha' (https://en.wikipedia.org/wiki/Thiri_Thudhamma_Thingaha) comes up in relation to Michael Jackson I might have a small problem.
Also, I am fully aware that the OpenCalais terms are not entities.
However the entities in OpenCalais look a lot more spot on too:
What do you think? Should we test this with some more content?
Here the link to where the opencalais images are coming from:
Its much better however there are a good few things in there that seem strange e.g. 'Magazines established in 1917' or
'Magazines established in 1917' is included because the entity Forbes was spotted in the text, and it has this category assigned.
even 'Marshals'
This category occurs because the entity Tito (http://dbpedia.org/resource/Josip_Broz_Tito) was spotted (incorrectly) in the text, and it has the Marshals category assigned.
however if 'Recipients of Thiri Thudhamma Thingaha' (https://en.wikipedia.org/wiki/Thiri_Thudhamma_Thingaha) comes up in relation to Michael Jackson I might have a small problem.
'Recipients of Thiri Thudhamma Thingaha' is a topic assigned to the spotted entity Tito.
However the entities in OpenCalais look a lot more spot on too:
Here are the FREME NER spotted entities, just for comparison:
Surface form | Entity link | Entity type |
---|---|---|
"Artist | http://dbpedia.org/resource/Artist | Organization |
Jackson 5 | http://dbpedia.org/resource/Jackson_5 | Organization |
Conrad Murray | http://dbpedia.org/resource/Conrad_Murray | Person |
Bad | http://dbpedia.org/resource/Bad_(album) | N/A type |
Jackson | http://dbpedia.org/resource/Jackson%252C_Mississippi | Person |
Grammy Lifetime Achievement Award | http://dbpedia.org/resource/Grammy_Lifetime_Achievement_Award | N/A type |
"Thriller" | http://dbpedia.org/resource/Thriller_(genre) | Person |
"Scream | http://dbpedia.org/resource/Scream_(1996_film) | Organization |
Billboard Hot 100 | http://dbpedia.org/resource/Billboard_Hot_100 | Organization |
Grammy Awards | http://dbpedia.org/resource/Grammy_Award | N/A type |
Michael Jackson | http://dbpedia.org/resource/Michael_Jackson | Person |
Thriller | http://dbpedia.org/resource/Thriller_(genre) | N/A type |
"Artist of the Century | N/A link | N/A type |
Jackie | http://dbpedia.org/resource/Jackie_Jackson | Person |
Forbes | http://dbpedia.org/resource/Forbes | Organization |
Joseph Jackson | http://dbpedia.org/resource/Joe_Jackson_(manager) | Person |
Jermaine | http://dbpedia.org/resource/Jermaine_Jackson | Location |
Los Angeles County Coroner | http://dbpedia.org/resource/Los_Angeles_County_Department_of_Medical_Examiner-Coroner | Location |
American | http://dbpedia.org/resource/United_States | N/A type |
"Black | http://dbpedia.org/resource/Race_and_ethnicity_in_the_United_States_Census | Person |
Songwriters Hall of Fame | http://dbpedia.org/resource/Songwriters_Hall_of_Fame | Organization |
Billie Jean | http://dbpedia.org/resource/Billie_Jean | Person |
Dangerous | http://dbpedia.org/resource/Dangerous_(Michael_Jackson_album) | N/A type |
Grammy Legend Award | http://dbpedia.org/resource/Grammy_Legend_Award | N/A type |
Off the Wall | http://dbpedia.org/resource/Off_the_Wall_(album) | N/A type |
American Music Awards | http://dbpedia.org/resource/American_Music_Award | N/A type |
White | http://dbpedia.org/resource/Race_and_ethnicity_in_the_United_States_Census | Person |
Marlon | http://dbpedia.org/resource/Marlon_Dingle | Person |
HIStory | http://dbpedia.org/resource/History | N/A type |
This Is It | http://dbpedia.org/resource/This_Is_It_(concerts) | N/A type |
MTV | http://dbpedia.org/resource/MTV | Organization |
"Love Never Felt So Good" | N/A link | N/A type |
Jackson family | http://dbpedia.org/resource/Jackson_family | Person |
Hot 100 | http://dbpedia.org/resource/Billboard_Hot_100 | Organization |
Dance Hall of Fame | http://dbpedia.org/resource/Tap_Dance_Hall_of_Fame | Organization |
Tito | http://dbpedia.org/resource/Josip_Broz_Tito | Person |
Yes, there are mistakes, but there is also nonsense in the results of OpenCalais. Examples of wrongly spotted entities: promotional tool, artist, dancer, first artist, etc.
What do you think? Should we test this with some more content?
We can, just let us know. Should I process the documents from the chemical domain?
Note that I sent only the top-10 most informative topics assigned to the entities occurring in the document. The list is much longer. We can include, 20, 30...
Currently, all topics are collected and ranked, and the top-10 is returned. We might try with entity types instead of topics, e.g. types from the DBpedia ontology, Wikidata, YAGO, UMBEL, schema.org.... or a combination of all of them. The DBpedia ontology has 735 entity types, while YAGO, for example, has over 350K types. FYI, there are over 900K DBpedia topics (that we are using now).
Lets see what are your thoughts on this.
Interesting - thanks for this. Helps me to understand this more. Some mistakes are fine. Absolutely fine.
Few questions/comments (feel free to respond inline):
1) FREME NER results. Are those all or just a subset above?
2) In your example are you using FREME NER or dbpedia spotlight?
3) The topic above with the percentage. What does that value mean e.g. 15.6?
4) Please try it with the chemical data so we can see how it works (even with a small subset of it)
5) Tell me more about the types you mention. Can we test and compare or is it too much work to investigate.
6) possibly related to 5 - For the chemical domain data for example the relationship to the CAS list is useful. For example does the content relate to https://en.wikipedia.org/wiki/List_of_CAS_numbers_by_chemical_compound#B. I am not sure though if these are topics or entities or just a list of labels/links. This might also bring us back to the domain specific which is not the topic here at the moment I guess.
If okay with you lets keep it moving. My feeling is that we need to test different approaches with different content in a systematic way. For example 10 pages of people, 10 pages of organisations, 10 pages of 'general' topic and see how it works with the approach above compared with using entity types.
When using entity types would we get a score back too?
Hope above makes sense?
1) FREME NER results. Are those all or just a subset above?
all
2) In your example are you using FREME NER or dbpedia spotlight?
sure FREME NER.
3) The topic above with the percentage. What does that value mean e.g. 15.6?
It says how informative the category is. These values are computed from the number of entities that have this category assigned. Categories with fewer assigned entities are more informative than those with more assigned entities. See all the categories and their scores here http://rv2622.1blu.de/datasets/dbpedia-categories/dbpedia-categories-counts.ttl
4) Please try it with the chemical data so we can see how it works (even with a small subset of it)
OK, will process the chemical data.
5) Tell me more about the types you mention.
Entity types are attached using rdf:type. See, for example, all the types assigned to http://dbpedia.org/page/Berlin - search for rdf:type.
Can we test and compare or is it too much work to investigate.
Yes we can. Let me provide an example.
6) possibly related to 5 - For the chemical domain data for example the relationship to the CAS list is useful. For example does the content relate to https://en.wikipedia.org/wiki/List_of_CAS_numbers_by_chemical_compound#B.
Hm... let's see which entities FREME NER spots. Then, if 1) we know the domain (we do, it's chemical) and 2) we have a list of relevant entities (or topics) for this domain (we might generate such a list), we could, in a post-processing stage, keep only the entities covered by this domain-relevant list.
I am not sure though if these are topics or entities or just a list of labels/links.
I don't know how well chemical compounds will be recognized as entities. Let me process the data.
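The post-processing filter suggested above could look roughly like this. It is only a sketch: the domain list, the entity records, and the `example.org` URIs are placeholders (the topic names are borrowed from the chemical-domain results later in this thread), and `filter_by_domain` is a made-up helper, not part of FREME NER:

```python
def filter_by_domain(entities, domain_topics):
    """Keep only entities that have at least one topic in the domain list."""
    domain = set(domain_topics)
    return [e for e in entities if domain & set(e["topics"])]

# Hypothetical domain list; in practice it would be generated for the chemical domain.
CHEMICAL_TOPICS = {"Phosphates", "Iron compounds"}

# Placeholder entities; the URIs are illustrative, not actual FREME NER output.
entities = [
    {
        "resource": "http://example.org/entity/compound",
        "topics": ["Phosphates", "Iron compounds"],
    },
    {
        "resource": "http://example.org/entity/song",
        "topics": ["U2 songs", "Songs written by Bono"],
    },
]

kept = filter_by_domain(entities, CHEMICAL_TOPICS)
print([e["resource"] for e in kept])
```

Exact set membership is the simplest possible match; it would not cover label variants, which is the fuzzy-matching concern raised in the replies that follow.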
I think sometimes it might be hard to decide if you want high or low informativeness. I am thinking about the categories Sports with a low informativeness and Sports in St. Louis / Missouri with a high informativeness. In a general purpose topic extraction system, one would want a text to be labeled with topic Sport. In the sports domain we might be more interested in Sports in St. Louis / Missouri.
@m1ci sounds good - I like the idea of the post filter by the way. It might not be smart enough though. We need to consider Fuzzy Matching for example (in the finance domain in this case) - 'ETF' and 'Exchange Traded Fund' are the same and need to relate to the same label. But again I guess that's the domain specific challenge in which we add links to the taxonomy so that there is learning?
Let me know if I am confusing things ... also no problem if you want to set up a prio/task list for this to make sure we all stay on the same page
@jnehring yes that's the idea. However it would be nice for FREME NER to have a slider that allows the level of informativeness to be adjusted. Then the end user can decide if they want just Sport, just Sports in St. Louis / Missouri, or both.
I think sometimes it might be hard to decide if you want high or low informativeness. I am thinking about the categories Sports with a low informativeness and Sports in St. Louis / Missouri with a high informativeness. In a general purpose topic extraction system, one would want a text to be labeled with topic Sport. In the sports domain we might be more interested in Sports in St. Louis / Missouri.
Hm... makes sense, we might reverse the ranking - and include the top-10 non-informative categories :) However, more informative categories better describe the content of the document compared to less informative categories.
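One way to picture the "slider" idea is a simple band filter over the informativeness scores. This is a sketch under the assumption that the consumer already holds a topic-to-score mapping; `select_by_informativeness` and its `low`/`high` parameters are hypothetical names:

```python
def select_by_informativeness(scored_topics, low, high):
    """Return (topic, score) pairs with low <= score <= high, highest first."""
    selected = [(t, s) for t, s in scored_topics.items() if low <= s <= high]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# Scores copied from the pumps-market example earlier in this thread.
scores = {
    "North America": 17.82127572776396,
    "Continents": 15.406238228485117,
    "Common law": 11.971610000848393,
    "World Digital Library related": 8.236939532692272,
}

# A "slider" set to the middle band drops both the rarest and the broadest topic.
print(select_by_informativeness(scores, low=10.0, high=16.0))
```

Lowering `low` lets broader categories through, while raising `high` admits the most specific ones, so a general-purpose consumer and a domain-specific one can use the same output with different settings.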
We need to consider Fuzzy Matching for example (in the finance domain in this case) - 'ETF' and 'Exchange Traded Fund' are the same and need to relate to the same label.
This is a task for entity spotting and linking.
But again I guess thats the domains specific challenge in which we add links to the taxonomy
Yes
so that there is learning?
Learning? If we manage to map the taxonomy to types/categories, then no learning is needed.
@m1ci Sounds good. Ill let you do some testing. Ping me if you need anything. By the way I am talking to Andi over email at the moment to see how e-terminology can help too. I dont want to confuse things though therefore I wont pull this together just yet
Just checking in- whats the status and do we (wripl) need/can do anything to help?
processing the data from the chemical domain, will hand them over by the afternoon.
The data looks a lot better now - thanks for this
These are wikipedia categories right?
What next?
These are wikipedia categories right?
Yes.
What next?
Please check if the Wikipedia categories are OK for your "General Purpose Topic extraction" use case. If yes, then next week we integrate this as part of e-Entity: attach categories to the entities and the corresponding "informativeness" values. Sorting and filtering the top-K categories will be then on wripl side.
Sounds great - I'll get back to you early next week
thanks
Hi
We get some really nice ones such as:
Id: 11 Barnidipine Hcl- Barnidipine Hcl Market Research Report 2011
But then we get some that are mostly off (not all though):
Id: 12
* Songs about The Troubles (Northern Ireland)
* Music videos directed by Anton Corbijn
* Phosphates
* Iron compounds
* Song recordings produced by Flood (producer)
* Songs written by Adam Clayton
* Songs written by Larry Mullen, Jr.
* Songs written by The Edge
* Songs written by Bono
* U2 songs
Why is for example: Songs about The Troubles (Northern Ireland) coming up with a high confidence? Just that I understand more how it works.
However I suggest we use this and move it to dev on the API. We are getting some problems here with some Entities in general. For example 'NOT' which comes up a lot and makes little sense in the analytics. My hope is that by using the categories the labels in the dashboard analytics will make more sense too.
Let me know what you think. We can also test better once its in the API and we see what the data looks like over all active websites wripl is serving.
In relation to the domain specific I would assume that issues such as 'Songs about The Troubles (Northern Ireland)' would then go away due to FREME knowing that the content is in the domain 'Chemical' which has nothing to do with 'Politic' or 'Entertainment'?
When you are ready I will continue the conversation on the domain specific (especially how to use data such as the CAS list or custom taxonomies).
Good work by the way with the categories! Thanks
@m1ci just wondering what the status is regarding my last post: a) some small mis-spottings and b) when will the categories be available via the API and what will the return data structure look like?
Thanks kevin
Why is for example: Songs about The Troubles (Northern Ireland) coming up with a high confidence? Just that I understand more how it works.
I think I already explained how we score the topics: 1) we collect all topics associated with the entities occurring in the document, 2) we sort them and 3) we return the top-10 most informative. Topics which are assigned to fewer entities in DBpedia are considered more informative. Those which are assigned to more entities are considered less informative. To compute the informativeness values we use formulas from information theory. See https://en.wikipedia.org/wiki/Self-information
The dataset with the topics counts and topics informativeness can be downloaded from here http://rv2622.1blu.de/datasets/dbpedia-categories/dbpedia-categories-counts.ttl
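For readers who want to see the self-information idea in code, here is a small sketch. It assumes the probability of a category is its entity count divided by the sum of all counts, and it uses base-2 logarithms; the thread does not state which normalizer or log base was actually used, so treat both as assumptions. The counts below are invented:

```python
import math

def informativeness(category_counts):
    """Map each category to its self-information in bits: -log2(count / total)."""
    total = sum(category_counts.values())
    return {
        cat: -math.log2(n / total)
        for cat, n in category_counts.items()
        if n > 0  # a category with no entities has no defined score here
    }

# Invented counts, chosen so the arithmetic is easy to follow.
counts = {"rare_topic": 1, "medium_topic": 3, "common_topic": 4}
scores = informativeness(counts)
for cat in sorted(scores, key=scores.get, reverse=True):
    print(cat, round(scores[cat], 3))
```

With these counts the total is 8, so `rare_topic` gets -log2(1/8) = 3 bits and `common_topic` gets -log2(4/8) = 1 bit: the fewer entities a category has, the higher its score.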
b) when will the categories be available via the API and what will the return data structure look like?
From today we start with integrating the topics in e-Entity. Will keep you updated. The output you will receive will look like this:
<http://freme-project.eu/#char=0,3>
a nif:Word , nif:String , nif:Phrase , nif:RFC5147String ;
nif:anchorOf "W3C"^^xsd:string ;
nif:beginIndex "0"^^xsd:int ;
nif:endIndex "3"^^xsd:int ;
nif:referenceContext <http://freme-project.eu/#char=0,33> ;
itsrdf:taClassRef <http://nerd.eurecom.fr/ontology#Organization> ;
itsrdf:taIdentRef <http://dbpedia.org/resource/World_Wide_Web_Consortium> .
<http://dbpedia.org/resource/World_Wide_Web_Consortium> dcterms:subject dbc:Consortia ,
dbc:Web_development ,
dbc:Organizations_established_in_1994 ,
dbc:World_Wide_Web_Consortium ,
dbc:Standards_organizations ,
dbc:Web_services ,
dbc:International_nongovernmental_organizations .
<http://dbpedia.org/resource/Category:Consortia> rdfs:label "Consortia" ,
fr:info "15.598883306427513"^^<http://www.w3.org/2001/XMLSchema#double> .
On the client (wripl) side you'll need to 1) collect the topics, 2) sort them according to the informativeness values, and 3) pick the top-N - you can choose N yourself. OK?
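The three client-side steps can be sketched as below. The sketch assumes the NIF output has already been parsed (for example with an RDF library) into a plain mapping from entity URI to `{category label: informativeness}`; that intermediate structure and the function name `top_n_topics` are assumptions for illustration only:

```python
def top_n_topics(annotations, n=10):
    """Collect, sort, and truncate the topics attached to all entities.

    annotations: {entity_uri: {category_label: informativeness}}
    Returns the n most informative (label, score) pairs.
    """
    scores = {}
    # 1) collect: a category shared by several entities is kept once.
    for categories in annotations.values():
        for label, score in categories.items():
            scores[label] = score
    # 2) sort by score, highest first (label breaks ties deterministically).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # 3) pick the top-N.
    return ranked[:n]

# Values copied from the pumps-market example earlier in this thread.
annotations = {
    "http://dbpedia.org/resource/North_America": {
        "North America": 17.82127572776396,
        "Regions of the Americas": 15.94680660984782,
        "Continents": 15.406238228485117,
        "World Digital Library related": 8.236939532692272,
    },
    "http://dbpedia.org/resource/Will_and_testament": {
        "Inheritance": 12.535873508901712,
        "Wills and trusts": 12.266686876086323,
        "Common law": 11.971610000848393,
        "Death customs": 11.624878514960459,
    },
}

for label, score in top_n_topics(annotations, n=3):
    print(label, score)
```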
a) some small mis-spottings
If you refer to
For example 'NOT'
It's hard to investigate these problems without the source text. I'm sure this is because you are sending "ugly" text for processing by FREME NER. Also, I have the feeling that these strings are not part of regular sentences. However, it is hard to say without the source text.
@m1ci thanks for this.
Sounds all good from here. We will investigate the data issue again, however (not to annoy you) we never had any of these issues with OpenCalais, therefore we have to investigate deeper. Also in relation to the spotting of 'NOT', for example, I will dig out some examples to keep us moving.
The new dashboard is received very well here at the conference therefore all good so far.
Not to put pressure on, but if you have (even a tentative) timeline for the API integration I can start allocating resources accordingly.
@m1ci hey milan quick question. Do we get a score for each category? Above you only have one or am I getting it wrong?
thanks k.
@m1ci hey milan quick question. Do we get a score for each category? Above you only have one or am I getting it wrong?
Yes, each category will have its own score. The above is just an example showing the label and score for one category of one entity. In reality you will receive more entities, each with associated categories and their scores attached.
We will investigate the data issue again, however (not to annoy you) we never had any of these issues with OpenCalais therefore we have to investigate deeper.
Maybe they do post-processing and remove entities that are on a "black list".
Also in relation to the spotting of 'NOT', for example, I will dig out some examples to keep us moving.
Yes, concrete examples are more than welcome.
Not to put pressure on, but if you have (even a tentative) timeline for the API integration I can start allocating resources accordingly.
Hopefully by this Friday.
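As a stop-gap on the consumer side, the "black list" post-processing speculated about above could be as small as the sketch below. It assumes entities arrive as dictionaries with `anchor` and `resource` keys; the block list content and the helper name `drop_blocked` are illustrative, and nothing here is known about what OpenCalais actually does:

```python
# Resources the consumer never wants to see in its analytics (illustrative list).
BLOCKED = frozenset({"http://dbpedia.org/resource/Not"})

def drop_blocked(entities, blocked=BLOCKED):
    """Remove entities whose linked resource is on the block list."""
    return [e for e in entities if e["resource"] not in blocked]

spotted = [
    {"anchor": "NOT", "resource": "http://dbpedia.org/resource/Not"},
    {"anchor": "Forbes", "resource": "http://dbpedia.org/resource/Forbes"},
]

print(drop_blocked(spotted))
```

This only hides the symptom; the underlying mis-spotting still needs concrete source texts to investigate.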
@m1ci Great looking forward to it. Thanks
I can provide examples of the not issue, will do this evening.
On 16 Sep 2015, at 14:50, Kevin Koidl notifications@github.com wrote:
@m1ci Great looking forward to it. Thanks
I can provide examples of the not issue, will do this evening.
OK, please do.
Hi guys,
Attached is a list of problem texts that return http://dbpedia.org/resource/Not. Each row contains the anchor and entity, followed by the text.
If you look at the texts, each contains the following sentence - This is a FREE report from Insider Monkey. Credit Card is NOT required. - which appears to return the entities Credit Card, Not and Free software. I have tried several of these texts with the FREME API and get those entities each time.
I know the text is a bit spammy but this is a big problem for us.
Thanks for all your help,
j
On 16 September 2015 at 17:05, Milan Dojčinovski notifications@github.com wrote:
I can provide examples of the not issue, will do this evening.
OK, please do.
— Reply to this email directly or view it on GitHub https://github.com/freme-project/e-Entity/issues/44#issuecomment-140789140 .
John McAuley
[{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. ] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},By The Motley Fool in News Published: July 3, 2013 at 9:23 am The housing market is definitely on the mend. Depending on how you want to slice the cattle, you can make a lot of money. I think though that not every investor is interested?in or?willing to take on an inordinate amount of risk. Because of this, I am going to lay out some key macroeconomic indicators, and get to the meat of the argument?as to?whether or not investors should even have a position in housing stocks. The economics, can?t ignore these Source: Ycharts The trend is your friend and the housing market is picking back up again. In certain areas of the United States, the amount of money spent on a mortgage is cheaper than the price of rent. Assuming that?the number of people employed increases and the economy continues to recover, the housing recovery should be well on its way. Source: Federal Reserve Going forward, the real?gross domestic product?is projected to grow at a 2.3% to 2.5% rate. If that is the case, investors should position themselves in housing because?the housing stocks would appreciate rapidly in a cyclical economic rebound. KB Home just announced earnings KB Home (NYSE:KBH) reported a fairly strong quarter. The company was able to increase its revenues by 73% year-over-year (this is a significant improvement; I?wonder where all the bears went on this one.) 
The company is continuing to recover. The company?s deliveries were up by 39% year-over-year, and the average selling price of homes?grew by around 25% year-over-year. The company?s property backlog is up by an additional 19%. Perhaps back logs are up because investors are fearful of missing out on the next leg-up in the property market. The company reported a loss of $0.04 per share versus?a year-ago period loss of $0.31 per share.?It also reported net income growth. Because revenues were up by $221 million, costs were comparatively up by $197 million. The net difference between the two was what contributed to the company?s net income. Analysts on a consensus basis were anticipating the company to report a $0.06 loss for the quarter, but KB Home (NYSE:KBH) beat analyst estimates by $0.02. Investable insights & another alternative Investors should consider buying a home. Ignore bonds and enjoy the safety of an appreciating real estate portfolio. Now I?m not saying that a home should be your only investment; I am saying that homebuilders are selling homes for ever higher prices. You want to buy on an up-trend.?The trend is your friend, after all, and it is obviously the time to own a bit of the American dream. If?owning a home is a little bit risky, however, why not consider?The Home Depot, Inc. (NYSE:HD)? The company is exposed to the housing sector through home improvement sales. After someone buys a home there?s usually a lot to fix, a lot to upgrade, and a lot to buy. Everything from gardening improvement, paint changes, pipe fixes, toilet replacement, and counter top changes can all be done at The Home Depot, Inc. (NYSE:HD). The company?s stock currently trades at a bit of a hefty valuation (with a 20.5 forward earnings multiple.)?In 2012, the company?was able to grow its earnings per share by 21.5%. The growth in earnings was driven by operating profit margins improving by 93 basis points to 10.39%. 
The company also repurchased $4 billion in shares, which also contributed to earnings-per-share growth. Analysts are pretty optimistic?about the company?s future. Disciplined cost management, paired with stronger macroeconomic indicators and share buybacks, will grow the company?s earnings going forward. The company?s?stock is projected to grow its earnings by 14.61% per year over the next five years. Why?not own the bank? I think that?Bank of America Corp (NYSE:BAC) could be the most well-positioned bank in terms of earnings growth (I?ll have a separate article dedicated towards the financial sector soon.) The company has a large portfolio of higher-risk securities because in all likelihood, higher-rated (safe) securities are being dumped in favor of riskier assets. The risk premium on BBB-rated bonds is 1.57 currently, which is below the long-run average of 1.867. Assuming that Bank of America Corp (NYSE:BAC) accumulated its BBB mortgages and bonds when risk premiums were above the long-run average, you can basically assume that the company is better positioned than other banks. Source: Bank of America Around 57% of the bank?s assets are below a BBB rating, which implies that the bank is less exposed to coupon note depreciation. It is assumed that the interest rates from the lower-rated securities could make up for the bank?s mark-to-market accounting losses from depreciating AAA-rated securities. After all, treasury bonds are AAA-rated assets and those are declining in value right now. What a bank should own are lower-rated securities that pay a higher rate of interest. Those higher rates of interest would make up for the depreciation on higher-quality debt. Fortunately, Bank of America has positioned itself for this already. The CEO, Brian Moynihan, also plans to cut back on spending by $8 billion by the year 2015. This is why analysts on a consensus basis anticipate?that the?company?will?grow its earnings by 23.39% per year over the next five years. 
The stock has 41.3 earnings ratio right now, which is reasonable when considering the projected rates of growth. Conclusion Investors need exposure to housing in their investment portfolio. Owning an actual house could be the most lucrative choice right now, but there are other options as well. The home ownership population has declined and the total number of households have gone up, so there?s a lot of pent-up demand which can be reflected in the backlog figures presented by KB Home (NYSE:KBH). Using that, as a leading indicator, we can also assume that demand for mortgages and home improvement will be up as well. Therefore, investors should consider a position in?companies such as KB Home (NYSE:KBH), The Home Depot, Inc. (NYSE:HD), and Bank of America Corp (NYSE:BAC). The article The Housing Recovery Is Offering Lucrative Investment Opportunities originally appeared on Fool.com and is written by Alexander Cho. Alexander Cho has no position in any stocks mentioned. The Motley Fool recommends Bank of America and Home Depot. The Motley Fool owns shares of Bank of America. Alexander is a member of The Motley Fool Blog Network ? entries represent the personal opinion of the blogger and are not formally edited. Copyright ? 1995 ? 2013 The Motley Fool, LLC. All rights reserved. The Motley Fool has a disclosure policy . Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. 
] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. ] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. ] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. 
] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},By Javier Hasse in Commodities , News Published: June 9, 2014 at 11:16 am On Friday June 6, Andrew Brown, CIO at Emerging Capital Partners, was interviewed at CNBC and talked about investment opportunities with great potential in the African continent. Mr. Brown highlights that his private equity firm is a Pan-African investor, which implies that it endows businesses in the entire continent, not only in South Africa, as many assume. In fact, he states, opportunities in South Africa are less interesting that those present in the rest of the continent. In terms of where the opportunities are in the continent, most people talk about Nigeria, ?because of the young demographic and rapidly growing population? (CNBC interviewer). Mr. Brown further explains that everybody tends to focus on Nigeria because ?it?s a single country with a lot of people.? However, Emerging Capital Partners looks beyond this, and seeks to reach the same population size, delivering products and services, by endowing companies with presence in several smaller countries. He continues, ?The dynamic you?re seeing in Nigeria is a dynamic that?s playing out across Africa. The challenge is how you actually build businesses that can operate and address that market need.? When considering investing in Africa, one must take into account that, as a continent, it is growing at 5% per year, and this growth rate is accelerating. Actually, this recently resulted in the Work Bank upgrading its forecast to 6% for the continent. ? But what about the risk? Well, Mr. Brown?s job as a fund manager is to manage that risk in order to get stable returns. ?I can?t tell you there is no country or political risk across Africa, but there are certainly lots of businesses that aren?t really impacted by political risk per se. And then, when we invest, we like to build platform companies that are operating across a number of countries (?) 
and that provides a diversification not only at the portfolio company level, but then when you aggregate that to the fund level, we have a very diversified portfolio,? Brown assures. Emerging Capital Partners? portfolio comprises investments in 45 out of 54 countries across Africa, and includes telecoms, commodities, and food and drink stocks, amongst others. Its assets under management surpass the $2 billion threshold. Finally, he talks about Africa?s shift towards a consumer-driven economy: ?I think what you?re seeing come through ?Brown assures- is an emerging consumer class and we?re looking to make investments that will provide good quality, well priced, goods and services into that emerging consumer class.? So, maybe, it could be time to consider investing in Africa, and helping this continent, its economy, and its people, often left behind, develop. Watch the full interview: Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. ] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. 
] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},By The Motley Fool in News Published: June 19, 2013 at 2:22 pm Microsoft Corporation (NASDAQ: MSFT ) recently announced Office for iOS (but not iPad), a great sign for the company?s future in cloud-based office suites ? if not for the future of the Surface. The move is part of a long-standing trend toward web- and cloud-based document software, which?Google Inc (NASDAQ: GOOG ) pioneered years ago.?Apple Inc. (NASDAQ: AAPL ) is now finally dipping more than a toe in with the unveiling of iWork in Cloud at this year?s WWDC. Let?s take a look at current cloud-based office offerings from Microsoft, Apple Inc. (NASDAQ: AAPL ), and Google, and what they mean for investors. Microsoft Corporation (NASDAQ: MSFT ) Redmond shook things up a couple of years ago with their announcement of a subscription-based Office suite. Dubbed Office 365, the program provides access to the Office suite and other Microsoft products for a variable fee per year. With the release of Microsoft Office 2013, the company went all-in, developing a Home Premium version catering to regular consumers, and an education flavor for students looking to save money. By Microsoft?s account, the suite is selling pretty well , and it?s no wonder ? the price is right, and wide adoption of Office software means it?s the standard in many organizations and industries. This cloud-based suite is one of the things Microsoft is getting right these days, and I think it?s a prescient move that will cement its place as king of the Office suite for a few more years. With a billion Office users , the company has a pretty big hill to stand on. And that?s important for a company whose Windows division has seen flat growth in the wake of the Windows 8 debacle. On the other hand, Office 365 has driven growth in its parent Microsoft Business Division, which was the company?s most profitable last quarter . 
While the new subscription-based model could mean lower quarterly revenue in the short-term, Microsoft is hoping it will producer bigger margins year-over-year ? and its recent moves should hearten investors who hope Redmond is right. Apple Inc. (NASDAQ: AAPL ) At its recent Worldwide Developers Conference, Apple Inc. (NASDAQ: AAPL ) announced ?iWork for iCloud,? which is a little weird, because iWork was already available (kind of) in the (i)Cloud. The suite had been languishing for years, receiving only incremental updates since the release of ??gulp ? iWork ?09. Sure, Apple Inc. (NASDAQ: AAPL ) put out versions for iPad in 2010, and pushed content to the cloud last July with the release of OS X 10.8 Mountain Lion. But these were incremental changes ? ?nothing to really compete with the Microsoft or Google juggernauts. iWork for iCloud might change that. The big difference here is browser-based editing, which puts the suite in direct competition with Google Docs for the first time. But there are a couple of reasons I think Cuptertino?s offering will still fall short. First of all, Apple Inc. (NASDAQ: AAPL ) made no mention of real-time collaboration. The ability to watch what your colleagues are typing and chat about it is one of the strongest affordances of working in the cloud. Microsoft has promised it, Google (of course) has it, but Apple doesn?t seem concerned. I think it?s a real missed opportunity. Secondly, Apple?s iCloud has been notoriously unreliable . The company famous for simple functionality has failed to live up to Steve Jobs? claim: ?It just works.? Apple will need to make substantial improvements if it wants to convince anyone that iWork in iCloud is the office suite solution they?re looking for. Of course, iWork and the rest of Apple?s software offerings provide only a small fraction of its revenue. Last quarter it made a combined $38 billion from its hardware and only $4.1 billion from software and iTunes Store sales. 
iWork for iCloud?s value, if it is to provide one, will be to drive sales of Apple?s hardware. We?ll have to wait and see if the new offering is any more successful than the last version. Google Inc (NASDAQ: GOOG ) I don?t need to tell you that Google Docs is popular. But I will anyway. Consulting firm Gartner was surprised to find that between 33% and 50% of cloud-based office users were on Google Docs in 2012 ? compared with 10% in 2007. That?s huge growth and a huge market share for a product competing with the one?called ?Office.? Of course, Google Docs is free (with the exception of their enterprise offerings), meaning the product produces only a little more than 1% of Google?s revenue. Still, like many of Google?s offerings, Docs is about bringing users into the Google ecosystem. Unlike Office, Google has built Docs from the ground up as an online tool, while Microsoft has had to adapt its offerings for the cloud. In some areas, Google might never be able to replicate Office. But for many businesses, Docs might be a viable option. And as working in the cloud becomes more normal, I think you?ll see more and more enterprise customers turning to Google?s solutions for their document, calendar, and email needs. Last year, Google Apps provided $1 billion in revenue for Google. That still makes up only 1.4% of the tech giant?s revenue, but I?m not the only one who expects that number to grow. The bottom line The real competition here is between Google and Microsoft. Both have full-featured cloud-based suites that provide a viable option for enterprise and small-business customers. And many regular consumers are likely to choose either Office or Google Docs for their office suite needs, even if those consumers use Apple products. In some ways, the two companies are competing for different customers. But I think Google will continue to eat into Microsoft?s cloud-based office market share. 
Nonetheless, Microsoft is making strong moves to solidify its position in a market where complacency can be deadly. And speaking of complacency, Apple has been slow-moving on cloud-based office solutions. One could?ve been forgiven for thinking it had simply given up the fight before this year?s WWDC, where we saw a glimmer of what might be. Still, Apple needs to make big changes before they can hope to provide a cloud-based office solution for anyone but the most dedicated fans. Steven Yenzer owns shares of Apple. The Motley Fool recommends Apple and Google. The Motley Fool owns shares of Apple, Google, and Microsoft. The article Why Apple?s Head Is in the Cloud originally appeared on Fool.com and is written by Steven Yenzer. Steven is a member of The Motley Fool Blog Network ? entries represent the personal opinion of the blogger and are not formally edited. Copyright ? 1995 ? 2013 The Motley Fool, LLC. All rights reserved. The Motley Fool has a disclosure policy . Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. ] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},Published: June 4, 2013 at 9:20 am Editor?s Note: Related tickers: UniPixel Inc (NASDAQ: UNXL ) UniPixel to Feature UniBoss Touch Screen Technology at Computex Taipei in Taiwan on June 4-8, 2013 (Sys-Con) UniPixel Inc (NASDAQ: UNXL ), a provider of Performance Engineered Films? 
to the touch screen, flexible printed electronics, and lighting and display markets, will attend Computex Taipei in Taiwan on June 4-8, 2013, where it will showcase product samples and prototypes of its UniBoss? pro-cap, multi-touch sensor film. The company will demonstrate its 10.1? and 13.3? UniBoss prototypes, as well as meet with touch-screen customers and supply chain members. While UniBoss offers linear cost scalability from pocket-size mobile devices to large desktop displays, these two prototype form factors target the highest growth segment of the market. Uni-Pixel Stock Rating Reaffirmed by Cowen Securities (UNXL) (DailyPolitical) UniPixel Inc (NASDAQ:UNXL)?s stock had its ?outperform? rating reaffirmed by equities research analysts at Cowen Securities in a research note issued to investors on Monday, Analyst Ratings.Net reports. They currently have a $46.00 price objective on the stock. Cowen Securities? target price points to a potential upside of 202.43% from the company?s current price. A number of other firms have also recently commented on UniPixel Inc (NASDAQ:UNXL). Analysts at Zacks downgraded shares of UniPixel Inc (NASDAQ:UNXL)?from an ?outperform? rating to a ?neutral? rating in a research note to investors on Monday, May 27th. They now have a $28.20 price target on the stock. Uni-Pixel at Center of Possible Securities Fraud Claims Investigation (Benzinga) ?Build a better mousetrap,? so the saying goes, ?and the world will beat a path to your door.? Saying you built a better mousetrap, however, is not the same as actually doing it. In a press release issued Saturday, Ademi & O?Reilly, LLP, announced an investigation into possible securities fraud claims against UniPixel Inc (NASDAQ:UNXL)?that the law firm said resulted from ?inaccurate statements UniPixel Inc (NASDAQ:UNXL)?made regarding its financial performance and future prospects for the period Dec. 7, 2012 to May 30, 2013.? NASDAQ Decliners Watch List: First Solar, Inc. 
(NASDAQ:FSLR), Uni-Pixel, Inc. (NASDAQ:UNXL), and SolarCity Corporation (NASDAQ:SCTY) Added to Growing Stock Report?s NASDAQ Decliners Watch List. (SBWire) UniPixel Inc (NASDAQ:UNXL)?a company that delivers performance engineered films to the display, touch screen, and flexible electronics market segments in the United States is currently down (-0.66%) on 2,697,581 shares traded after Seeking Alpha Questioned Quality of Touch Mesh. UniPixel Inc (NASDAQ:UNXL)?is currently down (-65.45%) from its recent 52-week high which has prompted Growing Stock Report to add the stock to their NASDAQ Decliners Watch List. Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. ] [{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"},Biotech Insider Alert - $5 Stock To Hit $40 $200 Million Dollar Healthcare Hedge Fund's #1 Best Idea Right Now The best healthcare hedge fund out there right now is one of the largest shareholders in this biotech stock. The fund returned more than 20% in each of the last 2 years with a virtually fully hedged portfolio, and it's sending out a BUY signal on this biotech stock. Get your FREE REPORT today (retail value of $300) This is a FREE report from Insider Monkey. Credit Card is NOT required. ]
scala> sql("SELECT freme_topic, text FROM flattened where freme_topic = '{\"anchor\":\"NOT\",\"resource\":\"http://dbpedia.org/resource/Not\"}' limit 50").collect().foreach(println)
15/09/16 14:44:53 INFO InMemoryColumnarTableScan: Predicate (freme_topic#43 = {"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"}) generates partition filter: ((freme_topic.lowerBound#721 <= {"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"}) && ({"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"} <= freme_topic.upperBound#720))
15/09/16 14:44:53 INFO SparkContext: Starting job: collect at
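The query above matches the `freme_topic` column by exact string equality against the serialized JSON. A minimal sketch of the same check outside Spark follows, assuming each row carries `freme_topic` as a JSON string; it parses the JSON so that key order does not matter. The helper name `has_not_topic` and the sample rows are hypothetical and only illustrate the idea.

```python
import json

# The resource that is wrongly returned for the word "NOT".
NOT_URI = "http://dbpedia.org/resource/Not"


def has_not_topic(freme_topic):
    """Return True if the JSON-encoded topic points at the spurious 'Not' resource."""
    try:
        topic = json.loads(freme_topic)
    except (TypeError, ValueError):
        # Missing or malformed values are treated as "no match".
        return False
    return isinstance(topic, dict) and topic.get("resource") == NOT_URI


# Hypothetical rows shaped like the `flattened` table (freme_topic, text).
rows = [
    {"freme_topic": '{"anchor":"NOT","resource":"http://dbpedia.org/resource/Not"}',
     "text": "Credit Card is NOT required."},
    {"freme_topic": '{"resource":"http://dbpedia.org/resource/Not","anchor":"NOT"}',
     "text": "Same topic, different key order."},
    {"freme_topic": '{"anchor":"Africa","resource":"http://dbpedia.org/resource/Africa"}',
     "text": "A legitimate entity."},
]

problem_rows = [row for row in rows if has_not_topic(row["freme_topic"])]
```

Comparing the parsed `resource` field rather than the raw string also catches rows that a strict string comparison would miss if the serialization ever changed.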
Attached is a list of problem texts that return http://dbpedia.org/resource/Not
I can't see the attachment; I think you can't attach files to GitHub issues. I suggest uploading it to this GDrive folder (https://drive.google.com/drive/folders/0B8CeKhHCOSqUfm9aMGM0NlF0VDNFa19ldDNLX21sbE9Vc3NQX1NDdnQwYVdXZFlta0RYR28) and linking to the file from the GitHub issue.
I created a new issue because NOT is being detected as an entity: https://github.com/freme-project/e-Entity/issues/49
Ah, ok will amend now.
(In reply to Jan Nehring's comment above, 17 September 2015, 10:25.)
John McAuley
@koidl the topics and associated weights and labels are already provided as part of the output. See http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=Berlin+is+in+Germany.&outformat=turtle&language=en&dataset=dbpedia&enrichement=dbpedia-categories
The category information is not included by default. You need to add the enrichement=dbpedia-categories parameter to include the topics in the output.
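As a rough illustration of how the request is assembled, the sketch below builds the same URL from its parameters using only the Python standard library. The endpoint and parameter names are taken from the link above (including the `enrichement` spelling); the function name `build_request_url` and the `with_categories` flag are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint as given in the link above.
BASE_URL = "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/"


def build_request_url(text, with_categories=True):
    """Build the e-Entity request URL; topics are only returned when requested."""
    params = {
        "informat": "text",
        "input": text,
        "outformat": "turtle",
        "language": "en",
        "dataset": "dbpedia",
    }
    if with_categories:
        # Parameter name spelled exactly as in the endpoint URL.
        params["enrichement"] = "dbpedia-categories"
    return BASE_URL + "?" + urlencode(params)


url = build_request_url("Berlin is in Germany.")
```

Calling `build_request_url(..., with_categories=False)` yields the default request, which omits the category information.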
Any feedback is more than welcome.
We can further improve the results by:
We should open additional issues for feedback or improvements. There is too much and too diverse content in this issue.
The first version of topic detection is implemented, so I am closing this issue. I have moved the feedback task to #50.
Enrich each named entity with a list of topics. The topics will be derived from the DBpedia dcterms:subject information. Output example:
This includes two actions:
1) process the Wripl data and hand the results back to Wripl for validation and feedback
1.1) provide the data in TSV
1.2) the data should contain only one record for each entity (remove duplicates)
2) incorporate the feedback and implement this as a feature of e-Entity
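Steps 1.1 and 1.2 can be sketched as follows: keep one record per entity within a document and write the result as TSV. This is only an illustration under stated assumptions: entities are deduplicated by their entity link, the four columns follow the fields named in the email excerpt (surface form, entity link, entity type, topics), and the dictionary keys, column headers, and sample mentions are hypothetical.

```python
import csv
import io


def dedupe_entities(entities):
    """Keep the first mention of each entity link, preserving order."""
    seen = set()
    unique = []
    for ent in entities:
        if ent["link"] not in seen:
            seen.add(ent["link"])
            unique.append(ent)
    return unique


def to_tsv(entities):
    """Serialize deduplicated entities of one document as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["surface_form", "entity_link", "entity_type", "topics"])
    for ent in dedupe_entities(entities):
        # Topics are joined into a single comma-separated field.
        writer.writerow([ent["surface"], ent["link"], ent["type"], ",".join(ent["topics"])])
    return buf.getvalue()


# Hypothetical mentions from one document; the second repeats the first entity.
mentions = [
    {"surface": "United States", "link": "http://dbpedia.org/resource/United_States",
     "type": "Location", "topics": ["G20 nations", "Republics"]},
    {"surface": "U.S.", "link": "http://dbpedia.org/resource/United_States",
     "type": "Location", "topics": ["G20 nations", "Republics"]},
]

tsv = to_tsv(mentions)
```

Deduplicating on the entity link rather than the surface form means that "United States" and "U.S." collapse into a single record when they resolve to the same resource.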