freme-project / e-Entity

Unreliable Entities #43

Closed x-fran closed 8 years ago

x-fran commented 8 years ago

I'm testing e-Entity to find a way to get the most reliable entities from the content as fast as possible.

Let's take this well-known piece of text as an example.

$content = "
        Madrid (/məˈdrɪd/, Spanish: [maˈðɾið], locally: [maˈðɾiθ, -ˈðɾi]) is a south-western European city and the
        capital and largest municipality of Spain. The population of the city is almost 3.2 million[4] and that of
        the Madrid metropolitan area, around 7 million. It is the third-largest city in the European Union, after
        London and Berlin, and its metropolitan area is the third-largest in the European Union after Paris and
        London.[5][6][7][8] The city spans a total of 604.3 km2 (233.3 sq mi).[9]
        The city is located on the Manzanares River in the centre of both the country and the Community of Madrid
        (which comprises the city of Madrid, its conurbation and extended suburbs and villages); this community
        is bordered by the autonomous communities of Castile and León and Castile-La Mancha. As the capital city of
        Spain, seat of government, and residence of the Spanish monarch, Madrid is also the political, economic and
        cultural centre of Spain.[10] The current mayor is Manuela Carmena from Ahora Madrid.
        The Madrid urban agglomeration has the third-largest GDP[11] in the European Union and its influences
        in politics, education, entertainment, environment, media, fashion, science, culture, and the arts all
        contribute to its status as one of the world's major global cities.[12][13] Due to its economic output,
        high standard of living, and market size, Madrid is considered the major financial centre of Southern
        Europe[14][15] and the Iberian Peninsula; it hosts the head offices of the vast majority of the major
        Spanish companies, such as Telefónica, Iberia and Repsol. Madrid is the 17th most livable city in the
        world according to Monocle magazine, in its 2014 index.[16][17]
        Madrid houses the headquarters of the World Tourism Organization (WTO), belonging to the United Nations
        Organization (UN), the SEGIB, the Organization of Ibero-American States (OEI), and the Public Interest
        Oversight Board (PIOB). It also hosts major international regulators of Spanish: the Standing Committee
        of the Association of Spanish Language Academies, headquarters of the Royal Spanish Academy (RAE), the
        Cervantes Institute and the Foundation of Urgent Spanish (Fundéu BBVA). Madrid organizes fairs such as
        FITUR,[18] ARCO,[19] SIMO TCI[20] and the Cibeles Madrid Fashion Week.[21]
        While Madrid possesses a modern infrastructure, it has preserved the look and feel of many of its historic
        neighbourhoods and streets. Its landmarks include the Royal Palace of Madrid; the Royal Theatre with its
        restored 1850 Opera House; the Buen Retiro Park, founded in 1631; the 19th-century National Library building
        (founded in 1712) containing some of Spain's historical archives; a large number of national museums,[22]
        and the Golden Triangle of Art, located along the Paseo del Prado and comprising three art museums:
        Prado Museum, the Reina Sofía Museum, a museum of modern art, and the Thyssen-Bornemisza Museum, which
        completes the shortcomings of the other two museums.[23] Cibeles Palace and Fountain have become the
        monument symbol of the city.[24][25][26]
        Madrid is home to two world-famous football clubs, Real Madrid and Atlético de Madrid.
        ";

Sending this text as-is, the response shows the issue we discussed in https://github.com/freme-project/e-Entity/issues/41, and we get 48 entities back:

array (size=48)
  0 => string 'Spain' (length=5)
  1 => string 'Madrid' (length=6)
  2 => string 'European Union' (length=14)
  3 => string 'Southern
        Europe' (length=23)
  4 => string 'Iberian Peninsula' (length=17)
  5 => string 'Spanish' (length=7)
  6 => string 'Telefónica' (length=11)
  7 => string 'Iberia' (length=6)
  8 => string 'Repsol' (length=6)
  9 => string 'Monocle' (length=7)
  10 => string 'World Tourism Organization' (length=26)
  11 => string 'WTO' (length=3)
  12 => string 'United Nations
        Organization' (length=35)
  13 => string '(' (length=1)
  14 => string 'UN' (length=2)
  15 => string 'Organization of Ibero-American States' (length=37)
  16 => string 'Public Interest
        Oversight Board' (length=39)
  17 => string 'Royal Spanish Academy' (length=21)
  18 => string 'Cervantes Institute' (length=19)
  19 => string 'Foundation of Urgent Spanish' (length=28)
  20 => string 'Fundéu BBVA' (length=12)
  21 => string 'FITUR' (length=5)
  22 => string 'ARCO' (length=4)
  23 => string 'SIMO' (length=4)
  24 => string 'Madrid metropolitan area' (length=24)
  25 => string 'Cibeles Madrid Fashion Week' (length=27)
  26 => string 'Royal Palace of Madrid' (length=22)
  27 => string 'Royal Theatre' (length=13)
  28 => string 'Opera House' (length=11)
  29 => string 'Buen Retiro Park' (length=16)
  30 => string 'National Library' (length=16)
  31 => string 'Golden Triangle of Art' (length=22)
  32 => string 'Paseo del Prado' (length=15)
  33 => string 'Prado Museum' (length=12)
  34 => string 'Reina Sofía Museum' (length=19)
  35 => string 'Thyssen-Bornemisza Museum' (length=25)
  36 => string 'Cibeles Palace' (length=14)
  37 => string 'Fountain' (length=8)
  38 => string 'Real Madrid' (length=11)
  39 => string 'Atlético de Madrid' (length=19)
  40 => string 'London' (length=6)
  41 => string 'Berlin' (length=6)
  42 => string 'Paris' (length=5)
  43 => string 'Manzanares River' (length=16)
  44 => string 'Community of Madrid' (length=19)
  45 => string 'Castile and León' (length=17)
  46 => string 'Castile-La Mancha' (length=17)
  47 => string 'European' (length=8)

A lot of entities, right? But we get a lot of things that we don't need, e.g.:

...
 13 => string '(' (length=1)
...
 20 => string 'Fundéu BBVA' (length=12)
...

What I did is clean up the content.

        $charsToRemoveFromContent = [
            "\n", "\r", "(", ")", "{", "}", "[", "]", "!", "?", "¡", "¿", ".", ",", '"', ":", ";", "=", "*", "\\", "#", "+", "/",
        ];
        // Clean up html tags, non-alphanumeric chars and blank spaces
        $content = preg_replace('/\s+/', ' ', strip_tags(str_replace($charsToRemoveFromContent, " ", htmlspecialchars($content))));
        // Remove non-ascii chars non-printables
        $content = preg_replace('/[[:^print:]]/', '', $content);
        // Remove numbers from string
        $content = preg_replace("/[0-9]/", "", $content);
        // Remove invalid UTF-8 chars
        $content = iconv("UTF-8","UTF-8//IGNORE",$content);

Note: I'm not proud of this code but hey I'm just playing around. :)
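For reference, the missing accents in the cleaned text below (Len, Telefnica, Atltico) are what one would expect from `[[:^print:]]` running without the `/u` modifier, since each byte of a multi-byte UTF-8 character is then treated as non-printable. A minimal Unicode-aware sketch, assuming UTF-8 input; the name `cleanForNer` is made up for illustration, and punctuation is deliberately left alone:

```php
<?php
// Hypothetical helper, not part of FREME or the original client code.
function cleanForNer(string $content): string
{
    // Drop markup so that only plain text is sent.
    $content = strip_tags($content);
    // \p{C} covers control and other invisible code points; /u makes PCRE read UTF-8.
    // preg_replace() returns null on invalid UTF-8, hence the fallback to ''.
    $content = preg_replace('/\p{C}+/u', ' ', $content) ?? '';
    // Collapse runs of whitespace (including newlines) into single spaces.
    $content = preg_replace('/\s+/u', ' ', $content) ?? '';
    return trim($content);
}

echo cleanForNer("Castile and Le\u{00F3}n\n\tand <b>Telef\u{00F3}nica</b>"), "\n";
```

This keeps "León" and "Telefónica" intact while still removing tags, line breaks and control characters.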

Now the content I send to FREME NER is looking like this:

Madrid mdrd Spanish mai locally mai -i is a south-western European city and the capital and largest municipality of Spain The population of the city is almost million and that of the Madrid metropolitan area around million It is the third-largest city in the European Union after London and Berlin and its metropolitan area is the third-largest in the European Union after Paris and London The city spans a total of km sq mi The city is located on the Manzanares River in the centre of both the country and the Community of Madrid which comprises the city of Madrid its conurbation and extended suburbs and villages this community is bordered by the autonomous communities of Castile and Len and Castile-La Mancha As the capital city of Spain seat of government and residence of the Spanish monarch Madrid is also the political economic and cultural centre of Spain The current mayor is Manuela Carmena from Ahora Madrid The Madrid urban agglomeration has the third-largest GDP in the European Union and its influences in politics education entertainment environment media fashion science culture and the arts all contribute to its status as one of the world's major global cities Due to its economic output high standard of living and market size Madrid is considered the major financial centre of Southern Europe and the Iberian Peninsula it hosts the head offices of the vast majority of the major Spanish companies such as Telefnica Iberia and Repsol Madrid is the th most livable city in the world according to Monocle magazine in its index Madrid houses the headquarters of the World Tourism Organization WTO belonging to the United Nations Organization UN the SEGIB the Organization of Ibero-American States OEI and the Public Interest Oversight Board PIOB It also hosts major international regulators of Spanish the Standing Committee of the Association of Spanish Language Academies headquarters of the Royal Spanish Academy RAE the Cervantes Institute and the Foundation of Urgent Spanish 
Fundu BBVA Madrid organizes fairs such as FITUR ARCO SIMO TCI and the Cibeles Madrid Fashion Week While Madrid possesses a modern infrastructure it has preserved the look and feel of many of its historic neighbourhoods and streets Its landmarks include the Royal Palace of Madrid the Royal Theatre with its restored Opera House the Buen Retiro Park founded in the th-century National Library building founded in containing some of Spain's historical archives a large number of national museums and the Golden Triangle of Art located along the Paseo del Prado and comprising three art museums Prado Museum the Reina Sofa Museum a museum of modern art and the Thyssen-Bornemisza Museum which completes the shortcomings of the other two museums Cibeles Palace and Fountain have become the monument symbol of the city Madrid is home to two world-famous football clubs Real Madrid and Atltico de Madrid

The response from FREME NER:

array (size=33)
  0 => string 'European Union' (length=14)
  1 => string 'Due' (length=3)
  2 => string 'Madrid' (length=6)
  3 => string 'Spanish' (length=7)
  4 => string 'Southern Europe' (length=15)
  5 => string 'Iberian Peninsula' (length=17)
  6 => string 'Monocle' (length=7)
  7 => string 'Cervantes Institute' (length=19)
  8 => string 'FITUR' (length=5)
  9 => string 'ARCO' (length=4)
  10 => string 'SIMO TCI' (length=8)
  11 => string 'Opera House' (length=11)
  12 => string 'Buen Retiro Park' (length=16)
  13 => string 'National Library' (length=16)
  14 => string 'Spain' (length=5)
  15 => string 'Golden Triangle of Art' (length=22)
  16 => string 'Paseo del Prado' (length=15)
  17 => string 'Prado Museum' (length=12)
  18 => string 'Thyssen-Bornemisza Museum' (length=25)
  19 => string 'Cibeles Palace' (length=14)
  20 => string 'Fountain' (length=8)
  21 => string 'London' (length=6)
  22 => string 'Real Madrid' (length=11)
  23 => string 'Berlin' (length=6)
  24 => string 'Paris' (length=5)
  25 => string 'The city' (length=8)
  26 => string 'Manzanares River' (length=16)
  27 => string 'Community of Madrid' (length=19)
  28 => string 'European' (length=8)
  29 => string 'Castile' (length=7)
  30 => string 'Len' (length=3)
  31 => string 'Castile-La Mancha' (length=17)
  32 => string 'Spain  The' (length=10)

That is an array of 33 items instead of 48, containing only clean and more or less reliable entities. It will also be much faster to process, both for FREME NER and for the end users, and it needs less storage space if the entities are stored.

Imagine that I want to use "Fundéu" or ")" to dynamically build a URL. E.g. "example.com/Fundéu?param=)"

This may be a security issue also.
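Whatever the service returns, the URL case can be handled on the client by percent-encoding the entity before it is used. A minimal sketch, assuming the entity label is a UTF-8 string; `buildEntityUrl` is a hypothetical helper, not an existing API:

```php
<?php
// Hypothetical helper: builds a URL from an entity label returned by the NER service.
function buildEntityUrl(string $base, string $entity, string $param): string
{
    // rawurlencode() percent-encodes the path segment per RFC 3986.
    $path = rawurlencode($entity);
    // http_build_query() encodes the query string; PHP_QUERY_RFC3986 selects the same encoding rules.
    $query = http_build_query(['param' => $param], '', '&', PHP_QUERY_RFC3986);
    return rtrim($base, '/') . '/' . $path . '?' . $query;
}

echo buildEntityUrl('https://example.com', "Fund\u{00E9}u", ')'), "\n";
```

With this, neither an accented character nor a stray ")" can change the structure of the URL.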

m1ci commented 8 years ago

1) FREME NER and the other services process the data as it is sent by the clients. If we did data cleansing, we might break other tools that expect the output text to have the same length as the input text. In fact, Fundéu is incorrectly encoded on the client side, so we can't do anything about it.

2) As for the "reliable entities": FREME NER does not perform entity ranking at the moment. It only performs entity spotting, linking and classification.

jnehring commented 8 years ago

All text sent to FREME should be UTF-8 encoded. I created an issue to put that in the documentation: https://github.com/freme-project/Documentation/issues/55
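On the client, the encoding can be checked before the request is made. A minimal sketch, assuming the mbstring extension is available; `ensureUtf8` is a hypothetical helper, and the ISO-8859-1 fallback is only a guess that has to be adapted to the real source of the text:

```php
<?php
// Hypothetical pre-flight check; not part of any FREME client library.
function ensureUtf8(string $text, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Already valid UTF-8: send unchanged.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise assume the fallback encoding (an assumption!) and convert.
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}
```

Text that is already valid UTF-8 passes through untouched, so calling this on every request should be harmless.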

koidl commented 8 years ago

Hi

We have a problem with the e-entity service.

At the moment ')' shows up in the dashboard - see the attached screenshot.

Do we know why that is?

kevin

[Attached screenshot: 2015-09-11 at 10.28.21]

m1ci commented 8 years ago

Do we know why that is?

Because ")" was spotted as entity.

koidl commented 8 years ago

Is it one?

m1ci commented 8 years ago

No, it is not; it's a mistake. Please provide an example text so we can track down and address the issue.

koidl commented 8 years ago

Thanks - we are working on it. We will send examples shortly

koidl commented 8 years ago

One example:

http://spooool.ie/news/take-two/11958-take-two-sam-smiths-bond-theme-room-trailer

[{"tag":"Third Man Records","score":1},{"tag":"Jack White","score":1},{"tag":"Room","score":1},{"tag":"Brie Larson","score":1},{"tag":"Lenny Abrahamson","score":1},{"tag":"Emma Donoghue","score":1},{"tag":"Radiohead","score":1},{"tag":"James Bond","score":1},{"tag":"Spectre","score":1},{"tag":"Sam Smith","score":1},{"tag":"\"Writing","score":1},{"tag":"On The Wall","score":1}]

The problematic one here is

{"tag":"\"Writing","score":1}

Do you get that too?

m1ci commented 8 years ago

please send us just the text - preferably in a doc. Thanks!

koidl commented 8 years ago

Unfortunately we don't store the text in the DB, only in Solr, which is super hard to pull out. It's from the WP plugin, which only sends the text in the body tag, plus the title of the page. In any case, is FREME not also using URLs now, which should bring the same problem? Will I mail the body and title text to you from the examples we find? Also, I'm not sure this will fix it. We get '/' in some cases, then '(' in others... should we not think of some kind of filter for special characters? Also, SQL query injection might be possible?

m1ci commented 8 years ago

Unfortunately we don't store the text in the DB, only in Solr, which is super hard to pull out.

I don't know your schema, but via the Solr admin interface you can query the exact document. A query will look something like: url:"http://spooool.ie/news/take-two/11958-take-two-sam-smiths-bond-theme-room-trailer"

In any case is FREME not also using URLs now which should bring the same problem?

FREME NER is processing only texts. Any markup is not welcome and might influence the entity spotting phase.

Will I mail the body and title text to you from the examples we find?

FREME NER and, I think, e-Terminology from Tilde as well expect plain text. So please, just send us the text.

We get '/' in some cases then '(' in others... should we not think of some kind of filter for special characters?

Let's first find such cases.

Also SQL query injection might be possible?

On which side? Don't understand.

jnehring commented 8 years ago

Also SQL query injection might be possible?

No user-submitted data reaches the MySQL database. Right now we use SQL only for user access tokens. So it is almost impossible that FREME is vulnerable to SQL injections from text data sent to FREME NER.

@xFran Maybe you mean SOLR query injections instead of SQL injections? And did you find a (potential) security issue or are you just asking a general question?

koidl commented 8 years ago

I will try to extract some pages - SOLR is messy but I will do my best

SQL query injection would be on the FREME NER side. For example, can a user inject a SQL query that deletes a SOLR core? Reading this: http://www.matrixgroup.net/snackoclock/2013/01/getting-the-most-out-of-solr/#sthash.SieaWK9f.dpuf SOLR is not affected by SQL query injection.

jnehring commented 8 years ago

That would be a SOLR query injection and not a SQL query injection. @nilesh-c I hope you do proper escaping of all data sent to SOLR to avoid such vulnerabilities?

koidl commented 8 years ago

Yes @jnehring, that's right, it is SOLR-specific... as long as there is no DB or anything else picking up the sent content?

m1ci commented 8 years ago

That would be a SOLR query injection and not a SQL query injection. @nilesh-c I hope you do proper escaping of all data sent to SOLR to avoid such vulnerabilities?

We use only the entity surface forms when querying Solr. Matching a surface form that contains dangerous code is IMO nearly impossible.
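For clients that query Solr themselves with entity labels, escaping the query parser's special characters is cheap insurance. A minimal sketch, assuming the standard Lucene query syntax; `escapeSolrTerm` is a hypothetical helper and says nothing about how FREME itself talks to Solr:

```php
<?php
// Hypothetical helper: backslash-escapes characters that have a special meaning
// in the standard Solr/Lucene query syntax.
function escapeSolrTerm(string $term): string
{
    // The second argument is a plain character list; the trailing \\ is one literal backslash.
    return addcslashes($term, '+-&|!(){}[]^"~*?:/\\');
}

echo escapeSolrTerm('AC/DC (band)'), "\n";
```

The escaped term can then be placed inside a field query without the brackets or slashes being interpreted as operators.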

johnmcauley commented 8 years ago

Hey all,

I am getting a lot of this spotting also. It's on insider monkey which is an absolute nightmare.

Here are two examples:

I can give you about 100,000 though. Strangely I don't get this from http://api.freme-project.eu/doc/0.2/#!/e-Entity/execute_0 only on the dev API.


Example 1

[ http://www.insidermonkey.com/blog/mondelez-international-inc-mdlz-kraft-foods-group-inc-krft-hershey-co-hsy-1-huge-reason-to-diversify-and-buy-this-global-giant-97297/2/,See All

The confectionery category is typically less threatened by private-label competition because loyal consumers are willing to pay up for their favorite sweet treats. Higher-margin confectionery also enjoys faster growth rates. Mondelez primarily competes with big-branded leaders Hershey and Switzerland-based Nestle (OTCBB: NSRGY ) in this segment.

While Nestle boasts a great deal of presence internationally and presents a big threat to Mondelez?s European business, Hershey Co (NYSE:HSY)?doesn?t even come close to its geographic diversity. The more than century-old candymaker derives only 16% of its revenues internationally. But Hershey has recently ramped up spending to boost its international presence. The maker of Kit Kat and Reese?s enjoyed a very successful 2012, with sales up more than 9%. It did so by raising prices and suffering a very small hit to volumes.

On the other hand, Mondelez International Inc (NASDAQ:MDLZ)?s cookie and cracker brands, which include Nabisco and Oreo, are more susceptible to private-label competition, particularly within Europe, where consumer acceptance of private labels is particularly high. Aside from private-label threats, Kellogg Company (NYSE: K ) is a major competitor in these divisions with its Famous Amos, Keebler, and Cheez-It brands. Even though Kellogg derives only one-third of its sales internationally, look for the company to experience continued growth in its established Latin American, European, and Asian markets, while likely pursuing acquisitions in other emerging markets.

Foolish bottom line

Without a doubt, Mondelez faces challenges. But its global diversification, ample international growth opportunities, and desirable product mix offer it plenty of opportunities. And give its competitors a lot to chew on.

Fool contributor Nicole Seghetti owns shares of Mondelez International. The Motley Fool recommends Coca-Cola and H.J. Heinz.

Copyright ? 1995 ? 2013 The Motley Fool, LLC. All rights reserved. The Motley Fool has a disclosure policy .

]



Example 2

[ http://www.insidermonkey.com/blog/hedge-funds-are-betting-on-wesbanco-inc-wsbc-171429/?singlepage=1,By Asma UL Husna in News

Published: June 14, 2013 at 1:23 pm

Is WesBanco, Inc. (NASDAQ: WSBC ) a buy right now? Prominent investors are getting more optimistic. The number of long hedge fund positions moved up by 1 in recent months.

In the financial world, there are dozens of indicators market participants can use to analyze Mr. Market. A pair of the best are hedge fund and insider trading sentiment. At Insider Monkey, our research analyses have shown that, historically, those who follow the top picks of the best fund managers can beat their index-focused peers by a very impressive amount ( see just how much ).

Just as important, optimistic insider trading activity is another way to break down the investments you?re interested in. There are lots of reasons for a bullish insider to sell shares of his or her company, but only one, very obvious reason why they would behave bullishly. Plenty of empirical studies have demonstrated the valuable potential of this strategy if investors know what to do ( learn more here ).

With all of this in mind, we?re going to take a peek at the recent action regarding WesBanco, Inc. (NASDAQ: WSBC ).

How are hedge funds trading WesBanco, Inc. (NASDAQ:WSBC)?

At Q1?s end, a total of 9 of the hedge funds we track were long in this stock, a change of 13% from the previous quarter.?As one would reasonably expect, some big names have been driving this bullishness. Citadel Investment Group , managed by Ken Griffin, initiated the largest position in WesBanco, Inc. (NASDAQ:WSBC). Citadel Investment Group had 0.6 million invested in the company at the end of the quarter.

What do corporate executives and insiders think about WesBanco, Inc. (NASDAQ:WSBC)?

Bullish insider trading is particularly usable when the primary stock in question has experienced transactions within the past six months. Over the latest 180-day time period, WesBanco, Inc. (NASDAQ:WSBC) has experienced zero unique insiders buying, and 2 insider sales ( see the details of insider trades here ).

Let?s go over hedge fund and insider activity in other stocks similar to WesBanco, Inc. (NASDAQ:WSBC). These stocks are Eagle Bancorp, Inc. (NASDAQ: EGBN ), The Bancorp, Inc. (NASDAQ: TBBK ), SCBT Financial Corporation (NASDAQ: SCBT ), City Holding Company (NASDAQ: CHCO ), and United Community Banks Inc (NASDAQ: UCBI ). This group of stocks are the members of the regional ? mid-atlantic banks industry and their market caps match WSBC?s market cap.

Company Name

]

John McAuley

m1ci commented 8 years ago

Can you please create a .txt file for each example so we can reproduce the problem?

johnmcauley commented 8 years ago

Will do, it will be later on.

j

koidl commented 8 years ago

Can't get access to SOLR from here. It will be early next week. Just wondering if a special-characters filter might make more sense? Not sure if we will be able to find every faulty character. Also, using the categories might reduce this problem a lot.

jnehring commented 8 years ago

@xFran I tried your example using API documentation. You mentioned two problems:

13 => string '(' (length=1)

This seems to be a mistake in e-Entity. Maybe it's better to generally ignore named entities with length 1 than to delete tokens from the text. E.g. the character . might be used by named entity recognition.
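Such a filter can live on the client side until e-Entity is fixed. A minimal sketch, assuming the entity labels arrive as a flat array of UTF-8 strings and that mbstring is available; `dropTrivialEntities` and its default threshold are made up for illustration:

```php
<?php
// Hypothetical post-filter on the entity labels returned by the service.
function dropTrivialEntities(array $entities, int $minLength = 2): array
{
    $kept = array_filter($entities, function (string $label) use ($minLength): bool {
        $label = trim($label);
        // Count characters (not bytes) and require at least one letter or digit.
        return mb_strlen($label, 'UTF-8') >= $minLength
            && preg_match('/[\p{L}\p{N}]/u', $label) === 1;
    });
    // Re-index so the result is a plain list again.
    return array_values($kept);
}
```

The input text stays untouched, so character offsets in the service response remain valid.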

... 20 => string 'Fundéu BBVA' (length=12) ...

The special characters look good in the output of the API tester. Maybe the special characters get broken on the client side?
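Note that the reported length=12 is not a sign of a broken string by itself: PHP string lengths are byte counts, and "é" takes two bytes in UTF-8. A small sketch that shows the difference, assuming the mbstring extension is available:

```php
<?php
// "Fundéu BBVA" written with an explicit code point so the file encoding does not matter.
$label = "Fund\u{00E9}u BBVA";

var_dump(strlen($label));             // int(12): bytes
var_dump(mb_strlen($label, 'UTF-8')); // int(11): characters
```

If the label prints wrongly on the client, the usual suspects are the page or terminal charset rather than the response itself.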

koidl commented 8 years ago

I will have to check when I get to SOLR

Ignoring length 1 is a good idea

Special characters happen a lot in some pages. We need to filter it somehow I guess.

jnehring commented 8 years ago

I think this issue can be divided in two parts:

  1. Wrongly detected entities like (. I suggest to move this into a new issue.
  2. Broken special chars in the response of FREME NER. I could not reproduce this bug so I assume there is a bug in your client software (see my last comment). @xFran can you please investigate that?

Then we should close this issue.

m1ci commented 8 years ago

I suggest to move this into a new issue.

+1

Broken special chars in the response of FREME NER. I could not reproduce this bug so I assume there is a bug in your client software (see my last comment). @xFran can you please investigate that?

Without concrete data we can't help.

x-fran commented 8 years ago

The content I used for testing actually is a copy/paste from wikipedia. Just put Madrid in search field. You will have exactly the same data/content.

We now clean the content before sending it to FREME NER and we also clean up and get rid of any "strange" chars that we can get back in the entity name before using it.

We can close the issue.

jnehring commented 8 years ago

I created #48 because of the wrongly spotted entity (