Closed x-fran closed 8 years ago
1) FREME NER and other services process data that is sent by the clients. If we do data cleansing, we might break other tools, which expect the output text to have the same length as the input text.
In fact, "Fundéu" is incorrectly encoded on the client side, so we can't do anything about it.
2) As for the "reliable entities": FREME NER does not perform entity ranking at the moment. It only performs entity spotting, linking and classification.
All text sent to FREME should be UTF-8 encoded. I created an issue to put that in the documentation: https://github.com/freme-project/Documentation/issues/55
Hi
We have a problem with the e-entity service.
At the moment ')' shows in the dashboard as an entity - see the attachment.
Do we know why that is?
kevin
Do we know why that is?
Because ")" was spotted as entity.
Is it one?
No, it is not, it's a mistake. Please provide an example text so we can track and address the issue.
Thanks - we are working on it. We will send examples shortly
One example:
http://spooool.ie/news/take-two/11958-take-two-sam-smiths-bond-theme-room-trailer
[{"tag":"Third Man Records","score":1},{"tag":"Jack White","score":1},{"tag":"Room","score":1},{"tag":"Brie Larson","score":1},{"tag":"Lenny Abrahamson","score":1},{"tag":"Emma Donoghue","score":1},{"tag":"Radiohead","score":1},{"tag":"James Bond","score":1},{"tag":"Spectre","score":1},{"tag":"Sam Smith","score":1},{"tag":"\"Writing","score":1},{"tag":"On The Wall","score":1}]
The problematic one here is
{"tag":"\"Writing","score":1}
Do you get that too?
Please send us just the text, preferably in a doc. Thanks!
Unfortunately we don't store the text in the db, only in solr, which is super hard to pull out. It's from the WP plugin, which only sends the text in the body tag - and the title of the page too. In any case is FREME not also using URLs now which should bring the same problem? Will I mail the body and title text to you from the examples we find? Also not sure if this will fix it. We get '/' in some cases then '(' in others... should we not think of some kind of filter for special characters? Also SQL query injection might be possible?
Unfortunately we don't store the text in the db, only in solr, which is super hard to pull out.
I don't know your schema, but via the SOLR admin interface you can query the exact document. A query will look something like: url:"http://spooool.ie/news/take-two/11958-take-two-sam-smiths-bond-theme-room-trailer"
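For anyone who prefers to script that lookup rather than use the admin interface, here is a minimal sketch in Python. Only the `url` field comes from the query above; the host, port and core name are placeholders I made up, and I am assuming the core exposes Solr's standard `select` handler.

```python
from urllib.parse import urlencode


def build_solr_lookup(base_url: str, core: str, page_url: str) -> str:
    """Build a Solr select URL that fetches the document stored for one page URL."""
    # Quote the value as a phrase so ':' and '/' inside it are not read as query syntax.
    escaped = page_url.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'url:"{escaped}"', "wt": "json", "rows": 1}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host and core name; replace them with your own.
lookup = build_solr_lookup(
    "http://localhost:8983/solr",
    "pages",
    "http://spooool.ie/news/take-two/11958-take-two-sam-smiths-bond-theme-room-trailer",
)
```

Fetching `lookup` with any HTTP client should return the stored document as JSON, from which the body and title text can be copied into a .txt file.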
In any case is FREME not also using URLs now which should bring the same problem?
FREME NER is processing only texts. Any markup is not welcome and might influence the entity spotting phase.
Will I mail the body and title text to you from the examples we find?
FREME NER, and I think e-Terminology from Tilde as well, expects plain text. So please, just send us the text.
We get '/' in some cases then '(' in others... should we not think of some kind of filter for special characters?
Let's first find such cases.
Also SQL query injection might be possible?
On which side? Don't understand.
Also SQL query injection might be possible?
No user-submitted data reaches the MySQL database. Right now we use SQL only for user access tokens. So it is almost impossible for FREME to be vulnerable to SQL injection through text data sent to FREME NER.
@xFran Maybe you mean SOLR query injections instead of SQL injections? And did you find a (potential) security issue or are you just asking a general question?
I will try to extract some pages - SOLR is messy but I will do my best
SQL query injection would be on the FREME NER side. For example, can a user inject a SQL query that deletes a SOLR core? Reading this: http://www.matrixgroup.net/snackoclock/2013/01/getting-the-most-out-of-solr/#sthash.SieaWK9f.dpuf SOLR is not affected by SQL query injection.
That would be a SOLR query injection and not a SQL query injection. @nilesh-c I hope you do proper escaping of all data sent to SOLR to avoid such vulnerabilities?
Yes @jnehring, that's right, it is SOLR specific... as long as there is no DB or anything else picking up the sent content?
That would be a SOLR query injection and not a SQL query injection. @nilesh-c I hope you do proper escaping of all data sent to SOLR to avoid such vulnerabilities?
We use only the entity surface forms when querying Solr; matching a surface form that contains dangerous code is IMO nearly impossible.
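For reference, this is roughly what "proper escaping" of a term could look like. It is only a sketch in Python and not the code FREME NER actually runs: it backslash-escapes the characters that have special meaning in the standard Lucene/Solr query syntax, plus whitespace. The function names and the `label` field are made up for illustration.

```python
# Characters with special meaning in the standard Lucene/Solr query syntax.
SOLR_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_solr_term(term: str) -> str:
    """Backslash-escape query syntax characters and whitespace in a single term."""
    return "".join(
        "\\" + ch if ch in SOLR_SPECIAL or ch.isspace() else ch
        for ch in term
    )


def surface_form_query(field: str, surface_form: str) -> str:
    """Build a field query from an untrusted surface form (field name is hypothetical)."""
    return f"{field}:{escape_solr_term(surface_form)}"


query = surface_form_query("label", "a) OR *:*")
```

With this, a surface form such as `a) OR *:*` is sent as one literal term instead of being parsed as extra query clauses.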
Hey all,
I am getting a lot of this spotting also. It's on insider monkey which is an absolute nightmare.
Here are two examples:
I can give you about 100,000 though. Strangely I don't get this from http://api.freme-project.eu/doc/0.2/#!/e-Entity/execute_0 only on the dev API.
Example 1
The confectionery category is typically less threatened by private-label competition because loyal consumers are willing to pay up for their favorite sweet treats. Higher-margin confectionery also enjoys faster growth rates. Mondelez primarily competes with big-branded leaders Hershey and Switzerland-based Nestle (OTCBB: NSRGY ) in this segment.
While Nestle boasts a great deal of presence internationally and presents a big threat to Mondelez?s European business, Hershey Co (NYSE:HSY)?doesn?t even come close to its geographic diversity. The more than century-old candymaker derives only 16% of its revenues internationally. But Hershey has recently ramped up spending to boost its international presence. The maker of Kit Kat and Reese?s enjoyed a very successful 2012, with sales up more than 9%. It did so by raising prices and suffering a very small hit to volumes.
On the other hand, Mondelez International Inc (NASDAQ:MDLZ)?s cookie and cracker brands, which include Nabisco and Oreo, are more susceptible to private-label competition, particularly within Europe, where consumer acceptance of private labels is particularly high. Aside from private-label threats, Kellogg Company (NYSE: K ) is a major competitor in these divisions with its Famous Amos, Keebler, and Cheez-It brands. Even though Kellogg derives only one-third of its sales internationally, look for the company to experience continued growth in its established Latin American, European, and Asian markets, while likely pursuing acquisitions in other emerging markets.
Foolish bottom line
Without a doubt, Mondelez faces challenges. But its global diversification, ample international growth opportunities, and desirable product mix offer it plenty of opportunities. And give its competitors a lot to chew on.
Fool contributor Nicole Seghetti owns shares of Mondelez International. The Motley Fool recommends Coca-Cola and H.J. Heinz.
Copyright ? 1995 ? 2013 The Motley Fool, LLC. All rights reserved. The Motley Fool has a disclosure policy .
]
Example 2
[ http://www.insidermonkey.com/blog/hedge-funds-are-betting-on-wesbanco-inc-wsbc-171429/?singlepage=1,By Asma UL Husna in News
Published: June 14, 2013 at 1:23 pm
Is WesBanco, Inc. (NASDAQ: WSBC ) a buy right now? Prominent investors are getting more optimistic. The number of long hedge fund positions moved up by 1 in recent months.
In the financial world, there are dozens of indicators market participants can use to analyze Mr. Market. A pair of the best are hedge fund and insider trading sentiment. At Insider Monkey, our research analyses have shown that, historically, those who follow the top picks of the best fund managers can beat their index-focused peers by a very impressive amount ( see just how much ).
Just as important, optimistic insider trading activity is another way to break down the investments you?re interested in. There are lots of reasons for a bullish insider to sell shares of his or her company, but only one, very obvious reason why they would behave bullishly. Plenty of empirical studies have demonstrated the valuable potential of this strategy if investors know what to do ( learn more here ).
With all of this in mind, we?re going to take a peek at the recent action regarding WesBanco, Inc. (NASDAQ: WSBC ).
How are hedge funds trading WesBanco, Inc. (NASDAQ:WSBC)?
At Q1?s end, a total of 9 of the hedge funds we track were long in this stock, a change of 13% from the previous quarter.?As one would reasonably expect, some big names have been driving this bullishness. Citadel Investment Group , managed by Ken Griffin, initiated the largest position in WesBanco, Inc. (NASDAQ:WSBC). Citadel Investment Group had 0.6 million invested in the company at the end of the quarter.
What do corporate executives and insiders think about WesBanco, Inc. (NASDAQ:WSBC)?
Bullish insider trading is particularly usable when the primary stock in question has experienced transactions within the past six months. Over the latest 180-day time period, WesBanco, Inc. (NASDAQ:WSBC) has experienced zero unique insiders buying, and 2 insider sales ( see the details of insider trades here ).
Let?s go over hedge fund and insider activity in other stocks similar to WesBanco, Inc. (NASDAQ:WSBC). These stocks are Eagle Bancorp, Inc. (NASDAQ: EGBN ), The Bancorp, Inc. (NASDAQ: TBBK ), SCBT Financial Corporation (NASDAQ: SCBT ), City Holding Company (NASDAQ: CHCO ), and United Community Banks Inc (NASDAQ: UCBI ). This group of stocks are the members of the regional ? mid-atlantic banks industry and their market caps match WSBC?s market cap.
Company Name
]
John McAuley
Can you please create a .txt file for each example so we can reproduce the problem?
Will do, it will be later on.
j
Can't get access to SOLR from here. It will be early next week. Just wondering if a special characters filter might make more sense? Not sure if we will be able to find every faulty character. Also, using the categories might reduce this problem a lot.
@xFran I tried your example using API documentation. You mentioned two problems:
13 => string '(' (length=1)
This seems to be a mistake in e-Entity. Maybe it's better to generally ignore named entities of length 1 than to delete tokens from the text. E.g. the character "." might be used by named entity recognition.
... 20 => string 'Fundéu BBVA' (length=12) ...
The special characters look good in the output of the API tester. Maybe the special characters get broken on the client side?
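A small Python sketch of how this kind of breakage typically happens on a client: the bytes are valid UTF-8, but they get decoded as Latin-1. That the client does exactly this is an assumption on my side. Note that the reported length of 12 matches the UTF-8 byte count of the correct string, if the dump counts bytes.

```python
text = "Fundéu BBVA"

# What a UTF-8 response contains on the wire.
utf8_bytes = text.encode("utf-8")
byte_length = len(utf8_bytes)  # "é" takes two bytes in UTF-8

# Decoding those bytes with the wrong charset produces the garbled form.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as the garbled text was not altered further.
repaired = garbled.encode("latin-1").decode("utf-8")
```

Here `garbled` is the "Fundéu BBVA" form seen in the dump, and `repaired` equals the original text again, so the fix belongs in the client's decoding step and not in FREME NER.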
I will have to check when I get to SOLR
Ignoring length 1 is a good idea
Special characters happen a lot in some pages. We need to filter them somehow, I guess.
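Until this is fixed in e-Entity, a client-side filter along these lines might be enough. It is a rough sketch in Python that assumes the tag/score JSON format from the example earlier in this thread; the function name and the set of stripped characters are my own choice. It only touches the returned entity names, not the text sent to FREME NER, so input and output lengths are not affected.

```python
import json

# Characters removed from the start and end of a tag (an assumed, deliberately small set).
STRIP_CHARS = "\"'()[]{}/\\ \t\n"


def clean_tags(tags):
    """Strip surrounding quotes, brackets and slashes, then drop tags shorter than 2 characters."""
    cleaned = []
    for item in tags:
        tag = item["tag"].strip(STRIP_CHARS)
        if len(tag) < 2 or not any(ch.isalnum() for ch in tag):
            continue  # e.g. "(" or "/" spotted as an entity
        cleaned.append({**item, "tag": tag})
    return cleaned


raw = json.loads(
    '[{"tag":"Sam Smith","score":1},{"tag":"\\"Writing","score":1},{"tag":"(","score":1}]'
)
result = clean_tags(raw)
```

One caveat: dropping every tag of length 1 also drops legitimate one-character names, so the threshold may need adjusting for your content.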
I think this issue can be divided into two parts:
1) The wrongly spotted entity "(". I suggest to move this into a new issue.
2) Broken special chars in the response of FREME NER. I could not reproduce this bug, so I assume there is a bug in your client software (see my last comment). @xFran can you please look into that?
Then we should close this issue.
I suggest to move this into a new issue.
+1
Without concrete data we can't help.
The content I used for testing is actually a copy/paste from Wikipedia. Just put "Madrid" in the search field and you will have exactly the same content.
We now clean the content before sending it to FREME NER and we also clean up and get rid of any "strange" chars that we can get back in the entity name before using it.
We can close the issue.
I created #48 because of the wrongly spotted entity (
I'm testing e-Entity to find a way to get the most reliable entities from the content as fast as possible.
Let's take this well-known piece of text as an example.
Sending this text as it is, we get the issue that we've discussed here https://github.com/freme-project/e-Entity/issues/41 in our response, and 48 entities back.
A lot of entities right? But we have a lot of things that we don't need e.g:
What I did is clean up the content.
Note: I'm not proud of this code but hey I'm just playing around. :)
Now the content I send to FREME NER is looking like this:
The response from FREME NER:
A 33-item array instead of 48, containing only clean and more or less reliable entities. This will also be much faster for FREME NER to process and, for the end users, take less storage space if needed.
Imagine that I want to use "Fundéu" or ")" to dynamically build a URL. E.g. "example.com/Fundéu?param=)"
This may be a security issue also.
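On the URL point, percent-encoding the entity name on the client side takes care of both the non-ASCII characters and stray punctuation. A minimal sketch in Python, where the function name and the `param` key are only for illustration:

```python
from urllib.parse import quote, urlencode


def build_entity_url(base: str, entity: str, param: str) -> str:
    """Build a URL with the entity as a path segment and one query parameter."""
    path = quote(entity, safe="")           # percent-encodes UTF-8 bytes, '/' and ')'
    query = urlencode({"param": param})     # encodes the query value as well
    return f"{base.rstrip('/')}/{path}?{query}"


url = build_entity_url("https://example.com", "Fundéu", ")")
```

This gives `https://example.com/Fund%C3%A9u?param=%29`, so neither "é" nor ")" ends up in the URL unescaped.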