dbpedia-spotlight / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.
Adding Bold Surface Forms #17

Open abhishekg2389 opened 9 years ago

abhishekg2389 commented 9 years ago

Hi @tgalery ,

This is regarding pull request https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/356. I have made some changes to extract the bold mentions from the dump file. Please review the changes. As for the test cases, I have extracted the words surrounded by triple single quotes (''') from the dump.
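For readers following along, here is a minimal Python sketch of the kind of extraction described in this comment: pulling spans wrapped in triple single quotes out of wikitext, together with their character offsets. It is not the code from the pull request; the regex, the function name `extract_bold_forms`, and the choice to report offsets of the inner text are all assumptions made for illustration.

```python
import re

# Three apostrophes open and close a bold span in wikitext.
# The lookarounds reject runs of four or more apostrophes, so
# bold+italic markup (five apostrophes) is simply skipped here.
BOLD = re.compile(r"(?<!')'''(?!')(.+?)(?<!')'''(?!')")

def extract_bold_forms(wikitext):
    """Return (start, end, text) per bold span; offsets cover the inner text."""
    return [(m.start(1), m.end(1), m.group(1)) for m in BOLD.finditer(wikitext)]

sample = "'''Brian Eno''' is also credited as '''Eno'''."
for start, end, text in extract_bold_forms(sample):
    print(start, end, text)
```

The sketch works on a single line of text; bold spans that cross line breaks or contain nested markup would need a real wikitext parser.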

tgalery commented 9 years ago

I wonder why the tests are failing. Maybe you changed a method name that didn't need to be changed?

abhishekg2389 commented 9 years ago

I have also updated the test case for bold form extraction.

tgalery commented 9 years ago

Thanks @abhishekg2389! I will take a look at it when I have some time!

abhishekg2389 commented 9 years ago

Hi @tgalery,

Can you please confirm that the surface form store is generated from pignlproc only and nothing else, so that I can start working on creating unseen surface forms, as I mentioned in my proposal?

tgalery commented 9 years ago

@abhishekg2389 pignlproc generates all the stats we use. It would be good to run the pignlproc process locally to see whether the generated model indeed has the bold sfs that you are trying to capture. Running the process on a small wikidump, like the Danish one, would be feasible on a single machine. Not sure when I will have some spare time though.

As for the unseen forms, I would take a step back and implement a loose sf matching function that generates variations based on a match and tries to retrieve all the candidates for all the matches. There was a very old commit of mine that tried to fix some white space issues here https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/284/files , but I would go beyond that and do the following: (i) get a spot -> (ii) generate a set of possible sfs on the basis of (i) -> (iii) get all the candidates from the sfs in (ii), but adjust the sf probability according to some function that calculates some distance between the original sf in (i) and the sf generated in (ii).
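A rough Python sketch of steps (i) to (iii), to make the idea concrete. Everything here is an assumption for illustration: the toy `SF_STORE`, the variant set, and the similarity function (a blend of case-sensitive and case-insensitive `difflib` ratios) stand in for whatever store and distance function Spotlight would actually use.

```python
from difflib import SequenceMatcher

# Hypothetical toy store: surface form -> {candidate URI: P(candidate | sf)}.
SF_STORE = {
    "Apple": {"Apple_Inc.": 0.7, "Apple": 0.3},
    "apple": {"Apple": 0.9, "Apple_Inc.": 0.1},
}

def generate_variants(spot):
    """Step (ii): a deliberately small set of variations of the spot."""
    return {spot, spot.lower(), spot.upper(), spot.capitalize(),
            " ".join(spot.split())}  # last one collapses stray whitespace

def similarity(a, b):
    """Stand-in for the distance function: 1.0 for identical strings."""
    exact = SequenceMatcher(None, a, b).ratio()
    caseless = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 0.5 * exact + 0.5 * caseless

def loose_candidates(spot, store=SF_STORE):
    """Step (iii): gather candidates for every variant, scaled by similarity."""
    scores = {}
    for variant in generate_variants(spot):
        weight = similarity(spot, variant)
        for uri, prob in store.get(variant, {}).items():
            # Keep the best adjusted score seen for each candidate.
            scores[uri] = max(scores.get(uri, 0.0), prob * weight)
    return scores

print(loose_candidates("APPLE"))
```

Taking the maximum per candidate is only one possible way to merge scores; summing or renormalizing across variants would be equally defensible choices.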

dav009 commented 9 years ago

There is a small sample of the wikidump in this repo: https://github.com/dbpedia-spotlight/pignlproc/tree/master/src/test/resources. You could check whether any article there has markup for bold SFs and create a test set from it.

abhishekg2389 commented 9 years ago

Hi @dav009, @tgalery,

I have committed a couple of files above: Brian_Eno.xml (a test case resource) and TestWikipediaLoader.java, in which I have written the test case for the XML file. I generated the bold forms locally (4 were generated) and used their start and end offsets to verify bold form generation. I'll also upload the output for the Danish wikidump soon.

abhishekg2389 commented 9 years ago

@tgalery,

In one of my mails I suggested a similar idea: instead of generating all unseen SFs for the SF store, we can generate unseen SFs on the fly (during SF/mention and candidate matching), but it might take some time. I also suggested a probability function in my proposal, so I think we can use that here. May I go ahead and work on your suggested idea, if you want me to?

tgalery commented 9 years ago

Hi @abhishekg2389, thanks for your effort. On the test case, it would be nice to include a bold surface form that starts at, say, char 497 and ends at an index greater than 500, so that according to your rule the beginning of the bold word would be captured but not the end. In this case I would expect your functions not to break, nor to output the substring between indices 497-500.

tgalery commented 9 years ago

On the loose sf matching, doing the generation is actually kind of easy. The strategy at the moment is this: first we get a spot, and if we manage to find candidates in the uppercase SF store, we retrieve them and move them to the disambiguation phase. You need to change that; the pipeline would be: get the spot -> generate new spots -> get all the candidates in both the uppercase SF store and the lowercase SF store. I would advise you to do the generation step in stages: first just generate different case combinations and remove plurals, and then we might move to something more complex. You could base your generator on this https://github.com/dnmilne/wikipediaminer/blob/049b1d4a9568144e9e6704b9090d0579b21b3e2e/wikipedia-miner-core/src/main/java/org/wikipedia/miner/util/text/CaseAccentSimpleTextProcessor.java . Hope it helps.
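As a sketch of that first stage (case combinations only), something along these lines could work. It is written in Python for brevity rather than in Spotlight's own language, and the two dictionaries passed to `lookup` are hypothetical stand-ins for the uppercase and lowercase SF stores; none of the names come from the Spotlight code base.

```python
from itertools import product

def case_variants(spot):
    """Stage 1: every lower/capitalized/upper combination per token."""
    tokens = spot.split()
    options = [(t.lower(), t.capitalize(), t.upper()) for t in tokens]
    return {" ".join(combo) for combo in product(*options)} | {spot}

def lookup(spot, cased_store, lowercase_store):
    """Query both stores for every generated spot and merge the candidates."""
    candidates = set()
    for variant in case_variants(spot):
        candidates |= cased_store.get(variant, set())
        candidates |= lowercase_store.get(variant.lower(), set())
    return candidates

cased = {"New York": {"New_York_City"}}          # hypothetical store contents
lowercased = {"new york": {"New_York_(state)"}}  # hypothetical store contents
print(lookup("new york", cased, lowercased))
```

The number of variants grows as 3^n with the number of tokens, so a real generator would probably cap the spot length or prune combinations.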

abhishekg2389 commented 9 years ago

Hi @tgalery

According to my rule we consider bold forms which start before 500 characters. So I will include a bold form which starts before 500 and ends after 500, but that will be the last one we take. So should I change the rule and test case or not? Also, can you tell me whether I should upload the output from the Danish wikidump for verification? I will start working on generating surface forms and will commit the code soon.
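To illustrate the boundary case under discussion, here is a small Python sketch of the rule as stated in this comment: a bold form is kept whole if it starts before offset 500, even when it ends past it. The simplified regex, the function name, and the decision to measure the start at the first character inside the markup are assumptions, not the committed implementation.

```python
import re

BOLD = re.compile(r"'''(.+?)'''")  # simplified; ignores bold+italic markup
WINDOW = 500                        # cut-off taken from the discussion

def bold_forms_in_window(wikitext, window=WINDOW):
    """Keep bold forms starting before `window`, whole, even if they end later."""
    forms = []
    for m in BOLD.finditer(wikitext):
        if m.start(1) >= window:
            break  # matches arrive in order, so nothing later can qualify
        forms.append((m.start(1), m.end(1), m.group(1)))
    return forms

# A bold form starting at char 497 and ending past 500, as in the requested test.
text = "x" * 494 + "'''Brian Eno''' and later '''Eno'''"
print(bold_forms_in_window(text))
```

With this reading of the rule, the form at 497 is returned in full and never truncated to the 497-500 substring, while the later bold form is dropped.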

tgalery commented 9 years ago

Hi @abhishekg2389, the output of the pignlproc process will be a new Danish model. It would be good to check some Danish Wikipedia pages and see whether the bold forms that you captured are indeed there.

tgalery commented 9 years ago

Maybe you can post the zipped model to dropbox or something and post the url here so we can double check.

tgalery commented 9 years ago

And I forgot to mention: if you are working on the spotlight code, could you fork from our repo and create a feature branch from dev, please?

abhishekg2389 commented 9 years ago

Hi @tgalery,

Please find only the bold surface forms here: https://www.dropbox.com/s/3gopxn0csynto8i/bolds.tar.gz?dl=0 In my code there is a small problem with cases where the text is both bold and italic, but those cases are very rare. I will create a feature branch from dev and start working on that.

abhishekg2389 commented 9 years ago

Hi @tgalery,

Have you checked the output that I uploaded to Dropbox? I have started working on the approach and will come up with the code in a few days.

abhishekg2389 commented 9 years ago

There is one more thing that I would like to ask. For removing plurals there are two options:

  1. Replacing with rules - "s" -> "", "ses" -> "s", "xes" -> "x", "zes" -> "z", "ches" -> "ch", "shes" -> "sh", "men" -> "man", "ies" -> "y"
  2. Use stemmer by John Carroll: https://github.com/knowitall/morpha

In option 1 we might miss some complex cases like men, children. But besides these rules I have a semi-exhaustive list of plural-to-singular conversions. Option 2 might be more exhaustive than option 1, but it will be more time-consuming. Please suggest which option would be more appropriate.
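For comparison, option 1 could be sketched roughly as follows in Python. The suffix rules are the ones listed in this comment; the ordering (longest suffix first), the lowercasing, and the tiny `IRREGULAR` table standing in for the semi-exhaustive list are assumptions added for illustration.

```python
# Longest suffixes first, so "ches" is tried before the bare "s" rule.
RULES = [
    ("ches", "ch"), ("shes", "sh"),
    ("ses", "s"), ("xes", "x"), ("zes", "z"),
    ("ies", "y"), ("men", "man"),
    ("s", ""),
]

# Hypothetical exception table standing in for the semi-exhaustive list.
IRREGULAR = {"children": "child", "mice": "mouse"}

def singularize(word):
    """Return a lowercased singular guess for `word` using suffix rules."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    for suffix, replacement in RULES:
        # Require at least one character before the suffix.
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return lower[: -len(suffix)] + replacement
    return lower

for w in ["churches", "boxes", "cities", "women", "cats", "children"]:
    print(w, "->", singularize(w))
```

Rules like these over-apply: "glass" becomes "glas" and "houses" becomes "hous", which is the kind of error a morphological stemmer such as the one in option 2 is meant to avoid.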

tgalery commented 9 years ago

Hi @abhishekg2389, I'm sorry but I am still swamped. I might try to have a look at the Dropbox files on Sunday, but it's a bit unlikely. Is your plural remover related to normalizing data in pignlproc or to the possible loose sf matching you are investigating at the moment?

abhishekg2389 commented 8 years ago

Hi @tgalery,

Have you had a chance to look at the uploaded files?

tgalery commented 8 years ago

Hi @abhishekg2389, sorry for the massive delay, but I haven't had much time to look at things. I might try this week or next. To be honest, we are trying to move away from pignlproc to a newer system based on Spark. That system builds on a repo called json-wikipedia https://github.com/diegoceccarelli/json-wikipedia which creates a JSON representation of Wikipedia. It would be great if you could add a field to the page schema that captures the bold surface forms. The parsing of pages is done by some libs, e.g. jwlpro and bliki, so they might have some helper functions already in place.