GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.
https://gatenlp.github.io/gateplugin-LearningFramework/
GNU Lesser General Public License v2.1

Chunking using MalletCRF_SEQ_MR - no chunk annotations for last sentence #117

Closed johann-petrak closed 4 years ago

johann-petrak commented 4 years ago

See https://groups.io/g/gate-users/message/496

johann-petrak commented 4 years ago

Could not replicate and no feedback in two weeks, closing.

Jacky-Miu commented 4 years ago

I'm using GATE Developer 8.6.1 and Learning Framework 4.2 to run my application. My task is to identify a short phrase (S-Subclause) within a targeted sentence (S-Sentence).

I have created a Key annotation set that contains the correctly annotated S-Sentence and S-Subclause annotations. Examples of these are shown below:

Key-S-Sentence

Key-S-Subclause

For the training part, I used the LF_TrainChunking PR; for the application part, I used the LF_ApplyChunking PR. My application pipeline is quite simple, as can be seen from the picture below:

Apply-PR

The two JAPE files create the S-Sentence annotations, and LF_ApplyChunking follows immediately after them.

The parameters for LF_ApplyChunking are again nothing sophisticated, and they are as follows:

Apply-parameters

After I ran the PR over my corpus, the LearningFramework annotation set contained the S-Subclause annotations:

LF-S-Subclause

As can be seen, it has missed the last S-Subclause. I ran the PR over 16 documents and found that this happened in every one of them: the PR failed to identify the last S-Subclause in all 16 documents.

Upon further review, I noted that the process also generated Token annotations in the LF_SEQ_TMP annotation set, so I checked the information there.

Token-it

The result above shows that the word "it" has been labelled S-Subclause|B with an LF_confidence of 0.9674. My understanding is that this word "it" should therefore have been the beginning of the S-Subclause. However, GATE Developer has not created such an annotation in the LearningFramework annotation set. Please let me know if further information is required. Thanks.

Jacky

johann-petrak commented 4 years ago

Thank you for that detailed description! Could you please report what the values of the LF_target feature are for all tokens after that "it" token, especially for the very last token within the S-Sentence annotation? Does that last token end before, exactly at, or after the end of the S-Sentence annotation?

Jacky-Miu commented 4 years ago

The LF_target value for all the tokens after that "it" token is S-Subclause|I, without any difference at all. The last token within that S-Sentence is a Full-stop (.). The LF_target value of this . is also S-Subclause|I. This last token ends exactly at the end of the S-Sentence annotation.

I have also checked a few other S-Sentence annotations. They behaved normally: the first token of an S-Subclause always has S-Subclause|B as its LF_target value, and the last token, which is always a full stop (.), has S-Subclause|I as its LF_target value.
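To illustrate the symptom described above, here is a minimal sketch of how per-token labels such as S-Subclause|B and S-Subclause|I can be turned into chunk spans. It is not the plugin's code, and the thread does not state the actual cause of the bug. The function name `decode_chunks` and the outside label "O" are assumptions made for this illustration. The sketch shows one common way a final chunk can go missing: if the chunk that is still open when the token sequence ends is never flushed, the last chunk in every sequence is lost.

```python
from typing import List, Optional, Tuple


def decode_chunks(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Convert labels like 'S-Subclause|B' / 'S-Subclause|I' into
    (chunk_type, first_token_index, last_token_index) spans.

    Any label without a '|B' or '|I' suffix (here "O", an assumed
    outside label) closes the currently open chunk.
    """
    chunks: List[Tuple[str, int, int]] = []
    current_type: Optional[str] = None
    start = 0
    for i, label in enumerate(labels):
        chunk_type, _, tag = label.partition("|")
        if tag == "B":
            # A new chunk begins; close the previous one if it is still open.
            if current_type is not None:
                chunks.append((current_type, start, i - 1))
            current_type, start = chunk_type, i
        elif tag == "I" and chunk_type == current_type:
            # The token continues the open chunk.
            continue
        else:
            # Outside label (or an inconsistent 'I'): close any open chunk.
            if current_type is not None:
                chunks.append((current_type, start, i - 1))
            current_type = None
    # Flush the chunk that is still open at the end of the sequence.
    # Omitting this step would drop a chunk that ends on the last token.
    if current_type is not None:
        chunks.append((current_type, start, len(labels) - 1))
    return chunks


labels = ["O", "S-Subclause|B", "S-Subclause|I", "O",
          "S-Subclause|B", "S-Subclause|I", "S-Subclause|I"]
print(decode_chunks(labels))
```

In this example the second chunk ends on the final token, like the full stop at the end of the last S-Sentence in the report, so it is only emitted because of the final flush.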

johann-petrak commented 4 years ago

Would it be possible for you to test application of the model with version 4.3-SNAPSHOT of the plugin? Note that only the version of the plugin that gets loaded first into GATE is used. If you do this by loading a pipeline into GATE, one simple way to change the version of the plugin loaded from the pipeline is to manually update the version number in the XML gapp file. So (after backing up the original or creating a copy for this test!) change this:

        <group>uk.ac.gate.plugins</group>
        <artifact>learningframework</artifact>
        <version>4.2</version>

to this:

        <group>uk.ac.gate.plugins</group>
        <artifact>learningframework</artifact>
        <version>4.3-SNAPSHOT</version>

If the plugin gets loaded when your session is restored, maybe unload the plugin before exiting GATE; then, after starting GATE again, use the "+" button in the plugin manager to load version 4.3-SNAPSHOT of the plugin if necessary.
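The manual edit of the gapp file described above could also be scripted. The following is a minimal sketch, assuming the `<artifact>` element is directly followed by the `<version>` element, as in the snippet shown in this thread. The function names `set_plugin_version` and `update_gapp_file` and the `.bak` backup suffix are hypothetical choices for this illustration, not part of GATE.

```python
import re
import shutil
from pathlib import Path


def set_plugin_version(gapp_xml: str, artifact: str, new_version: str) -> str:
    """Return the gapp XML text with the version of the given artifact replaced.

    Assumes <artifact>...</artifact> is immediately followed (whitespace
    aside) by <version>...</version>, as in the snippet from the thread.
    """
    pattern = re.compile(
        r"(<artifact>" + re.escape(artifact) + r"</artifact>\s*<version>)"
        r"[^<]*(</version>)"
    )
    return pattern.sub(lambda m: m.group(1) + new_version + m.group(2), gapp_xml)


def update_gapp_file(path: str, artifact: str, new_version: str) -> None:
    """Back up the gapp file next to the original, then rewrite it in place."""
    source = Path(path)
    # Keep a copy of the original, as the comment above recommends.
    shutil.copy2(source, source.with_name(source.name + ".bak"))
    text = source.read_text(encoding="utf-8")
    source.write_text(set_plugin_version(text, artifact, new_version),
                      encoding="utf-8")
```

A call such as `update_gapp_file("my-pipeline.gapp", "learningframework", "4.3-SNAPSHOT")` would then apply the change, where `my-pipeline.gapp` is a placeholder file name. The text-based replacement is used here so that the rest of the file is left byte-for-byte unchanged.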

Jacky-Miu commented 4 years ago

Great! The 4.3-SNAPSHOT version works fine!

However, I noted that this SNAPSHOT version removes the "Token" annotations from LF_SEQ_TMP. This may not be a good idea, as less information is then available to assist in debugging. Anyway, thank you for rectifying this bug!

johann-petrak commented 4 years ago

Thank you for your detailed report and for helping to find that bug!

You are right about the LF_SEQ_TMP set, it gets cleared so it cannot interfere with multiple application PRs running in sequence and because all those annotations with those features can slow down the (de-)serialization of the document.

However, the LF PRs now have a boolean "debug" parameter: if it is set, the instance annotations in the LF_SEQ_TMP set are not removed.

Jacky-Miu commented 4 years ago

I see! If so, that should be fine!

But may I know how this boolean "debug" parameter can be set? Because I can't find this "debug" parameter from the LF_ApplyChunking PR. Thanks again!

johann-petrak commented 4 years ago

This is probably due to the silly default behavior of Maven to update SNAPSHOT releases only once a day. So even though I have changed the snapshot and uploaded it into the repo, Maven will not fetch it to your computer until 24h have passed since the last fetch of that snapshot version.

You can either wait 24 hours or change the "updatePolicy" setting for Maven locally or manually delete the related directory from your .m2 local cache directory.

Jacky-Miu commented 4 years ago

I've loaded the 4.3-SNAPSHOT into my GATE Developer today, but I still cannot see the "debug" parameter in the LF_ApplyChunking PR. Should it be among the runtime parameters? I can see the same 7 parameters there, but unfortunately these do not include the "debug" parameter.

johann-petrak commented 4 years ago

I am sorry, the jar did not get deployed. It definitely is in the repo now.

Jacky-Miu commented 4 years ago

Still having problem downloading it! Following error message fyi:

        gate.util.GateException: couldn't open creole.xml for plugin: Learning Framework
            at gate.creole.CreoleRegisterImpl.registerPlugin(CreoleRegisterImpl.java:209)
            at gate.creole.CreoleRegisterImpl.registerPlugin(CreoleRegisterImpl.java:177)
            at gate.gui.creole.manager.AvailablePlugins.updateAvailablePlugins(AvailablePlugins.java:732)
            at gate.gui.creole.manager.PluginUpdateManager$2.run(PluginUpdateManager.java:136)
        Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not find artifact uk.ac.gate.plugins:LearningFramework:jar:4.3-20200319.085157-7 in gate (http://repo.gate.ac.uk/content/groups/public/)
            at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:422)
            at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:224)
            at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifact(DefaultArtifactResolver.java:201)
            at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveArtifact(DefaultRepositorySystem.java:260)
            at gate.creole.Plugin$Maven.getCreoleXML(Plugin.java:848)
            at gate.creole.CreoleRegisterImpl.registerPlugin(CreoleRegisterImpl.java:196)
            ... 3 more
        Caused by: org.eclipse.aether.transfer.ArtifactNotFoundException: Could not find artifact uk.ac.gate.plugins:LearningFramework:jar:4.3-20200319.085157-7 in gate (http://repo.gate.ac.uk/content/groups/public/)
            at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed(ArtifactTransportListener.java:48)
            at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:365)
            at org.eclipse.aether.util.concurrency.RunnableErrorForwarder$1.run(RunnableErrorForwarder.java:75)
            at org.eclipse.aether.connector.basic.BasicRepositoryConnector$DirectExecutor.execute(BasicRepositoryConnector.java:583)
            at org.eclipse.aether.connector.basic.BasicRepositoryConnector.get(BasicRepositoryConnector.java:259)
            at org.eclipse.aether.internal.impl.DefaultArtifactResolver.performDownloads(DefaultArtifactResolver.java:498)
            at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:399)
            ... 8 more

johann-petrak commented 4 years ago

This is now a different problem. Can you tell when exactly it happens? (I assume it is when you try to load the app that uses that plugin and version.) I tried to use 4.3-SNAPSHOT on a new machine (which was guaranteed not to already have it) and it worked.

When I look at the GATE Maven repo directory, it also shows that the jar it was complaining about is actually there: http://repo.gate.ac.uk/content/groups/public/uk/ac/gate/plugins/learningframework/4.3-SNAPSHOT/

The pom is here: http://repo.gate.ac.uk/content/groups/public/uk/ac/gate/plugins/learningframework/4.3-SNAPSHOT/learningframework-4.3-20200319.085157-7.pom

Does this problem persist for you when you try again, @Jacky-Miu ?

Jacky-Miu commented 4 years ago

The error message appeared when I tried to load Learning Framework 4.3-SNAPSHOT into my GATE Developer. I used the CREOLE Plugin Manager to do the following:

  1. Click "+" to add a new CREOLE plugin.
  2. Enter the following in the relevant boxes:
     Group: uk.ac.gate.plugins
     Artifact: LearningFramework
     Version: 4.3-SNAPSHOT
     Note: I have to use "LearningFramework" in the Artifact box instead of "learningframework" this time, which was different from the very first time when I loaded this 4.3-SNAPSHOT plugin.
  3. "Learning Framework (4.3-SNAPSHOT)" was shown in the Plugin Manager, so I selected this plugin and then clicked "Apply All". The error message appeared at this point, and the plugin could not be loaded.
     Note: When I did the above today, I was able to load "Learning Framework (4.3-SNAPSHOT)" the first time, but not after I had unloaded and reloaded the plugin. The same error message appeared when I executed step 3 again. So I think I need to clear something first before I can load this plugin.
  4. Even though I was successful in loading "Learning Framework (4.3-SNAPSHOT)" earlier today, I still could not see the debug parameter in the LF_ApplyChunking PR. So I am wondering if there's something I need to clear first.

Thank you again for helping in this process! Much appreciated!

ianroberts commented 4 years ago

> Note: I have to use "LearningFramework" in the Artifact box instead of "learningframework" this time, which was different from the very first time when I loaded this 4.3-SNAPSHOT plugin.

That is definitely wrong: the artifact ID is learningframework, all lower case. I suspect this is what is causing the error. If you're on Windows or Mac, where your local filesystem is case-insensitive, Aether will have been able to find the metadata files in your .m2/repository under the mixed-case name (on a case-insensitive filesystem, a request for LearningFramework/4.3-SNAPSHOT/maven-metadata.xml returns the contents of learningframework/4.3-SNAPSHOT/maven-metadata.xml without causing an error at that stage). It then fails when it tries to fetch the actual JAR files from the remote repository under the mixed-case name, which doesn't exist there: on the repository it has to be learningframework in lower case only.
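The mismatch described above can be sketched with the standard Maven repository layout, in which the group ID's dots become path separators, followed by the artifact ID, the version directory, and the file name. The helper name `repo_relative_path` is hypothetical, and the lower-casing comparison is only a rough stand-in for how a case-insensitive filesystem matches names.

```python
def repo_relative_path(group: str, artifact: str, version_dir: str,
                       file_version: str, extension: str) -> str:
    """Build the repository-relative path of an artifact file using the
    standard Maven layout: group/as/dirs/artifact/version/artifact-version.ext
    """
    return "/".join([
        group.replace(".", "/"),
        artifact,
        version_dir,
        f"{artifact}-{file_version}.{extension}",
    ])


correct = repo_relative_path("uk.ac.gate.plugins", "learningframework",
                             "4.3-SNAPSHOT", "4.3-20200319.085157-7", "jar")
mixed = repo_relative_path("uk.ac.gate.plugins", "LearningFramework",
                           "4.3-SNAPSHOT", "4.3-20200319.085157-7", "jar")

# A case-sensitive remote repository treats these as two different paths.
print(correct == mixed)
# A case-insensitive local filesystem would resolve both to the same files
# (approximated here by comparing the lower-cased paths).
print(correct.lower() == mixed.lower())
```

With the `pom` extension, the lower-case coordinates reproduce the path portion of the pom URL quoted earlier in this thread, while the mixed-case artifact ID yields a path that does not exist on the remote repository.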

I suggest that, if you can, you delete the whole of your .m2/repository/uk/ac/gate/plugins/learningframework folder and remove any "Learning Framework" plugins from your plugin manager, then try again with the correct group and artifact IDs, all in lower case. The .m2 folder is under your home directory, typically C:\Users\<yourname> on Windows or /Users/<yourname> on Mac.
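The clean-up step suggested above could be scripted as follows. This is a minimal sketch, assuming the default location of the local Maven repository under the home directory; the function names `local_artifact_dir` and `purge_cached_artifact` are hypothetical, and the optional `repo_root` argument exists only so that a non-default repository location can be passed in.

```python
import shutil
from pathlib import Path
from typing import Optional


def local_artifact_dir(group: str, artifact: str,
                       repo_root: Optional[str] = None) -> Path:
    """Return the directory of an artifact inside the local Maven repository.

    Defaults to <home>/.m2/repository, the location named in the comment above.
    """
    root = Path(repo_root) if repo_root is not None else Path.home() / ".m2" / "repository"
    return root.joinpath(*group.split("."), artifact)


def purge_cached_artifact(group: str, artifact: str,
                          repo_root: Optional[str] = None) -> bool:
    """Delete the cached artifact directory; return True if something was removed."""
    target = local_artifact_dir(group, artifact, repo_root)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `purge_cached_artifact("uk.ac.gate.plugins", "learningframework")` removes the cached folder, so the plugin is fetched again from the remote repository the next time it is loaded.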

Jacky-Miu commented 4 years ago

Thank you, Ian! I've successfully loaded the updated 4.3-SNAPSHOT following your advice. Thank you again for your and Johann's help.