Hi all, is it possible to train AllenNLP's coreference resolver with custom data? What format does the custom data have to be in? And finally, if it is possible, how do I accomplish it? Thank you very much.
Yes, you can train coref with custom data. The training instructions are at https://demo.allennlp.org/coreference-resolution. The hard part is that the original data is not freely available, so it's hard to look at.
You probably don't need all of that stuff. It'll be easier to modify the dataset reader for coref to read some other format that you have available. Dataset readers are quite easy.
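For context, the reader the coref configs use is registered under the name "coref" (class ConllCorefReader in allennlp-models), and the training config selects it by that name, roughly like this (a trimmed sketch; the exact settings vary by version):

"dataset_reader": {
    "type": "coref",  // swap this for your own reader's registered name
    "max_span_width": 30,
},

So plugging in a modified reader is mostly a matter of registering it under a new name and changing "type".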
Thank you very much. I assume that AllenNLP can train on custom data for relation extraction as well?
Hi, I'm facing an issue with training. I am running this command:
allennlp train coref_spanbert_large.jsonnet -s output/
I get the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 157: character maps to <undefined>
I installed allennlp 2.8.0 and allennlp-models 2.8.0, and my coref_spanbert_large.jsonnet file is the same as the one linked, except that I changed these three lines:
"train_data_path": std.extVar("train.demo.v4_gold_conll"),
"validation_data_path": std.extVar("dev.demo.v4_gold_conll"),
"test_data_path": std.extVar("test.demo.v4_gold_conll"),
I'm using v4_gold_conll files (which are basically the same as CoNLL-2012 files); is that acceptable? I am using these because neuralcoref uses them as well.
Also, why do we need std.extVar? Is that for accessing online files? Thank you very much.
Hi, I have nearly succeeded in training the coreference resolver. I did this on a Linux machine, and I ran this command in coref/training_config in allennlp-models (I installed both allennlp and allennlp-models from source):
allennlp train coref_spanbert_large.jsonnet -s /test_output
I have this file (train.demo.v4_gold_conll) in the train/, test/, and dev/ folders (just for a test run), which all sit in the same directory:
#begin document (demo); part 000
demo 0 0 Yes - - - - - - * -
demo 0 1 , - - - - - - * -
demo 0 2 I - - - - - - * (1)
demo 0 3 noticed - - - - - - * -
demo 0 4 that - - - - - - * -
demo 0 5 many - - - - - - * -
demo 0 6 friends - - - - - - * -
demo 0 7 , - - - - - - * -
demo 0 8 around - - - - - - * -
demo 0 9 me - - - - - - * (1)
demo 0 10 received - - - - - - * -
demo 0 11 it - - - - - - * (2)
demo 0 12 . - - - - - - * -
demo 0 0 It - - - - - - * -
demo 0 1 seems - - - - - - * -
demo 0 2 that - - - - - - * -
demo 0 3 almost - - - - - - * -
demo 0 4 everyone - - - - - - * -
demo 0 5 received - - - - - - * -
demo 0 6 this - - - - - - * (2)
demo 0 7 SMS - - - - - - * (2)
#end document
I get this error:
Do you know where I might be going wrong? Could it be the #begin document (demo); part 000 line? Should it be #begin document (bc/cctv/00/cctv_0000); part 000 instead?
This is the version of Linux I am running:
Linux logical 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
why do we need std.extVar?
We use that to read environment variables. You can say export MY_VARIABLE=foo in your shell, and then read out the value in the config with std.extVar("MY_VARIABLE").
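So instead of wrapping the file names themselves in std.extVar, you can either hard-code the paths in the config:

"train_data_path": "train/train.demo.v4_gold_conll",

or keep the original config's std.extVar calls and export the variables before training. If I remember right, the stock config reads COREF_TRAIN_DATA_PATH, COREF_DEV_DATA_PATH, and COREF_TEST_DATA_PATH; the path here is just an example:

export COREF_TRAIN_DATA_PATH=train/train.demo.v4_gold_conll
allennlp train coref_spanbert_large.jsonnet -s output/

(and likewise for the dev and test variables).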
UnicodeDecodeError
There is likely something wrong with your input file, maybe a special character that's not encoded correctly? Either way, it sounds like you got past that problem?
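If it does come back: the 'charmap' codec in your traceback suggests the file was read with a Windows default codec such as cp1252, so a quick sanity check is whether the file is actually valid UTF-8. A minimal sketch in Python (the file name is taken from your messages above):

# Try to decode the raw bytes as UTF-8 to locate any bad byte.
with open("train.demo.v4_gold_conll", "rb") as f:
    raw = f.read()
try:
    raw.decode("utf-8")
    print("Valid UTF-8; the file was probably just read with the wrong codec.")
except UnicodeDecodeError as err:
    print(f"Undecodable byte {raw[err.start]:#x} at offset {err.start}")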
I don't know exactly what's going wrong with your input file, but I have two observations:
1. The parse column is expected to contain pieces like (S(NP*). The line that crashes tries to find the part to the left and to the right of the *; if there is no * at all, it will crash.
2. All your columns, except for one, don't have the *.
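Roughly, the failing logic has this shape (a simplified illustration, not the actual reader source):

# A parse-column cell from a real Ontonotes file: unpacking succeeds.
left, right = "(S(NP*".split("*")   # -> ("(S(NP", "")

# A cell with no asterisk, as in your file: unpacking fails.
left, right = "-".split("*")        # ValueError: not enough values to unpack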
I used the Inception tool to create the CoNLL-2012 files (link here), but the file that you showed above is slightly different from what I got using the Inception tool. I do not get the same columns with the asterisk. Do you think adding 2-3 columns containing only an asterisk, using a script, could solve the issue? Thank you very much.
I don't know what the asterisk is used for in the reader. Probably for constructing parse trees, which you don't need for coref. I think it might be better to make a copy of the reader and modify it. You could get rid of a lot of code in there and only keep the bits you need for coref.
I tried adding two more columns with an asterisk, but I ran into the same error as before:
The CoNLL file I am using is as follows (note that I changed it to a .conll file, not a .txt file, for training):
I then tried to train on the sample file you gave, except with #end document at the end (it is as follows):
I got this error:
What do you think is causing the error? It looks like your code needs the entire Ontonotes dataset to do custom training, but that makes custom training very difficult. Can you recommend a course of action? Thank you very much.
AllenNLP does not need the entire Ontonotes dataset, just a dataset in the right format. But the Ontonotes format is complicated, because it contains a lot of stuff that's unnecessary for training a coref model. I recommend writing your own DatasetReader that produces data in the right format for the model, but reads a different input format.
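As a starting point, here is a minimal sketch of what such a reader could look like. It reuses ConllCorefReader.text_to_instance, so only the file parsing is new. The JSON-lines input format (one document per line, with pre-tokenized sentences and clusters given as inclusive token spans over the flattened document) and the reader name are assumptions for illustration, not anything AllenNLP prescribes:

# my_coref_reader.py
import json
from typing import Iterable

from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader
from allennlp_models.coref import ConllCorefReader

@DatasetReader.register("my_jsonl_coref")  # hypothetical name
class JsonlCorefReader(ConllCorefReader):
    # Each input line looks like:
    # {"sentences": [["Yes", ",", "I", ...], ...],
    #  "clusters": [[[2, 2], [9, 9]], [[11, 11], [19, 20]]]}
    def _read(self, file_path: str) -> Iterable[Instance]:
        # Explicit encoding avoids the 'charmap' issue from earlier.
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                clusters = [[tuple(span) for span in cluster]
                            for cluster in doc["clusters"]]
                yield self.text_to_instance(doc["sentences"], clusters)

In the config you would then set "dataset_reader": {"type": "my_jsonl_coref", "max_span_width": 30, ...} (keeping the stock reader's other settings) and pass --include-package my_coref_reader to allennlp train so the registration runs.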
I will attempt to write my own DatasetReader. However, in order to do so, I need to start from data that trains successfully, as a base for my code; otherwise I cannot proceed. I humbly request a source for this sample data, as the sample data you provided me is not working. If that is not possible, could you point me to references for finding data in the right format for the model? Thank you very much.
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
Sorry, we are legally not allowed to give out the source data. It's stupid, but that's what it is. It came from https://catalog.ldc.upenn.edu/LDC2013T19. I think you can go to that website, sign up, and then you get a link to download.
This information is at https://demo.allennlp.org/coreference-resolution.