dalab / end2end_neural_el


Training on a new dataset #15

Open iuria21 opened 5 years ago

iuria21 commented 5 years ago

Hi, first of all, thanks for your work. I have a short question and maybe you could help me:

I'm creating a new dataset: I have data labeled with NER tags and a link for each entity. I could create a dataset like the following (instead of a Wikipedia link I have a link to a law code):

As
we
saw
in
the
Mortgage   B   LJ/2006/172   1234
Law        I   LJ/2006/172   1234
...

Can I train a model with this data, or do I need something else? There are some columns in aida_train.txt whose meaning I don't know. And do you think the entity embeddings will also be useful in this case?
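For reference, here is a minimal sketch of how a file like the example above could be parsed into (mention, link) pairs. It assumes whitespace-separated columns of token, BIO tag, link, and a last column treated (purely as a guess) as a document id; the file name and column meanings are assumptions for illustration, not the repo's actual format.

```python
# Hypothetical sketch (not the repo's format): collect (mention, link) pairs from
# a CoNLL-style file whose columns are assumed to be token, BIO tag, link, and a
# last column treated here as a document id.

def read_mentions(path):
    """Yield (doc_id, mention_tokens, link) triples from the annotated file."""
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:                 # blank line: sentence/document break
                if current:
                    yield current
                    current = None
                continue
            token = parts[0]
            tag = parts[1] if len(parts) > 1 else "O"
            if tag == "B":                # start of a new mention
                if current:
                    yield current
                link, doc_id = parts[2], parts[3]
                current = (doc_id, [token], link)
            elif tag == "I" and current:  # continuation of the open mention
                current[1].append(token)
            else:                         # untagged token closes any open mention
                if current:
                    yield current
                    current = None
        if current:
            yield current

for doc_id, tokens, link in read_mentions("my_law_corpus.txt"):
    print(doc_id, " ".join(tokens), link)
```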

Thanks!!

severinsimmler commented 4 years ago

+1

I would also be interested in this.

severinsimmler commented 4 years ago

@basque21, did you have any success with training a model on your own data?

iuria21 commented 4 years ago

Hi, no, I'm sorry, but I didn't get any results (or any answer here), so I tried other models...

octavian-ganea commented 4 years ago

Hi all,

If you use the same format as the aida file in our repo, that should work. Did you try that?

octavian-ganea commented 4 years ago

@NikosKolitsas can you please help these people? Thanks!

NikosKolitsas commented 4 years ago

Hello, sorry for not answering earlier, but I have been working on other things for the last few years.

In this work, Entity Recognition and Disambiguation are done simultaneously, and the entity vectors play a crucial role in the process. So if you want to run this system on your own domain (which I guess has completely different entities from the ones in Wikipedia), you should definitely create your own entity vectors. Instructions on how to do that can be found here. Another important part of the system is the probabilistic mention-entity map p(e|m), which I guess you also have to rebuild for your domain.

Regarding the format of the input files: this is the last and easiest thing, and you don't have to worry much about it. In the folder preprocessing you can find code that handles a few different formats (the Aida dataset format, another xml-based format, and Gerbil) and converts all of them to a common simplified format. The new simplified format can be found in the folder ./data/new_datasets/, so I would recommend creating/converting your dataset to this format directly.

In general, the code has implementation details that target the purpose of the paper, i.e. NER and ED on the available datasets with Wikipedia concepts, plus evaluation with Gerbil and training optimizations with tfrecords; it was not designed with a plug-and-play mentality. Another thing you should take care of is the mapping from wiki ids to neural-network ids (wikid2nnid, i.e. the mapping from concept ids to entity vectors in your entity-embeddings array).
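To make the two domain-specific resources above more concrete, here is a rough Python sketch of a p(e|m) prior counted from annotated (mention, entity_id) pairs and a wikid2nnid-style map from entity ids to rows of an entity-embedding array. The helper names and the example pairs are made up for illustration; the repo's actual file formats and loaders differ.

```python
# Rough sketch, not the repo's implementation: count a p(e|m) prior from
# annotated (mention, entity_id) pairs and build a wikid2nnid-style mapping
# from entity ids to contiguous rows of the entity-embedding matrix.
from collections import Counter, defaultdict

def build_p_e_m(mention_entity_pairs):
    """Estimate p(entity | mention) by counting annotated co-occurrences."""
    counts = defaultdict(Counter)
    for mention, entity in mention_entity_pairs:
        counts[mention.lower()][entity] += 1
    return {m: {e: c / sum(ec.values()) for e, c in ec.items()}
            for m, ec in counts.items()}

def build_entity2nnid(entity_ids):
    """Assign each entity id a contiguous row index into the embedding matrix."""
    return {eid: i for i, eid in enumerate(sorted(set(entity_ids)))}

# Made-up example pairs; the second law code is purely illustrative.
pairs = [("mortgage law", "LJ/2006/172"),
         ("mortgage law", "LJ/2006/172"),
         ("civil code", "LJ/1889/1")]
p_e_m = build_p_e_m(pairs)
entity2nnid = build_entity2nnid(e for _, e in pairs)
print(p_e_m["mortgage law"])   # {'LJ/2006/172': 1.0}
print(entity2nnid)             # {'LJ/1889/1': 0, 'LJ/2006/172': 1}
```

The row indices produced this way would have to line up with the order in which the entity vectors are stored, in the same way wikid2nnid lines up with the pretrained Wikipedia embeddings.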