deeppavlov / deeppavlov-gsoc-ideas


Relation Extraction #12

Closed dmitrijeuseew closed 2 years ago

dmitrijeuseew commented 3 years ago

difficulty: medium
mentor: @dmitrijeuseew
requirements: python
useful links: Relationship Extraction in Papers with Code

Description https://docs.google.com/document/d/1Q6Locx2CzXBR_Xop-ysim6bf1L0EH7wHf28XCXkYbQc/edit?usp=sharing

Coding Challenge

Build a Relation Prediction component based on two entities in a phrase

tathagata-raha commented 3 years ago

@dmitrijeuseew I am exploring this problem and going through related research. It would be helpful if you could share some related research papers.

One idea is to look relations up in a knowledge graph. In that case, however, the knowledge graph might not be exhaustive and we might not get many relations. Another option is to use OpenIE or MinIE to extract relations. I hope I am exploring in the correct direction.

dmitrijeuseew commented 3 years ago

@tathagata-raha One of the approaches is to consider relation extraction as a classification task. Entities in the text are replaced with their corresponding NER tags, the sentence is fed into BERT, and the [CLS]-token output vector is fed into a dense layer for classification into classes corresponding to relations. Another approach is joint entity and relation extraction (https://arxiv.org/abs/2010.03851)
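
For illustration, a minimal sketch of that classification setup using the Hugging Face transformers library; the label set and the untrained classification head are placeholders, not an actual DeepPavlov component:

```python
# Sketch of relation extraction as sentence classification, as described above.
# Assumptions: bert-base-uncased as the encoder and a toy label set; in practice
# the labels come from the target dataset (e.g. TACRED relation types).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

RELATIONS = ["no_relation", "founder", "inception"]  # placeholder label set

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(RELATIONS)
)

# Entities replaced with their NER tags, as suggested above.
sentence = "PERSON founded ORG in 1976."

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # head applied to the [CLS] pooled output
print(RELATIONS[logits.argmax(-1).item()])  # untrained head: arbitrary output
```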

dmitrijeuseew commented 3 years ago

Also, relation extraction can be considered as a question answering task (https://arxiv.org/abs/2010.04829)
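
A rough sketch of that QA framing with a stock extractive QA model; the question templates and confidence threshold below are invented for illustration, whereas the cited paper defines its own question generation scheme:

```python
# Relation extraction framed as extractive QA: each candidate relation becomes
# a question about the head entity, and an answered span yields a triple.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = "Steve Jobs founded Apple in 1976."
templates = {
    "founder":   "What did {head} found?",
    "inception": "When did {head} found the company?",
}

for relation, template in templates.items():
    result = qa(question=template.format(head="Steve Jobs"), context=context)
    if result["score"] > 0.5:  # arbitrary confidence cutoff
        print(("Steve Jobs", relation, result["answer"]))
```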

tathagata-raha commented 3 years ago

@dmitrijeuseew @danielkornev Can you explain the coding challenge a bit? In the GSoC task itself we have to build a relation extraction model, right? What is the expected outcome of this coding challenge?

For example, given the sentence "Steve Jobs founded Apple", the output would be {"Steve Jobs", "Apple", "founded"}. For sentences with two entities, this could be built either with simple rule-based models that detect the subject, verb, and object and derive the relation from them, or with deep learning models that give much more accurate results. So, for the coding challenge, are we expected to build simple models to get the hang of relation extraction, or should we build deep learning models too?
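
For reference, a toy version of the rule-based subject-verb-object idea from the question above, using spaCy's dependency parse (the helper names are illustrative, not from this thread):

```python
# Toy rule-based extractor for the simple two-entity case: take the root verb
# and its nominal subject and direct object from the dependency parse.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def full_span(token):
    # expand a head token to its subtree text, e.g. "Jobs" -> "Steve Jobs"
    return " ".join(t.text for t in token.subtree)

def extract_svo(sentence):
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ == "VERB":
            subj = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            obj = [c for c in token.children if c.dep_ == "dobj"]
            if subj and obj:
                return (full_span(subj[0]), token.lemma_, full_span(obj[0]))
    return None

print(extract_svo("Steve Jobs founded Apple."))  # ('Steve Jobs', 'found', 'Apple')
```

As the following comments point out, this breaks down once a sentence contains more than two entities or more than one triplet.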

danielkornev commented 3 years ago

@dmitrijeuseew ?

dmitrijeuseew commented 3 years ago

@tathagata-raha The sentences can contain several (more than two) entities and more than one subject-relation-object triplet. It is rather difficult to write rule-based methods for relation extraction in sentences with multiple entities, so I suggest using deep learning models, for example, https://towardsdatascience.com/bert-s-for-relation-extraction-in-nlp-2c7c3ab487c4 or https://arxiv.org/abs/2010.03851.

potato-patata commented 3 years ago

@dmitrijeuseew Can we use a 1D-CNN for feature extraction (in this case, of entities) and train the model with an existing dictionary? I previously worked on a similar project where I predicted whether a sentence was sarcastic or not on Twitter data using a lexicon and 1D-CNN approach. Please guide me on whether this is a correct approach or not.

dmitrijeuseew commented 3 years ago

@potato-patata Could you please explain how you are going to encode the entities in the sentence?

potato-patata commented 3 years ago

I will be using the "spaCy" library, which offers more flexibility than NLTK and provides richer functionality; it will make encoding easy and will help with NER and, eventually, relation extraction. I will attach a sample notebook highlighting this in some time.

dmitrijeuseew commented 3 years ago

@potato-patata I suggest using the NER model from DeepPavlov: https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/configs/ner/ner_ontonotes_bert.json
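
A minimal usage sketch for that config via the DeepPavlov Python API (the exact config handle may differ between DeepPavlov versions):

```python
# Load the suggested NER model and tag a sentence; download=True fetches the
# pretrained weights on first use.
from deeppavlov import build_model, configs

ner_model = build_model(configs.ner.ner_ontonotes_bert, download=True)
tokens, tags = ner_model(["Steve Jobs founded Apple in 1976."])
print(list(zip(tokens[0], tags[0])))
# expected BIO tags roughly like: B-PERSON I-PERSON O B-ORG O B-DATE O
```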

potato-patata commented 3 years ago

Hey @dmitrijeuseew, can you have a look at the branch "relation_extraction_coding_challenge"? I have pushed a sample Jupyter notebook there that incorporates spaCy.

tathagata-raha commented 3 years ago

@dmitrijeuseew, can you suggest a dataset on which I can train my model? I tried to go for the ACE 2005 dataset, but it is not free.

dmitrijeuseew commented 3 years ago

@tathagata-raha You can use the TACRED dataset.

dmitrijeuseew commented 3 years ago

@potato-patata Your solution is very good, but it has the limitation of extracting only one relation from the sentence. A sentence can have several relations (for example, the sentence "Steve Jobs founded Apple in 1976." contains two relations: "founder" and "inception"). I suggest using neural network-based methods for relation extraction.

Daishinkan002 commented 3 years ago

@dmitrijeuseew I'm unable to find the TACRED dataset. This is what I got when I was searching: "You can download TACRED from the https://catalog.ldc.upenn.edu/LDC2018T24. If you are an LDC member, the access will be free; otherwise, an access fee of $25 is needed." I need help downloading it. Thanks in advance.

potato-patata commented 3 years ago

@Daishinkan002 Follow this repo for downloading the TACRED dataset: https://github.com/yuhaozhang/tacred-relation. Hope this helps :)

tathagata-raha commented 3 years ago

The dataset contains only 20 training examples. It doesn't make any sense to build a model on top of that.

kirikbandar commented 3 years ago

Hi @dmitrijeuseew, I am interested in working on this task and have been doing some research on the different ways relation extraction can be done. In this particular case, what are we looking at; are we open to any approach (there are a few papers that you have mentioned, with different methodologies)? Also, at this point, are we looking more at exploring which method suits our case better and then drafting a proposal/application for the same, or would you suggest that we just pick a particular approach, start trying to implement it, and push those changes?

dmitrijeuseew commented 3 years ago

@kirikbandar I think that it would be very useful for the DeepPavlov library (and good practice for you) to implement the approach with table-sequence encoders (joint entity and relation extraction), as in https://arxiv.org/abs/2010.03851.
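
For intuition, the table formulation labels every pair of tokens: diagonal cells carry entity tags and off-diagonal cells carry relations between the corresponding spans. A toy illustration of such a gold-label table (the actual paper predicts this table jointly with a sequence encoder, which is not shown here):

```python
# Toy gold-label table for joint entity and relation extraction via table
# filling: an n x n table whose diagonal holds entity tags and whose
# off-diagonal cells hold relations between token pairs.
tokens = ["Steve", "Jobs", "founded", "Apple", "."]
n = len(tokens)
table = [["O"] * n for _ in range(n)]

# diagonal: BIO entity tags
for i, tag in enumerate(["B-PER", "I-PER", "O", "B-ORG", "O"]):
    table[i][i] = tag

# off-diagonal: relation between tokens of the two entity spans
table[1][3] = "founder"  # (Jobs, founder, Apple)

for row in table:
    print(row)
```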

dmitrijeuseew commented 3 years ago

@kirikbandar We can discuss other methods if you would like.

kirikbandar commented 3 years ago

Thanks, @dmitrijeuseew. Alright, I will work on the table-sequence encoder approach and also keep an eye on other methods.

dmitrijeuseew commented 3 years ago

@Daishinkan002 @potato-patata @tathagata-raha You can use also NYT dataset https://github.com/xiangrongzeng/copy_re (no license required).

dmitrijeuseew commented 3 years ago

@kirikbandar Please tell me which GitHub repository and branch you are going to use for development of the model (so that I can track the progress).

oserikov commented 3 years ago

We're starting regular review sessions of your application proposal drafts.

Every Thursday you can submit a Google doc with your proposal (remember the limit of 3 final proposals in total); enable commenter access on the provided link. You won't have to re-submit your proposal(s) if the link stays the same.

Proposals should follow our released template

Our mentors will review them and provide feedback on a weekly basis.

GForm for proposals review: https://forms.gle/2PoHAgv9rjR1fuug7

kirikbandar commented 3 years ago

@dmitrijeuseew I am using the DeepPavlov repo (https://github.com/deepmipt/DeepPavlov). I have created a branch called "table_sequence_encoder_based_relation_extraction" with an empty notebook as of now. I shall make my updates there.

tathagata-raha commented 3 years ago

Hi @dmitrijeuseew, as far as I have seen, the table-sequence-based approach has already been implemented here. Do we need to implement it again for the purpose of the coding challenge? We can in any case use their implementation as a basis for building a DeepPavlov model. The main challenge is migrating the logic to the DeepPavlov stack. The migration would take a bit of time and has to be done carefully (after careful inspection of their code), which I could plan in the GSoC proposal and do during the GSoC period.

For the purpose of the coding challenge, I am planning to use a simple BERT-based two-way classification method or the Matching the Blanks approach by Google Research.

P.S. The Matching the Blanks paper could also be implemented and migrated to DeepPavlov, because it also produces top results on a lot of datasets.
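
For reference, Matching the Blanks represents a relation instance by wrapping the two entity mentions in marker tokens before encoding; a minimal sketch of that input representation (the [E1]/[E2] names follow the paper, added here as custom tokens):

```python
# Entity-marker input representation from Matching the Blanks: wrap the two
# mentions in special tokens and use the hidden states at the entity-start
# markers as the relation encoding.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
)
model = BertModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))  # room for the new marker tokens

text = "[E1] Steve Jobs [/E1] founded [E2] Apple [/E2] in 1976."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# relation representation: concatenated states at the [E1] and [E2] markers,
# which would be fed to a relation classifier during fine-tuning
ids = inputs["input_ids"][0].tolist()
e1 = ids.index(tokenizer.convert_tokens_to_ids("[E1]"))
e2 = ids.index(tokenizer.convert_tokens_to_ids("[E2]"))
relation_repr = torch.cat([hidden[0, e1], hidden[0, e2]])
print(relation_repr.shape)  # torch.Size([1536])
```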

dmitrijeuseew commented 3 years ago

@tathagata-raha Yes, the challenge is to migrate the model to DeepPavlov configs (so that the model can be trained using the DeepPavlov command python -m deeppavlov train relation_extraction and launched using python -m deeppavlov interact relation_extraction). Also, if you have any ideas on how to improve the model so that it shows better metrics, you are welcome.
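
The same train/interact cycle is also available from Python; a sketch assuming a hypothetical relation_extraction config has been registered in the library (no such config exists yet, which is the point of the challenge):

```python
# DeepPavlov workflow the coding challenge targets, with a hypothetical
# relation_extraction config as the target of the migration.
from deeppavlov import build_model, train_model

# equivalent to: python -m deeppavlov train relation_extraction
model = train_model("relation_extraction", download=True)

# equivalent to: python -m deeppavlov interact relation_extraction
model = build_model("relation_extraction", download=True)
print(model(["Steve Jobs founded Apple in 1976."]))
```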

tathagata-raha commented 3 years ago

So my question is: are we supposed to migrate the model now, as part of the coding challenge, or in the GSoC period? As I understood from looking at the other issues, the coding challenges are supposed to give us an understanding of how DeepPavlov works, but migrating the whole model to DeepPavlov should be part of the GSoC task itself, right?

AjinkyaDeshpande39 commented 3 years ago

Respected sir, I am Ajinkya, a BTech student from Nagpur, India. I am very interested in working on this project. From the above discussion, I understood that the project is about "migrating the model". My question is: do we have to implement the model now itself, or is it going to be part of the project, with some different approach or with changes in it? Also, do we need to mention results in the proposal?

dmitrijeuseew commented 3 years ago

@tathagata-raha Yes, migrating the model is part of the GSoC task.

dmitrijeuseew commented 3 years ago

@AjinkyaDeshpande39 The task is to migrate the model, but if you have any ideas on how to improve it so that it shows better metrics (or if you are going to use another approach that shows better metrics), you are welcome.

saarahasad commented 3 years ago

@dmitrijeuseew Hello everyone! I'm Saarah (https://saarahasad.github.io) from India, currently doing a Master of Technology in Computer Science. I've been working on this for a while as well. I'm working on migrating the model. It has been taking a long time to train (more than a day), so I'm trying to get access to/borrow a setup fit for this. I've been training it using the CoNLL04 dataset; hope that's okay. Also, I noticed that Tathagata had sent a paper that used the DocRED dataset. I just want to confirm that we're focusing on sentence-level RE and not document-level RE?

dmitrijeuseew commented 3 years ago

@saarahasad Hello! A document-level RE model is also very useful for the DeepPavlov library.

rusdes commented 3 years ago

@dmitrijeuseew Hello! I am Rushil Desai. I've been researching this particular task and DeepPavlov for a couple of days, and I strongly feel that alongside sentence-level RE (with table-sequence encoders, as in https://arxiv.org/abs/2010.03851), document-level RE (https://arxiv.org/pdf/2010.11304.pdf) is an important feature a conversational framework like DeepPavlov should have, as most of the time the input will consist of multiple sentences or texts. So basically, I am proposing to migrate the sentence-level RE (the table-sequence encoder one) to DeepPavlov configs and additionally convert the document-level RE from PyTorch to TensorFlow if it helps with its integration or makes it easier to work with in the future (although I read that DeepPavlov is built on top of TensorFlow, Keras, and PyTorch), and then migrate this to DeepPavlov configs as well.

What do you think about this? Should I go ahead with writing the proposal and submit it in the Google Form?

dmitrijeuseew commented 3 years ago

@rusdes Hello! Yes, you should write a proposal and submit it. I think that converting the document-level RE from PyTorch to TensorFlow is not necessary, because we are now converting our DeepPavlov models from TensorFlow to PyTorch, so it would be better if the RE model uses PyTorch.

saarahasad commented 3 years ago

I had a query on using RE for KBQA, @dmitrijeuseew.

Could the extracted relations be used together with the entities to improve the results when finding the top-k relations predicted by the classification model?

or

Could they be used to improve the results of entity linking? That would resolve ambiguity to some extent (when there are multiple matches in the KB).

or

Would RE replace the need to rank candidate relations (which would fill the query template), because you can now extract the relations from the question itself?

Daishinkan002 commented 3 years ago

Or would it be useful for solidifying our knowledge base, instead of iterating over redundant information in datasets across multiple domains? @dmitrijeuseew