Aspect Sentiment Triplet Extraction Baseline

farinamhz commented 1 year ago

Hey @arfaamr, I hope you are doing well. As we want to add a new baseline in LADy that is suitable for both aspect term and sentiment extraction, I have chosen Enhanced Multi-Channel Graph Convolutional Network for Aspect Sentiment Triplet Extraction paper accepted in ACL 2022. You can check it out and run it using their repo at this link: https://github.com/CCChenhao997/EMCGCN-ASTE

Let me know if you have any questions or need help on this task.

@hosseinfani

arfaamr commented 1 year ago

Hi, I've taken a look at the repo and paper. I'm a little confused; am I supposed to try to train/test the model on the augmented datasets used with LADy? I assumed our repo would have datasets with the original reviews + backtranslated reviews to run the model with, but I wasn't able to find where that would be, or if I'm approaching this correctly at all.

farinamhz commented 1 year ago

Hey @arfaamr, We'd like you to first check out the repo and run it on their datasets to figure out how it works. As the datasets are also similar (they are semeval but may be in different formats), it will be easier to integrate their work with LADy. So, after you can successfully run their codeline, we will integrate it with LADy.

We can talk more in our meeting today if it still needs clarification.

arfaamr commented 1 year ago

[Last week's progress]

Tried to install required libraries for EMCGCN
- transformers could not be installed to specified version [error: failed building wheels for tokenizers], so installed current version and slightly modified code to work with it instead
Ran code -- error: Torch not compiled with CUDA enabled. It is not compatible with my device, so switched it to using mps (enables gpu training on mac)
Now able to start training model, but stops at ~10% due to out of memory error
This week, will try on cpu/another device, and try training on toy dataset first

hosseinfani commented 1 year ago

Thanks @arfaamr Please update us on your run on toy example. Then connect with @3ripleM to integrate it to LADy pipeline. thanks.

arfaamr commented 1 year ago

@hosseinfani: I tried using cpu with the full original dataset, and it trains without error, but is slow. I set epochs=2 instead of 100 just to see if it would train/test to completion, and it did. Should I still make a toy dataset and train with the default 100 epochs, or is this enough to evaluate that the code works, and move to integrating with LADy?

hosseinfani commented 1 year ago

@arfaamr

awesome. yes, it's enough. Please post the snapshot of your run here for the record.
you can go ahead with integrating it to lady. please connect with Medi to create a new class definition and etc.

@3ripleM please connect with Arfaa for this.

arfaamr commented 1 year ago

Run on cpu for 2 epochs

3ripleM commented 1 year ago

Hi @arfaamr ,

For the integration, we have a file named mdl.py, which is an abstract class for creating a new baseline.

To add the new baseline to LADy, you need to add a file to the aml directory, and its name should match the name of the baseline. Additionally, in the main.py, you should add the model name to line 204.

Furthermore, there are more definitions to understand in the abstract class. Please investigate the class model, and if there are any parts you don't understand, feel free to contact me.

We can also schedule a Teams meeting. I believe it would be better to have a meeting once you've had a chance to review the LADy project in order to discuss your tasks.

arfaamr commented 1 year ago

Thanks @3ripleM. I have created emc.py and added the model to main.py. So far, I have only created a basic init() with naspects, nwords. I'm not sure how to integrate the rest of the needed functions. I would be available to meet anytime today or tomorrow after noon.

arfaamr commented 1 year ago

This week, started functions in emc.py, looking at other files in /aml for reference. Most of the other files seem to import a library for the baseline, ie import fastext, bert_e2e_absa, to use functions like train, etc. so I tried that. I created an __init__.py in my baseline's repo, but am not sure exactly how to import it in emc.py, as the repo name has a dash, which is invalid in import statements. Additionally, I noticed settings['train'] in params.py seems to have sections for other baselines. Should I add my baseline's args in there as well? Also, I am currently working on this on a local branch of LADy. Should I push it so someone can check my progress and see if it looks okay so far, or just wait until I finish the functions?

arfaamr commented 1 year ago

This week, I worked on preprocess() in emc.py, for converting LADy's data to the format used in the EMCGCN code to use for training, etc. in emc.py.

I found that LADy's raw data is as XML files. EMC's raw data is stored in JSON and .vocab files. The JSON files seem equivalent to LADy's XML files, but I am not sure what the .vocab files are for. They have names such as postag, deprel, and are loaded in with pickle.load, but I was not able to open or print them to see what exactly they contain.

As I understand it, LADy's XML data is used to create Review objects, which are used for training, etc. EMC's JSON and .vocab data is used to create Instance objects, which are similar to Review objects. I am working on converting Review objects to Instance objects, so that when LADy is run, its XML data will be used to create Reviews, then those can be used to create Instances, which will be used in emc.py's train(), etc.

Let me know if this is an incorrect interpretation of how LADy works or how I should approach this.

arfaamr commented 12 months ago

Almost finished writing wrapper.py using parameters of Review objects to write to json file. Have not tested it yet, only unsure about the meaning of a few keywords that don't directly correspond like "postag", "deprel", "target_tags".

farinamhz commented 12 months ago

Hi @arfaamr, thank you. Are they variables of the code? Could you give me an example of the code in which they have been used?

arfaamr commented 12 months ago

@farinamhz: I think I figured out what "target_tags" means now, but I'm still unsure about the others.

For "postag", "deprel", etc, they are written in the input JSON files like this, for each review:

{ ... "postag": ["CC", "DT", "NN", "VBD", "RB", "JJ", "IN", "PRP", "."], ... , "deprel": ["cc", "det", "nsubj", "cop", "advmod", "root", "case", "obl", "punct"]}

And are also loaded in from .vocab files and used in train(), like this:

l_rpd = 0.01 F.cross_entropy(post_pred.reshape([-1, post_pred.shape[3]]), tags_symmetry_flatten, ignore_index=-1) l_dep = 0.01 F.cross_entropy(deprel_pred.reshape([-1, deprel_pred.shape[3]]), tags_symmetry_flatten, ignore_index=-1) l_psc = 0.01 F.cross_entropy(postag.reshape([-1, postag.shape[3]]), tags_symmetry_flatten, ignore_index=-1) l_tbd = 0.01 F.cross_entropy(synpost.reshape([-1, synpost.shape[3]]), tags_symmetry_flatten, ignore_index=-1)

I did not see them mentioned in the paper, and I couldn't find anything about them online.

farinamhz commented 12 months ago

@arfaamr,

I haven't checked out the code, but based on what you provided,

In my opinion, "postag" appears to represent part-of-speech tagging (POS tagging).

1_FU82azyx9DciALX9bU4ZeQ

And "deprel" seems to be helpful for parsing or similar things. It also appears to represent a linguistic annotation, specifically for dependency relations in a structured data format. In dependency grammar, these labels describe the relationships between words in a sentence.

arfaamr commented 11 months ago

In meeting with Farinam we found that EMC has a function that may be able to generate postag, deprel, etc. However I'm not sure how to use it. it almost looks as though it takes part of the JSON as input to further generate something else instead.

I tried using the nltk library to generate these things instead, and I am able to generate postag, but head and deprel seem more complicated. It might be better to try again with EMC's function instead first, but I need help with that.

There also seem to be other libraries besides nltk that can do this, but I'm not sure of their reliability

arfaamr commented 11 months ago

Hi @farinamhz:

I have completed as much of the wrapper as I can. I put wrapper.py, splits.json, review.pkl in LADy's src directory, and it was able to output .jsons to output/.

Currently the problems I am still facing are that:

LADy's get_aos() in review.py does not seem to give any opinion. All opinions are blank, so opinion_tags in the jsons are incorrect as a result.
nltk's postag() has a parameter for language, but only supports English and Russian. Is this sufficient, or do we need to find an alternative?
using stanza for head, deprel results in error:

error: [W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.

It may work on someone else's device instead, or we could try a different library.

Without using stanza, the rest of the code runs without error and produces json files like EMC's. I have a local branch of LADy that I was working on, but I think I don't have permission to push it. I pushed my wrapper.py to a branch of my EMC fork instead for you to see it, here. It probably still has more errors but is as close as I can get.

I have upcoming finals this week, so I probably won't be able to work on it more. Sorry for the inconvenience.

farinamhz commented 11 months ago

Hey @arfaamr, Thank you for the updates. We'll take care of the rest. Good luck with the exams!

@hosseinfani

farinamhz commented 11 months ago

Meanwhile, as reviews are all in English whether original or backtranslated, I don't think that NLTK would have any problem, @arfaamr.

arfaamr commented 11 months ago

@farinamhz, OK, that works out then. The original wrapper.py that you gave me had a lang parameter in the preprocess() function, so I wasn't sure whether that meant the function was expected to work with multiple languages.

fani-lab / LADy

Aspect Sentiment Triplet Extraction Baseline #57