fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

Aspect Sentiment Triplet Extraction Baseline #57

Open farinamhz opened 9 months ago

farinamhz commented 9 months ago

Hey @arfaamr, I hope you are doing well. Since we want to add a new baseline to LADy that is suitable for both aspect term and sentiment extraction, I have chosen the paper "Enhanced Multi-Channel Graph Convolutional Network for Aspect Sentiment Triplet Extraction," accepted at ACL 2022. You can check it out and run it using their repo at this link: https://github.com/CCChenhao997/EMCGCN-ASTE

Let me know if you have any questions or need help on this task.

@hosseinfani

arfaamr commented 8 months ago

Hi, I've taken a look at the repo and paper. I'm a little confused; am I supposed to train/test the model on the augmented datasets used with LADy? I assumed our repo would have datasets with the original reviews plus the backtranslated reviews to run the model on, but I wasn't able to find where those would be, or whether I'm approaching this correctly at all.

farinamhz commented 8 months ago

Hey @arfaamr, we'd like you to first check out the repo and run it on their datasets to figure out how it works. Since the datasets are similar (they are SemEval, though possibly in different formats), it will be easier to integrate their work with LADy. So, once you can successfully run their codebase, we will integrate it with LADy.

We can talk more in our meeting today if it still needs clarification.

arfaamr commented 8 months ago

[Last week's progress]

hosseinfani commented 8 months ago

Thanks @arfaamr. Please update us on your run on the toy example. Then connect with @3ripleM to integrate it into the LADy pipeline. Thanks.

arfaamr commented 8 months ago

@hosseinfani: I tried using the CPU with the full original dataset, and it trains without error, but it is slow. I set epochs=2 instead of 100 just to see if it would train/test to completion, and it did. Should I still make a toy dataset and train with the default 100 epochs, or is this enough to confirm that the code works and move on to integrating with LADy?

hosseinfani commented 8 months ago

@arfaamr

@3ripleM please connect with Arfaa for this.

arfaamr commented 8 months ago
[Screenshot: runcpu, run on CPU for 2 epochs]

3ripleM commented 8 months ago

Hi @arfaamr ,

For the integration, we have a file named mdl.py, which contains an abstract class for creating a new baseline.

To add the new baseline to LADy, you need to add a file to the aml directory, and its name should match the name of the baseline. Additionally, in main.py, you should add the model name at line 204.
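For example, a new baseline file could start from a skeleton like this. This is only an illustrative sketch: the abstract class name and method signatures below are assumptions, so match them to whatever aml/mdl.py actually defines:

```python
# aml/emc.py -- illustrative skeleton for a new baseline
from .mdl import AbstractAspectModel  # assumed name of the abstract class

class Emc(AbstractAspectModel):
    def __init__(self, naspects, nwords):
        super().__init__(naspects, nwords)

    def train(self, reviews_train, reviews_valid, settings, output):
        # convert LADy Review objects into EMC's input format,
        # then delegate to EMC's own training loop
        raise NotImplementedError

    def infer(self, review):
        # return predicted aspects/sentiments for a single review
        raise NotImplementedError
```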

Furthermore, there are more definitions to understand in the abstract class. Please look into the model class, and if there are any parts you don't understand, feel free to contact me.

We can also schedule a Teams meeting. I believe it would be better to have a meeting once you've had a chance to review the LADy project in order to discuss your tasks.

arfaamr commented 8 months ago

Thanks @3ripleM. I have created emc.py and added the model to main.py. So far, I have only created a basic __init__() with naspects and nwords. I'm not sure how to integrate the rest of the needed functions. I would be available to meet anytime today or tomorrow after noon.

arfaamr commented 8 months ago

This week, I started writing the functions in emc.py, looking at the other files in /aml for reference. Most of the other files import a library for their baseline (e.g., import fasttext, import bert_e2e_absa) and use its functions like train(), so I tried that. I created an __init__.py in my baseline's repo, but I'm not sure exactly how to import it in emc.py, as the repo name has a dash, which is invalid in import statements; see the sketch below for one possible workaround. Additionally, I noticed that settings['train'] in params.py has sections for the other baselines. Should I add my baseline's args there as well? Also, I am currently working on a local branch of LADy. Should I push it so someone can check my progress and see if it looks okay so far, or just wait until I finish the functions?
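One workaround I'm considering is loading the module by file path with importlib instead of an import statement, which sidesteps the dash. This is only a sketch; the path below is a placeholder for wherever the EMCGCN-ASTE checkout actually lives:

```python
import importlib.util
import sys

# Load a module from a directory whose name contains a dash.
# The path is a placeholder, not necessarily EMC's actual layout.
spec = importlib.util.spec_from_file_location("emcgcn", "EMCGCN-ASTE/code/main.py")
emcgcn = importlib.util.module_from_spec(spec)
sys.modules["emcgcn"] = emcgcn
spec.loader.exec_module(emcgcn)

# emcgcn can now be used like a normally imported module.
```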

arfaamr commented 7 months ago

This week, I worked on preprocess() in emc.py, which converts LADy's data to the format the EMCGCN code uses for training.

I found that LADy's raw data is stored as XML files, while EMC's raw data is stored in JSON and .vocab files. The JSON files seem equivalent to LADy's XML files, but I am not sure what the .vocab files are for. They have names such as postag and deprel and are loaded with pickle.load, but I was not able to open or print them to see exactly what they contain.

As I understand it, LADy's XML data is used to create Review objects, which are used for training, while EMC's JSON and .vocab data is used to create Instance objects, which are similar to Review objects. I am working on converting Review objects to Instance objects, so that when LADy is run, its XML data will be used to create Reviews, those Reviews can be used to create Instances, and the Instances will be used in emc.py's train() and related functions, roughly along the lines of the sketch below.
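This is only a sketch of the conversion I have in mind: the Review attributes and JSON keys here are guesses, and the linguistic fields still have to be generated separately:

```python
import json

def review_to_instance(review):
    # Map a LADy Review to a dict shaped like EMC's input JSON.
    # The attribute names on `review` are assumptions, and fields
    # like "postag", "head", and "deprel" still need to be generated.
    return {
        'id': review.id,
        'sentence': ' '.join(review.sentences[0]),
    }

def write_instances(reviews, path):
    with open(path, 'w') as f:
        json.dump([review_to_instance(r) for r in reviews], f, indent=2)
```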

Let me know if this is an incorrect interpretation of how LADy works or how I should approach this.

arfaamr commented 7 months ago

I've almost finished writing wrapper.py, which uses the attributes of Review objects to write a JSON file. I haven't tested it yet; I'm only unsure about the meaning of a few keys that don't directly correspond to anything in LADy, like "postag", "deprel", and "target_tags".

farinamhz commented 7 months ago

Hi @arfaamr, thank you. Are they variables in the code? Could you give me an example of where they are used?

arfaamr commented 7 months ago

@farinamhz: I think I figured out what "target_tags" means now, but I'm still unsure about the others.

For "postag", "deprel", etc, they are written in the input JSON files like this, for each review:

{ ... "postag": ["CC", "DT", "NN", "VBD", "RB", "JJ", "IN", "PRP", "."], ... , "deprel": ["cc", "det", "nsubj", "cop", "advmod", "root", "case", "obl", "punct"]}

They are also loaded from .vocab files and used in train(), like this:

```python
l_rpd = 0.01 * F.cross_entropy(post_pred.reshape([-1, post_pred.shape[3]]), tags_symmetry_flatten, ignore_index=-1)
l_dep = 0.01 * F.cross_entropy(deprel_pred.reshape([-1, deprel_pred.shape[3]]), tags_symmetry_flatten, ignore_index=-1)
l_psc = 0.01 * F.cross_entropy(postag.reshape([-1, postag.shape[3]]), tags_symmetry_flatten, ignore_index=-1)
l_tbd = 0.01 * F.cross_entropy(synpost.reshape([-1, synpost.shape[3]]), tags_symmetry_flatten, ignore_index=-1)
```

I did not see them mentioned in the paper, and I couldn't find anything about them online.

farinamhz commented 7 months ago

@arfaamr,

I haven't checked out the code, but based on what you provided,

In my opinion, "postag" appears to represent part-of-speech tagging (POS tagging).


And "deprel" seems to be helpful for parsing or similar things. It also appears to represent a linguistic annotation, specifically for dependency relations in a structured data format. In dependency grammar, these labels describe the relationships between words in a sentence.

arfaamr commented 7 months ago

In a meeting with Farinam, we found that EMC has a function that may be able to generate postag, deprel, etc. However, I'm not sure how to use it; it almost looks as though it takes part of the JSON as input to generate something else instead.

I tried using the nltk library to generate these fields instead, and I am able to generate postag, but head and deprel seem more complicated. It might be better to try again with EMC's function first, but I need help with that.

There also seem to be other libraries besides nltk that can do this, but I'm not sure of their reliability.
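For example, stanza has a dependency parser that could produce all three fields in one pass. This is only a sketch; the example sentence is made up, and the tag conventions would still need to be checked against EMC's .vocab files:

```python
import stanza  # assumes: pip install stanza; stanza.download('en')

nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
doc = nlp('And the food was really good for me.')

for word in doc.sentences[0].words:
    # xpos holds Penn Treebank tags like the "postag" lists in EMC's JSON;
    # head is the 1-indexed head word, and deprel is the dependency relation
    print(word.text, word.xpos, word.head, word.deprel)
```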

arfaamr commented 7 months ago

Hi @farinamhz:

I have completed as much of the wrapper as I can. I put wrapper.py, splits.json, and review.pkl in LADy's src directory, and it was able to output .json files to output/.

Currently, the main problem I am still facing is that stanza fails to run on my device. It may work on someone else's device, or we could try a different library.

Without using stanza, the rest of the code runs without error and produces JSON files like EMC's. I have a local branch of LADy that I was working on, but I don't think I have permission to push it. I pushed my wrapper.py to a branch of my EMC fork instead for you to see it, here. It probably still has more errors, but it is as close as I can get.

I have upcoming finals this week, so I probably won't be able to work on it more. Sorry for the inconvenience.

farinamhz commented 7 months ago

Hey @arfaamr, Thank you for the updates. We'll take care of the rest. Good luck with the exams!

@hosseinfani

farinamhz commented 7 months ago

Meanwhile, since the reviews are all in English, whether original or backtranslated, I don't think NLTK would have any problem, @arfaamr.

arfaamr commented 7 months ago

@farinamhz, OK, that works out then. The original wrapper.py that you gave me had a lang parameter in the preprocess() function, so I wasn't sure whether that meant the function was expected to work with multiple languages.