fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

Distribution of aspect terms and aspect categories in datasets #36

Closed hosseinfani closed 1 year ago

hosseinfani commented 1 year ago

@farinamhz @Lillliant As part of stats on datasets, we need to show the distribution of aspects in each dataset, which is probably long-tail or imbalanced.

We also need the distribution of words in the SemEval datasets.

For an example codebase, you can look at this function, which produces stats on a dataset of teams:

https://github.com/fani-lab/OpeNTF/blob/45aa32b1e32edc906d926c7f841a4ec089f34d18/src/cmn/team.py#L210
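As a rough illustration of the kind of stats requested above, the sketch below counts how often each aspect term and each token occurs and reports how much of the mass sits in the most frequent aspect. The toy `reviews` list, the `frequency_distribution` helper, and the `head_share` measure are all assumptions made for this example; they are not taken from LADy or from the linked OpeNTF function.

```python
from collections import Counter

# Hypothetical toy data: each review is (tokens, aspect terms).
# This is NOT the LADy review format, only a stand-in for illustration.
reviews = [
    (["the", "pizza", "was", "great"], ["pizza"]),
    (["slow", "service", "but", "good", "pizza"], ["service", "pizza"]),
    (["the", "staff", "was", "rude"], ["staff"]),
    (["pizza", "arrived", "cold"], ["pizza"]),
]


def frequency_distribution(items):
    """Return (item, count) pairs sorted from most to least frequent."""
    return Counter(items).most_common()


aspect_dist = frequency_distribution(a for _, aspects in reviews for a in aspects)
token_dist = frequency_distribution(t for tokens, _ in reviews for t in tokens)

# Share of all aspect mentions covered by the single most frequent aspect.
# A large value hints at an imbalanced (long-tail) distribution.
total = sum(count for _, count in aspect_dist)
head_share = aspect_dist[0][1] / total

print(aspect_dist)
print(token_dist)
print(f"head share: {head_share:.2f}")
```

Sorting by frequency first makes it easy to plot rank against count later, which is the usual way to eyeball a long tail.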

farinamhz commented 1 year ago

Any updates @Lillliant?

Lillliant commented 1 year ago

Hi @farinamhz,

Sorry for the late reply. I've finished the preliminary code and plots for the distribution of aspects and words (tokens), and I've opened a PR (#39) so the code can be reviewed for accuracy.

Additionally, the plots can be seen in this folder: semeval data.

Currently, only the original SemEval datasets are used to generate the stats because, judging from the code and the generated stats, the augmented reviews.pkl files do not appear to carry information on the new aspects. Please let me know if I should use the labelled backtranslation datasets from the data folder instead.

An example for semeval-16/15/14 looks like this (naspects_nreviews)
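As a rough sketch of how a statistic like `naspects_nreviews` could be computed, the snippet below assumes the name means "number of aspects in a review" against "number of reviews with that many aspects". That reading, and the toy `aspects_per_review` data, are assumptions for illustration and are not taken from the PR.

```python
from collections import Counter

# Hypothetical toy data: the list of aspect terms labelled in each review.
aspects_per_review = [
    ["pizza"],
    ["service", "pizza"],
    ["staff"],
    [],
    ["pizza", "price"],
]

# Map "number of aspects in a review" -> "number of reviews with that many aspects".
naspects_nreviews = Counter(len(aspects) for aspects in aspects_per_review)

for n_aspects, n_reviews in sorted(naspects_nreviews.items()):
    print(f"{n_aspects} aspect(s): {n_reviews} review(s)")
```

The sorted pairs can be passed directly to a bar plot, with the aspect count on the x-axis and the review count on the y-axis.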

The gists of the stats in the folders are as follows:

hosseinfani commented 1 year ago

Hi @Lillliant @farinamhz

Thank you very much. A few questions:

https://github.com/fani-lab/LADy/blob/main/src/cmn/LADy.png

let me know if you need more help.

Lillliant commented 1 year ago

Hi @hosseinfani @farinamhz,

The input for the methods is indeed *.pkl. I will update the code to make it clearer what the input is.

Also, thank you for telling me about the augs dictionary storing the augmented reviews. It's taking my computer a while to generate the augmented datasets, but I'll update the branch with distribution results as soon as they are generated.
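To make the idea concrete, here is a minimal sketch of counting aspects with and without the augmented copies. It assumes each review holds an `augs` dictionary that maps a language code to a backtranslated copy of the review; the `ToyReview` class, its `aspects` field, and the `aspect_counts` helper are hypothetical stand-ins, not LADy's actual classes or API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToyReview:
    # Hypothetical stand-in for a review object; not LADy's actual class.
    aspects: list
    # Assumed layout: language code -> backtranslated copy of this review.
    augs: dict = field(default_factory=dict)


def aspect_counts(reviews, include_augs=False):
    """Count aspect terms, optionally including those of augmented copies."""
    counts = Counter()
    for review in reviews:
        counts.update(review.aspects)
        if include_augs:
            for aug in review.augs.values():
                counts.update(aug.aspects)
    return counts


reviews = [
    ToyReview(["pizza"], augs={"fr": ToyReview(["pizza"]), "de": ToyReview(["pie"])}),
    ToyReview(["service"]),
]

original = aspect_counts(reviews)
augmented = aspect_counts(reviews, include_augs=True)
print(original)
print(augmented)
```

Keeping the two counts separate makes it possible to compare the original distribution with the augmented one, for example to see whether backtranslation introduces new aspect terms such as `pie` in this toy data.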

hosseinfani commented 1 year ago

@Lillliant I have all the files on my PC. I'll upload them to the LADy channel now. You don't have to generate the translations.

hosseinfani commented 1 year ago

@Lillliant @farinamhz I think we can safely close this issue. Let me know otherwise.