HamedBabaei / LLMs4OL

LLMs4OL: Large Language Models for Ontology Learning
MIT License

Tasks and conclusions from meeting 9 Jan 2023 regarding dataset preparations #1

Closed HamedBabaei closed 1 year ago

HamedBabaei commented 1 year ago

WN18RR: We checked out the dataset and its diagrams, and we decided on a few things and tasks for this dataset.


FB15K-237: We concluded that the hierarchy I extended for this dataset is essentially our own contribution, and we will stick to this hierarchy moving forward with this dataset.

Again, we need to rethink this once we have a clearer picture (i.e., once my diagrams are complete).


Geonames: We talked about how level 2 is generated in notebook 02-Geoname-levels-creation.ipynb, using a frequency matrix over the start strings for level 2. We also concluded the following tasks:

After these tasks, we should check whether the new version of the dataset stats is acceptable for us in terms of the frequency of classes at each level.
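As a rough illustration of the level-2 construction described above, a frequency matrix over the start strings of feature codes could be built with pandas. This is a minimal sketch, not the notebook's exact logic: the column names, toy data, and the 3-character prefix are assumptions for illustration.

```python
import pandas as pd

# Toy Geonames-style rows: a level-1 feature class plus a feature code.
# The real data comes from 02-Geoname-levels-creation.ipynb.
df = pd.DataFrame({
    "feature_class": ["A", "A", "A", "P", "P", "P", "H", "H"],
    "feature_code":  ["ADM1", "ADM2", "ADMD", "PPL", "PPLA", "PPLC", "STM", "STMI"],
})

# Level-2 candidate: the leading "start string" of the feature code
# (here assumed to be its first 3 characters, e.g. "ADM1" -> "ADM").
df["start_string"] = df["feature_code"].str[:3]

# Frequency matrix: level-1 classes vs. level-2 start strings.
freq_matrix = pd.crosstab(df["feature_class"], df["start_string"])
print(freq_matrix)
```

Inspecting such a matrix makes it easy to see which start strings are frequent enough to keep as level-2 classes.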


UMLS

We have a lot of samples with entity types and relations, and we don't know which of them to consider. However, to continue, we need the following information (we decided to consider only the English language):

Either of these two tasks will allow us to proceed with cutting the samples down to a smaller size.
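The English-only decision above could be applied with a simple language filter. This is a hedged sketch: it assumes MRCONSO-style rows where the language column is LAT and English rows carry the value "ENG"; the tiny frame is illustrative, not the real UMLS dump.

```python
import pandas as pd

# Toy MRCONSO-style rows; in real UMLS data, LAT = "ENG" marks English.
conso = pd.DataFrame({
    "CUI": ["C0001", "C0001", "C0002", "C0003"],
    "LAT": ["ENG", "FRE", "ENG", "SPA"],
    "STR": ["heart", "coeur", "lung", "pulmon"],
})

# Keep only English rows before any further sample cutting.
english_only = conso[conso["LAT"] == "ENG"].reset_index(drop=True)
print(len(english_only))
```

Filtering on language first shrinks the dataset before the heavier per-class sampling decisions.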

HamedBabaei commented 1 year ago

Hi @jd-coderepos

I put everything discussed in the meeting in this issue for easy follow-up for the next meeting. Please feel free to add anything that I missed.

HamedBabaei commented 1 year ago

WN18RR dataset stats

Entity Typing:
      The size of the entity type detection train set is: 40476
      The size of the entity type detection test set is: 5232
      The size of the entity type detection valid set is: 5110
      -----------------------------------------------------------------
      The size of the overall entity type detection dataset is: 50818

Relationship Detection:
      The size of the train set is: 85536
      The size of the test set is: 3078
      The size of the valid set is: 2993
      -----------------------------------------------------------------
      The size of the overall dataset is: 91607

class-based frequencies for datasets (entity type detection):

| dataset | NN | VB | JJ | RB |
|---------|-------|------|------|----|
| Train | 31761 | 7663 | 1023 | 29 |
| Test | 3784 | 1336 | 107 | 5 |
| Valid | 3709 | 1276 | 122 | 3 |

class-based frequencies for datasets (relation type detection):

wn18r-fq-org

The stats after applying $FQ_{type} < 10{,}000$ for training and $FQ_{type} < 3{,}700$ for test and validation (based on observation, 3,700 was appropriate for the test and validation sets):

Entity Typing:
      The size of the entity type detection train set is: 18715
      The size of the entity type detection test set is: 4448
      The size of the entity type detection valid set is: 4401
      -----------------------------------------------------------------
      The size of the overall entity type detection dataset is: 27564

Relationship Detection:
      The size of the train set is: 22375
      The size of the test set is: 2971
      The size of the valid set is: 2984
      -----------------------------------------------------------------
      The size of the overall dataset is: 28330

class-based frequencies for datasets (entity type detection):

| dataset | NN | VB | JJ | RB |
|---------|-------|------|------|----|
| Train | 10000 | 7663 | 1023 | 29 |
| Test | 3700 | 1336 | 107 | 5 |
| Valid | 3700 | 1276 | 122 | 3 |
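The frequency cut described above (capping each class at 10,000 samples for train and 3,700 for test/validation) could be sketched with a per-class cap in pandas. This is a minimal illustration with toy data and an assumed helper name, not the notebook's exact code.

```python
import pandas as pd

def cap_class_frequency(df: pd.DataFrame, label_col: str, cap: int) -> pd.DataFrame:
    """Keep at most `cap` rows per class label (hypothetical helper)."""
    return df.groupby(label_col).head(cap)

# Toy entity-typing split: 6 NN, 3 VB, 1 JJ samples.
train = pd.DataFrame({
    "entity": [f"e{i}" for i in range(10)],
    "type": ["NN"] * 6 + ["VB"] * 3 + ["JJ"],
})

# Cap each class at 4 samples; only NN exceeds the cap here.
capped = cap_class_frequency(train, "type", cap=4)
print(capped["type"].value_counts().to_dict())
```

The same call with `cap=10_000` on the train split and `cap=3_700` on test/validation reproduces the kind of cut shown in the table above.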

class-based frequencies for datasets (relation type detection): wn18r-fq-cleaned

HamedBabaei commented 1 year ago

Hi @jd-coderepos According to these stats,

Train:

_similar_to                       52
_member_of_domain_region          43
_member_of_domain_usage           10

Test:

_member_of_domain_region          26
_member_of_domain_usage           22
_similar_to                        3

Validation:

_member_of_domain_region          34
_member_of_domain_usage           22
_similar_to                        3

HamedBabaei commented 1 year ago

@jd-coderepos Jupyter notebook for references: https://github.com/HamedBabaei/LLMs4OL/blob/main/notebooks/04-WN18R.ipynb

Please confirm whether the above-mentioned changes are appropriate, and I will make them.

jd-coderepos commented 1 year ago

It actually makes sense to remove those three relations shown in the tables above. TBH, it is not clear what they mean anyway, so removing them makes our lives easier in terms of interpreting the relations we consider.

I saw the frequency for the RB class and I mostly agree that we can drop it.

HamedBabaei commented 1 year ago

Geonames dataset stats:

condition-1

condition-2

geoname.stats.xlsx

HamedBabaei commented 1 year ago

Hi @jd-coderepos

According to the plot and the statistics dataframe (in this table the colored classes are the ones to be considered; we also combined some classes, particularly at Level 3), we can move forward with this dataset in the following manner:

HamedBabaei commented 1 year ago

@jd-coderepos Jupyter notebook for references: https://github.com/HamedBabaei/LLMs4OL/blob/main/notebooks/05-Geonames.ipynb

Please confirm whether the above-mentioned changes are appropriate, and I will make them.

HamedBabaei commented 1 year ago

UMLS

Hi @jd-coderepos, here is the table of frequencies for types based on sources (SAB column), using the obtained entities (CUIs). It can be loaded into a pandas DataFrame or opened in VS Code as well:

type_sab_matrix.csv
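A type-vs-source frequency matrix like the attached type_sab_matrix.csv could be built from entity rows that carry a semantic type and a source vocabulary. This is a hedged sketch: the toy frame and its column names are assumptions for illustration.

```python
import pandas as pd

# Toy UMLS-style entity rows: concept ID, semantic type, source (SAB).
entities = pd.DataFrame({
    "CUI":  ["C1", "C2", "C3", "C4", "C5"],
    "type": ["Disease", "Disease", "Drug", "Drug", "Drug"],
    "SAB":  ["NCI", "SNOMEDCT_US", "NCI", "MEDCIN", "NCI"],
})

# Frequency matrix: rows are types, columns are source vocabularies.
type_sab = entities.groupby(["type", "SAB"]).size().unstack(fill_value=0)
print(type_sab)
```

Such a matrix can be written out with `type_sab.to_csv(...)` and inspected exactly like the attached file.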

We have the following stats for now (on the relationship detection and entity detection sets):

size of the UMLS relation detection set: 19,783,580
size of the UMLS entity detection set: 2,093,042

I was thinking of doing the same as we did with Geonames for this dataset as well.

HamedBabaei commented 1 year ago

I played around with level-based frequencies in the entity type detection dataset (for which we have 2M samples), and I got the following classes to consider at each level (green-colored classes are OK; red ones are not):

level-2-which to consider

level-3-which to consider

NOTE: Just for clarification, I have created level 4 only for entity-type classification at level 3.

HamedBabaei commented 1 year ago

We decided to move on with the NCI, SNOMEDCT_US, and MEDCIN sources of the UMLS dataset.
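The source restriction decided above amounts to keeping only rows from the three chosen vocabularies. A minimal sketch, assuming SAB is a column of the working frame (the toy data is illustrative):

```python
import pandas as pd

# The three UMLS source vocabularies we decided to keep.
KEEP_SABS = {"NCI", "SNOMEDCT_US", "MEDCIN"}

# Toy rows; in practice this would be the full UMLS entity/relation frame.
rows = pd.DataFrame({
    "CUI": ["C1", "C2", "C3", "C4"],
    "SAB": ["NCI", "ICD10", "MEDCIN", "SNOMEDCT_US"],
})

# Keep only rows whose source is one of the selected vocabularies.
selected = rows[rows["SAB"].isin(KEEP_SABS)].reset_index(drop=True)
print(len(selected))
```

Applying this filter before the per-level frequency analysis keeps the dataset to the agreed-upon sources.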

HamedBabaei commented 1 year ago

Everything in this issue is completed, so I will close it!