Closed: HamedBabaei closed this issue 1 year ago.
Hi @jd-coderepos
I put everything discussed in the meeting in this issue for easy follow-up for the next meeting. Please feel free to add anything that I missed.
WN18RR Dataset stats
Entity Typing:
The size of the entity type detection train set is: 40476
The size of the entity type detection test set is: 5232
The size of the entity type detection valid set is: 5110
-----------------------------------------------------------------
The size of the overall entity type detection dataset is: 50818
Relationship Detection:
The size of the train set is: 85536
The size of the test set is: 3078
The size of the valid set is: 2993
-----------------------------------------------------------------
The size of the overall dataset is: 91607
class-based frequencies for datasets (entity type detection):
dataset | NN | VB | JJ | RB |
---|---|---|---|---|
Train | 31761 | 7663 | 1023 | 29 |
Test | 3784 | 1336 | 107 | 5 |
Valid | 3709 | 1276 | 122 | 3 |
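The per-split class frequencies above can be reproduced by tallying the entity-type labels of each split; here is a minimal sketch, where the tag list is a toy stand-in for the real WN18RR split files:

```python
from collections import Counter

def class_frequencies(tags):
    """Count how often each entity-type class (NN, VB, JJ, RB) occurs in a split."""
    return Counter(tags)

# Toy stand-in for one split; the actual lists come from the dataset files.
train_tags = ["NN", "NN", "VB", "JJ", "NN", "RB", "VB"]
freqs = class_frequencies(train_tags)
print(freqs["NN"], freqs["VB"], freqs["JJ"], freqs["RB"])  # 3 2 1 1
```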
class-based frequencies for datasets (relation type detection):
Entity Typing:
The size of the entity type detection train set is: 18715
The size of the entity type detection test set is: 4448
The size of the entity type detection valid set is: 4401
-----------------------------------------------------------------
The size of the overall entity type detection dataset is: 27564
Relationship Detection:
The size of the train set is: 22375
The size of the test set is: 2971
The size of the valid set is: 2984
-----------------------------------------------------------------
The size of the overall dataset is: 28330
class-based frequencies for datasets (entity type detection):
dataset | NN | VB | JJ | RB |
---|---|---|---|---|
Train | 10000 | 7663 | 1023 | 29 |
Test | 3700 | 1336 | 107 | 5 |
Valid | 3700 | 1276 | 122 | 3 |
class-based frequencies for datasets (relation type detection):
Hi @jd-coderepos According to these stats, I suggest dropping the `RB` class from entity types due to its low frequency, and dropping the `_similar_to`, `_member_of_domain_region`, and `_member_of_domain_usage` relations from the dataset. Their per-split frequencies are:

relation | Train | Test | Validation |
---|---|---|---|
`_similar_to` | 52 | 3 | 3 |
`_member_of_domain_region` | 43 | 26 | 34 |
`_member_of_domain_usage` | 10 | 22 | 22 |
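Dropping those three relations amounts to a simple filter over the triple lists; here is a hedged sketch, assuming each triple is a `(head, relation, tail)` tuple (the toy triples are illustrative, not real WN18RR entries):

```python
# Relations to remove due to their low frequency in all splits.
DROP = {"_similar_to", "_member_of_domain_region", "_member_of_domain_usage"}

def filter_relations(triples, drop=DROP):
    """Keep only triples whose relation is not in the drop set."""
    return [t for t in triples if t[1] not in drop]

# Toy stand-in data; real triples would be loaded from the split files.
triples = [
    ("dog", "_hypernym", "animal"),
    ("x", "_similar_to", "y"),
    ("a", "_member_of_domain_usage", "b"),
]
kept = filter_relations(triples)
print(len(kept))  # 1
```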
@jd-coderepos Jupyter notebook for references: https://github.com/HamedBabaei/LLMs4OL/blob/main/notebooks/04-WN18R.ipynb
Please confirm the above-mentioned changes if they are appropriate, and I will make them.
It actually makes sense to remove those three relations you show in the tables above. TBH, it is not clear what they mean anyway so removing makes our lives easier in terms of interpreting the relations we consider.
I saw the frequency for the RB class and I mostly agree that we can drop it.
Hi @jd-coderepos
According to the plot and the statistics dataframe (in this table, the colored classes are the ones to be considered; we also made some combinations of classes, particularly for Level-3), we can move forward with this dataset in the following manner:
[x] According to the Level-2-P table (the second table), it is better to remove this class from Level-2, consider only PPL in Level-3, and keep P in Level-1, because the Level-2-P class frequencies are:
PPL 999952
STL 44
Due to this highly imbalanced nature of Level-2-P, we may not be able to consider this level.
[x] The same scenario happens for Level-1 class A. The frequencies of its Level-2 classes (Level-2-A) are as follows:
ADM 515004
PCL 264
PRS 197
ZN 33
LTE 18
TER 7
ZNB 4
I recommend removing Level-2-A, considering only Level-3-ADM for the Level-3 samples, and keeping all samples for Level-1.
[x] Most of the classes in Level-2 have only a single class in Level-3, which has no use for analysis from our perspective. For example, Level-3-L-LCT has only 1 class in Level-3. So we might ignore all of the classes in Level-3 that meet this condition.
[x] Classes in Level-2 might have fewer than 1000 samples, so we are interested in ignoring them in both Level-2 and Level-3. For example, the Level-2-U and Level-3-U classes are ignored.
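The last two pruning rules (fewer than 1000 samples, or only a single Level-3 subclass) could be sketched as follows; the counts and subclass maps below are toy stand-ins, except where they echo the numbers quoted above:

```python
def prune(class_counts, children, min_samples=1000):
    """Keep Level-2 classes that have enough samples and more than one Level-3 subclass."""
    kept = {}
    for cls, n in class_counts.items():
        if n < min_samples:                   # rule: too few samples -> ignore
            continue
        if len(children.get(cls, ())) <= 1:   # rule: single Level-3 subclass -> no analytic use
            continue
        kept[cls] = n
    return kept

# Toy example mixing quoted counts (PPL, STL, ADM) with made-up ones (LCT).
counts = {"PPL": 999952, "STL": 44, "LCT": 1500, "ADM": 515004}
subs = {"PPL": {"PPLA", "PPLC"}, "STL": {"STLX"}, "LCT": {"LCT1"}, "ADM": {"ADM1", "ADM2"}}
print(sorted(prune(counts, subs)))  # ['ADM', 'PPL']
```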
@jd-coderepos Jupyter notebook for references: https://github.com/HamedBabaei/LLMs4OL/blob/main/notebooks/05-Geonames.ipynb
Please confirm the above-mentioned changes if they are appropriate and I will make the changes.
Hi @jd-coderepos here is the table of frequencies for types based on sources (the SAB column), using the obtained entities (CUIs). It can be loaded into a pandas DataFrame or viewed in VS Code as well.
We have the following stats for now (on the relationship detection and entity detection sets):
Size of the UMLS relation detection set: 19,783,580
Size of the UMLS entity detection set: 2,093,042
I was thinking of doing the same as we did with Geonames for this dataset as well.
I played around with level-based frequencies in the entity type detection dataset (for which we have 2M samples) and I got the following samples to be considered at each level (green-colored classes are OK and red ones are not):
NOTE: Just for clarification, I have created Level-4 only for the entity type classification of Level-3.
We decided to move on with the NCI, SNOMEDCT_US, and MEDCIN sources on the UMLS dataset.
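Restricting the UMLS entities to those three sources is essentially a membership test on the SAB column; a minimal sketch with toy records (the keys follow UMLS naming conventions, but the rows are made up):

```python
# Sources we agreed to keep.
KEEP_SOURCES = {"NCI", "SNOMEDCT_US", "MEDCIN"}

# Toy records standing in for the real UMLS rows (CUI = concept ID, SAB = source).
rows = [
    {"CUI": "C0001", "SAB": "NCI"},
    {"CUI": "C0002", "SAB": "MSH"},
    {"CUI": "C0003", "SAB": "SNOMEDCT_US"},
    {"CUI": "C0004", "SAB": "MEDCIN"},
]
filtered = [r for r in rows if r["SAB"] in KEEP_SOURCES]
print(len(filtered))  # 3
```

The same filter expressed on a pandas DataFrame would be `df[df["SAB"].isin(KEEP_SOURCES)]`.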
Everything is completed on this issue, so I will close it!
WN18RR: We checked out the dataset and its diagrams and decided on a few things and tasks for this dataset. Entity types: (`NN`, `JJ`, `VB`, `RB`). Relations: `also_see`, and consider `_hypernym`. What about others?
FB15K-237: We concluded that the hierarchy I extended for this dataset is kind of our contribution, and we stick to this hierarchy for moving forward with this dataset.
See the notebook `01-analysis of datasets.ipynb` for this dataset. Again, we need to rethink this after getting a clear vision (I mean, completing my diagrams).
Geonames: We talked about how Level-2 is generated in the notebook `02-Geoname-levels-creation.ipynb`, with a frequency matrix based on the start string for Level-2, and we concluded the following tasks. After these tasks, we should see whether the new version of the dataset stats is fine for us in terms of the frequency of classes in each level or not.
UMLS: We have a lot of samples with entity types and relations, and we don't know which to consider. However, to continue, we need the following information (we decided to only consider the English language): either of these two tasks will allow us to proceed with cutting the samples down to a smaller size.