Closed: HamedBabaei closed this issue 1 year ago.
Hi @jd-coderepos
I put everything discussed in the meeting in this issue for easy follow-up for the next meeting. Please feel free to add anything that I missed.
WN18RR Dataset stats
Entity Typing:
The size of the entity type detection train set is: 40476
The size of the entity type detection test set is: 5232
The size of the entity type detection valid set is: 5110
-----------------------------------------------------------------
The size of the overall entity type detection dataset is: 50818
Relationship Detection:
The size of the train set is: 85536
The size of the test set is: 3078
The size of the valid set is: 2993
-----------------------------------------------------------------
The size of the overall dataset is: 91607
class-based frequencies for datasets (entity type detection):
dataset | NN | VB | JJ | RB |
---|---|---|---|---|
Train | 31761 | 7663 | 1023 | 29 |
Test | 3784 | 1336 | 107 | 5 |
Valid | 3709 | 1276 | 122 | 3 |
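The per-split class frequencies above can be reproduced by tallying the entity-type labels of each split; here is a minimal sketch, where the tag list is a toy stand-in for the real WN18RR split files:

```python
from collections import Counter

def class_frequencies(tags):
    """Count how often each entity-type class (NN, VB, JJ, RB) occurs in a split."""
    return Counter(tags)

# Toy stand-in for one split; the actual lists come from the dataset files.
train_tags = ["NN", "NN", "VB", "JJ", "NN", "RB", "VB"]
freqs = class_frequencies(train_tags)
print(freqs["NN"], freqs["VB"], freqs["JJ"], freqs["RB"])  # 3 2 1 1
```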
class-based frequencies for datasets (relation type detection):
Entity Typing:
The size of the entity type detection train set is: 18715
The size of the entity type detection test set is: 4448
The size of the entity type detection valid set is: 4401
-----------------------------------------------------------------
The size of the overall entity type detection dataset is: 27564
Relationship Detection:
The size of the train set is: 22375
The size of the test set is: 2971
The size of the valid set is: 2984
-----------------------------------------------------------------
The size of the overall dataset is: 28330
class-based frequencies for datasets (entity type detection):
dataset | NN | VB | JJ | RB |
---|---|---|---|---|
Train | 10000 | 7663 | 1023 | 29 |
Test | 3700 | 1336 | 107 | 5 |
Valid | 3700 | 1276 | 122 | 3 |
class-based frequencies for datasets (relation type detection):
Hi @jd-coderepos According to these stats, I suggest dropping the `RB` class from entity types due to its low frequency, and dropping the `_similar_to`, `_member_of_domain_region`, and `_member_of_domain_usage` relations from the dataset. Their per-split frequencies are:

relation | Train | Test | Validation |
---|---|---|---|
`_similar_to` | 52 | 3 | 3 |
`_member_of_domain_region` | 43 | 26 | 34 |
`_member_of_domain_usage` | 10 | 22 | 22 |
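Dropping those three relations amounts to a simple filter over the triple lists; here is a hedged sketch, assuming each triple is a `(head, relation, tail)` tuple (the toy triples are illustrative, not real WN18RR entries):

```python
# Relations to remove due to their low frequency in all splits.
DROP = {"_similar_to", "_member_of_domain_region", "_member_of_domain_usage"}

def filter_relations(triples, drop=DROP):
    """Keep only triples whose relation is not in the drop set."""
    return [t for t in triples if t[1] not in drop]

# Toy stand-in data; real triples would be loaded from the split files.
triples = [
    ("dog", "_hypernym", "animal"),
    ("x", "_similar_to", "y"),
    ("a", "_member_of_domain_usage", "b"),
]
kept = filter_relations(triples)
print(len(kept))  # 1
```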
@jd-coderepos Jupyter notebook for references: https://github.com/HamedBabaei/LLMs4OL/blob/main/notebooks/04-WN18R.ipynb
Please confirm the above-mentioned changes if they are appropriate, and I will make them.
It actually makes sense to remove those three relations you show in the tables above. TBH, it is not clear what they mean anyway so removing makes our lives easier in terms of interpreting the relations we consider.
I saw the frequency for the RB class and I mostly agree that we can drop it.
Hi @jd-coderepos
According to the plot and the statistics dataframe (in this table, the colored classes are the ones to be considered; we also made some combinations of classes, particularly for Level-3), we can move forward with this dataset in the following manner:
[x] According to the Level-2-P table (the second table), it is better to remove this class from Level-2, consider only PPL in Level-3, and keep P in Level-1, because the Level-2-P class frequencies are:
PPL 999952
STL 44
Due to this highly imbalanced nature of Level-2-P, we may not be able to consider this level.
[x] The same scenario happens for Level-1 class A. The frequencies of its Level-2 classes (Level-2-A) are as follows:
ADM 515004
PCL 264
PRS 197
ZN 33
LTE 18
TER 7
ZNB 4
I recommend removing Level-2-A, considering only Level-3-ADM for the Level-3 samples, and keeping all samples for Level-1.
[x] Most of the classes in Level-2 have only a single class in Level-3, which has no use for analysis from our perspective. For example, Level-3-L-LCT has only 1 class in Level-3. So we might ignore all of the classes in Level-3 that meet this condition.
[x] Classes in Level-2 might have fewer than 1000 samples, so we are interested in ignoring them in both Level-2 and Level-3. For example, the Level-2-U and Level-3-U classes are ignored.
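The last two pruning rules (fewer than 1000 samples, or only a single Level-3 subclass) could be sketched as follows; the counts and subclass maps below are toy stand-ins, except where they echo the numbers quoted above:

```python
def prune(class_counts, children, min_samples=1000):
    """Keep Level-2 classes that have enough samples and more than one Level-3 subclass."""
    kept = {}
    for cls, n in class_counts.items():
        if n < min_samples:                   # rule: too few samples -> ignore
            continue
        if len(children.get(cls, ())) <= 1:   # rule: single Level-3 subclass -> no analytic use
            continue
        kept[cls] = n
    return kept

# Toy example mixing quoted counts (PPL, STL, ADM) with made-up ones (LCT).
counts = {"PPL": 999952, "STL": 44, "LCT": 1500, "ADM": 515004}
subs = {"PPL": {"PPLA", "PPLC"}, "STL": {"STLX"}, "LCT": {"LCT1"}, "ADM": {"ADM1", "ADM2"}}
print(sorted(prune(counts, subs)))  # ['ADM', 'PPL']
```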
@jd-coderepos Jupyter notebook for references: https://github.com/HamedBabaei/LLMs4OL/blob/main/notebooks/05-Geonames.ipynb
Please confirm the above-mentioned changes if they are appropriate and I will make the changes.
Hi @jd-coderepos here is the table of frequencies for types based on sources (the SAB column), using the obtained entities (CUIs). It can be loaded into a pandas DataFrame or viewed in VS Code as well.
We have the following stats for now (on the relationship detection and entity detection sets):
Size of the UMLS relation detection set: 19,783,580
Size of the UMLS entity detection set: 2,093,042
I was thinking of doing the same as we did with Geonames for this dataset as well.
I played around with level-based frequencies in the entity type detection dataset (for which we have 2M samples) and I got the following samples to be considered at each level (green-colored classes are OK and red ones are not):
NOTE: Just for clarification, I have created Level-4 only for the entity type classification of Level-3.
We decided to move on with the NCI, SNOMEDCT_US, and MEDCIN sources on the UMLS dataset.
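Restricting the UMLS entities to those three sources is essentially a membership test on the SAB column; a minimal sketch with toy records (the keys follow UMLS naming conventions, but the rows are made up):

```python
# Sources we agreed to keep.
KEEP_SOURCES = {"NCI", "SNOMEDCT_US", "MEDCIN"}

# Toy records standing in for the real UMLS rows (CUI = concept ID, SAB = source).
rows = [
    {"CUI": "C0001", "SAB": "NCI"},
    {"CUI": "C0002", "SAB": "MSH"},
    {"CUI": "C0003", "SAB": "SNOMEDCT_US"},
    {"CUI": "C0004", "SAB": "MEDCIN"},
]
filtered = [r for r in rows if r["SAB"] in KEEP_SOURCES]
print(len(filtered))  # 3
```

The same filter expressed on a pandas DataFrame would be `df[df["SAB"].isin(KEEP_SOURCES)]`.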
Everything is completed on this issue, so I will close it!
WN18RR: We checked out the dataset and its diagrams and decided on a few things and tasks for this dataset. Entity types: (`NN`, `JJ`, `VB`, `RB`). Relations: `also_see`, and consider `_hypernym`. What about others?
FB15K-237: We concluded that the hierarchy I extended for this dataset is kind of our contribution, and we stick to this hierarchy for moving forward with this dataset.
See the notebook `01-analysis of datasets.ipynb` for this dataset. Again, we need to rethink this after getting a clear vision (I mean, completing my diagrams).
Geonames: We talked about how Level-2 is generated in the notebook `02-Geoname-levels-creation.ipynb`, with a frequency matrix based on the start string for Level-2, and we concluded the following tasks. After these tasks, we should see whether the new version of the dataset stats is fine for us in terms of the frequency of classes in each level or not.
UMLS: We have a lot of samples with entity types and relations, and we don't know which to consider. However, to continue, we need the following information (we decided to only consider the English language): either of these two tasks will allow us to proceed with cutting the samples down to a smaller size.