ChEB-AI / python-chebai

GNU Affero General Public License v3.0

Build a 3-star ChEBI dataset #61

Open sfluegel05 opened 1 month ago

Status

Currently, we use all of ChEBI in our dataset. However, not all ChEBI data is of equal quality. ChEBI distinguishes between 2-star and 3-star entities (see the ChEBI user manual). 3-star entities are manually added by the ChEBI team, while other entities have been added by external parties.

Goal

The goal is to investigate what effect using only 3-star data has on the classification task. The hypothesis is that, for some classes, the 2-star entities are not classified correctly or completely. For example, tripeptide has 220 subclasses, 195 of which are 3-star. However, there are about 8,000 peptides that should be classified as tripeptide but are not. Most (if not all) of them are 2-star.

Task

Create a ChEBI dataset that selects only 3-star classes. Selecting the classes should be rather simple. The complicated part is the relations. Since we don't know which relations are 2-star (or whether a relation between two 3-star classes can be trusted if they are connected only via a 2-star class), we need to make compromises. The easiest solution would be to treat all relations as 3-star.
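
A minimal sketch of the "treat all relations as 3-star" compromise, to make the idea concrete. It is not taken from python-chebai: the function name `three_star_labels`, the input shapes (a dict of star ratings and a list of `(child, parent)` is_a pairs), and the toy IDs are all hypothetical. The sketch keeps only 3-star classes as samples and as labels, but follows is_a edges through 2-star classes when collecting ancestors, so two 3-star classes stay connected even if the only path between them runs through a 2-star class.

```python
from collections import defaultdict


def three_star_labels(stars, is_a):
    """Map each 3-star class to its 3-star ancestors.

    stars: dict mapping class ID -> star rating (hypothetical input shape)
    is_a:  iterable of (child, parent) pairs; every pair is trusted,
           i.e. all relations are treated as if they were 3-star
    """
    parents = defaultdict(set)
    for child, parent in is_a:
        parents[child].add(parent)

    # Only 3-star classes are kept as samples and as possible labels.
    keep = {node for node, rating in stars.items() if rating == 3}

    def ancestors(node):
        # Iterative traversal of the full hierarchy, including 2-star nodes.
        seen = set()
        stack = list(parents[node])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents[current])
        return seen

    # Restrict the ancestor sets to 3-star classes after the traversal.
    return {node: ancestors(node) & keep for node in keep}


# Toy hierarchy with made-up IDs: A (3-star) is_a B (2-star) is_a C (3-star),
# and D (2-star) is_a C.
stars = {"A": 3, "B": 2, "C": 3, "D": 2}
is_a = [("A", "B"), ("B", "C"), ("D", "C")]
labels = three_star_labels(stars, is_a)
```

In this toy example, A keeps C as a label although the path runs through the 2-star class B, while B and D are dropped entirely. A stricter variant would discard any path that passes through a 2-star class, at the cost of disconnecting some 3-star classes from their ancestors.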