ChEB-AI / python-chebai

GNU Affero General Public License v3.0

Data handling needs to be restructured #10

Closed: sfluegel05 closed this issue 1 month ago

sfluegel05 commented 6 months ago

Status quo

Goal

aditya0by0 commented 3 months ago

Hi @sfluegel05, I have a question about this issue. Do we have to implement the restructuring described above only for the ChEBI dataset, or for all other datasets too?

sfluegel05 commented 3 months ago

This is only for the ChEBI datasets. The other datasets have their own structure. That should be adjusted as well at some point, but that would be a separate issue.

sfluegel05 commented 3 months ago

A special case for the data splits is the chebi_version_train:

Use case

You want to compare two models trained on different versions of ChEBI. In order to make a fair comparison, you need to evaluate both models on the same test set (and train them on training sets that don't overlap with this test set).
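To make the constraint concrete, here is a minimal sketch of such a split. It is not the python-chebai implementation: the function name `split_for_version_comparison`, its parameters, and the choice to draw the test set from IDs present in both versions are all assumptions made for illustration.

```python
import random


def split_for_version_comparison(ids_old, ids_new, test_fraction=0.1, seed=0):
    """Hypothetical helper: build one shared test set and two training sets.

    ids_old / ids_new are the entity IDs available in the two ChEBI versions.
    Assumption: the test set is drawn only from IDs present in both versions,
    so that both models can be evaluated on every test entity.
    """
    shared = sorted(set(ids_old) & set(ids_new))
    if not shared:
        raise ValueError("the two versions have no IDs in common")

    # A seeded generator keeps the split reproducible without storing it in a file.
    rng = random.Random(seed)
    n_test = max(1, int(len(shared) * test_fraction))
    test_ids = set(rng.sample(shared, n_test))

    # Each training set is its version's IDs minus the shared test set,
    # so neither model sees a test entity during training.
    train_old = set(ids_old) - test_ids
    train_new = set(ids_new) - test_ids
    return test_ids, train_old, train_new
```

Because the split is derived from the ID sets and a seed, it can be recomputed on demand rather than written out per version pair.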

Tasks

Most of the functionality for this is already implemented; it just needs to be adapted to the dynamic data splits. In the end, no new files should be created for specific splits.