A test case for term_callback is failing because the function does not ignore (skip) obsolete ChEBI terms. As a result, the test cases for _extract_class_hierarchy and _graph_to_raw_dataset also fail, since both consume the output of term_callback.
Current Behavior:
Right now, this failure does not seem to affect the pre-processing pipeline on real data, because obsolete ChEBI terms typically do not have SMILES strings.
The _graph_to_raw_dataset method filters out data instances:
without SMILES strings:
    data = data[~data["SMILES"].isnull()]
without any relationship to other instances:
    data = data[data.iloc[:, self._LABELS_START_IDX:].any(axis=1)]
So, even though obsolete terms are not specifically filtered, their lack of SMILES strings ensures they are excluded from the dataset.
Potential Future Issue:
If a future version of ChEBI contains obsolete terms that do have SMILES strings and still maintain relationships with non-obsolete terms, those terms would pass both filters and end up in the dataset.
Since the current filtering is based solely on non-null SMILES strings and relationships to other terms, there’s no explicit logic to filter obsolete terms.
If such terms appear in future releases, the current approach could lead to errors, because obsolete terms with SMILES strings might slip through the filters.
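The two filters quoted above can be tried on a toy frame to see both the current behavior and the potential gap. This is only a sketch: the column layout, the value of LABELS_START_IDX, and the three rows (including their ids and SMILES) are made up for illustration and are not taken from ChEBI or from the chebai code.

```python
import pandas as pd

# Assumption: label columns start after "id", "name" and "SMILES".
LABELS_START_IDX = 3

# Made-up rows, not real ChEBI entries.
data = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["live term", "obsolete, no SMILES", "obsolete, with SMILES"],
        "SMILES": ["CCO", None, "CC(=O)O"],
        # Boolean label columns: True if the row is related to that class.
        "label_a": [True, True, True],
        "label_b": [False, False, False],
    }
)

# The same two filters as in _graph_to_raw_dataset.
data = data[~data["SMILES"].isnull()]
data = data[data.iloc[:, LABELS_START_IDX:].any(axis=1)]

surviving_ids = data["id"].tolist()
print(surviving_ids)
```

Row 2 is dropped only because its SMILES is missing. Row 3, which stands in for an obsolete term with a SMILES string and at least one label, stays in the frame, since neither filter looks at obsolescence.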
Proposed Solution:
We can update the term_callback logic to explicitly ignore obsolete terms by checking for the is_obsolete clause:
    if isinstance(clause, fastobo.term.IsObsoleteClause):
        if clause.obsolete:
            # If the term frame contains an "is_obsolete: true" clause, skip this term.
            return False
This solution would ensure that obsolete terms are skipped before they are processed, preventing potential future issues with the dataset.
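As a rough, self-contained illustration of how the check could sit inside the callback and how the caller could drop skipped terms, here is a sketch that does not depend on fastobo. The IsObsoleteClause and NameClause classes are hypothetical stand-ins for the fastobo clause types, and term_callback_sketch with its (term_id, clauses) signature is an assumption, not the actual chebai function.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass
class IsObsoleteClause:
    """Hypothetical stand-in for fastobo.term.IsObsoleteClause."""
    obsolete: bool


@dataclass
class NameClause:
    """Hypothetical stand-in for a clause that carries the term name."""
    name: str


def term_callback_sketch(term_id: str, clauses: Iterable[object]) -> Union[dict, bool]:
    """Return a dict for a live term, or False if the term is obsolete."""
    name = None
    for clause in clauses:
        if isinstance(clause, IsObsoleteClause):
            if clause.obsolete:
                # Obsolete term: skip it before any further processing.
                return False
        elif isinstance(clause, NameClause):
            name = clause.name
    return {"id": term_id, "name": name}


# Made-up terms: one live, one obsolete.
terms = [
    ("T:1", [NameClause("live term")]),
    ("T:2", [NameClause("old term"), IsObsoleteClause(True)]),
]

# The caller keeps only terms for which the callback returned a dict.
results = [term_callback_sketch(term_id, clauses) for term_id, clauses in terms]
kept = [r for r in results if r]
print(kept)
```

Returning False early keeps the obsolete term out of everything built downstream, so the SMILES and label filters no longer have to exclude it by accident.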
Originally posted by @aditya0by0 in https://github.com/ChEB-AI/python-chebai/issues/48#issuecomment-2332645174