Closed: aditya0by0 closed this 2 weeks ago
RaggedCollator is failing!
There is a potential misalignment issue in the RaggedCollator class when processing data where some labels are None. Currently, the code correctly omits None labels from the y list but does not simultaneously remove the corresponding features from the x list. This causes a misalignment between features and labels, leading to incorrect training or evaluation outcomes.
Test Case: tests/unit/collators/testRaggedCollator.test_call_with_missing_entire_labels
Currently, this test fails because the feature corresponding to the None label is not omitted, causing a misalignment between result.x and result.y.
Please let me know whether this test case is relevant and aligned with the purpose of the RaggedCollator class. Additionally, please confirm that the expected results in the test case are appropriate and consistent with the class's intended functionality.
To fix the issue, the features (x) should also be filtered based on the non_null_labels indices, ensuring that x and y remain aligned.
Here's the corrected portion of the code:
non_null_labels = [i for i, r in enumerate(y) if r is not None]
y = self.process_label_rows(
tuple(ye for i, ye in enumerate(y) if i in non_null_labels)
)
x = [xe for i, xe in enumerate(x) if i in non_null_labels] # Filter x based on non_null_labels
loss_kwargs["non_null_labels"] = non_null_labels
This ensures that both x and y contain only the valid (non-None) entries and that they remain properly aligned.
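For illustration, a toy run of the snippet above with made-up values:

x = ["A", "B", "C"]
y = [[1], None, [0]]
non_null_labels = [i for i, r in enumerate(y) if r is not None]  # [0, 2]
x = [xe for i, xe in enumerate(x) if i in non_null_labels]  # ["A", "C"]
y = [ye for i, ye in enumerate(y) if i in non_null_labels]  # [[1], [0]]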
> There is a potential misalignment issue in the RaggedCollator class when processing data where some labels are None. Currently, the code correctly omits None labels from the y list but does not simultaneously remove the corresponding features from the x list. This causes a misalignment between features and labels, leading to incorrect training or evaluation outcomes.
This is intended behaviour. In some training setups, we use a mixture of labelled and unlabelled data in combination with certain loss functions that allow for partially unlabelled data (e.g. fuzzy loss). In order to compute the usual metrics (F1, MSE, etc.), one needs to filter out the predictions for unlabelled data and compute the metrics only on labelled data. The indices of these data points are stored in the 'non_null_labels' field and used by our implementations of Electra and MixedLoss.
Therefore, the shape of y should only align with x modulo non_null_labels.
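For clarity, here is a minimal sketch (not the actual Electra or MixedLoss code; all names are illustrative) of how the stored indices can re-align a batch of predictions with the labelled targets before computing metrics:

import torch

def metric_inputs(preds: torch.Tensor, y: torch.Tensor, non_null_labels: list) -> tuple:
    # preds has one row per batch element (aligned with x);
    # y contains rows only for the labelled examples.
    labelled_preds = preds[non_null_labels]
    return labelled_preds, y

# Toy batch of 3 inputs where only indices 0 and 2 are labelled:
preds = torch.rand(3, 5)
y = torch.randint(0, 2, (2, 5))
labelled_preds, y = metric_inputs(preds, y, [0, 2])
assert labelled_preds.shape == y.shape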
term_callback
A test case for term_callback is failing because it does not correctly ignore/skip obsolete ChEBI terms. As a result, the test cases for _extract_class_hierarchy and _graph_to_raw_dataset are also failing, since they use the output of term_callback.
Current Behavior:
The _graph_to_raw_dataset method filters out data instances:
data = data[~data["SMILES"].isnull()]  # drop rows without a SMILES string
data = data[data.iloc[:, self._LABELS_START_IDX:].any(axis=1)]  # drop rows with no positive label
So, even though obsolete terms are not specifically filtered, their lack of SMILES strings ensures they are excluded from the dataset.
Potential Future Issue:
Example of a Problematic Obsolete Term:
[Term]
id: CHEBI:77533
name: Compound G
is_a: CHEBI:99999
property_value: http://purl.obolibrary.org/obo/chebi/smiles "C1=C1Br" xsd:string
is_obsolete: true
If terms like this exist in future releases, the current approach could lead to errors because obsolete terms with SMILES strings might slip through the filters.
Proposed Solution:
We can update the term_callback logic to explicitly ignore obsolete terms by checking for the is_obsolete clause:
if isinstance(clause, fastobo.term.IsObsoleteClause):
    if clause.obsolete:
        # If the term document contains an "is_obsolete: true" clause, skip this term.
        return False
This solution would ensure that obsolete terms are skipped before they are processed, preventing potential future issues with the dataset.
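For context, a minimal sketch of where such a check could sit inside term_callback; apart from fastobo's own classes (which the snippet above already uses), the surrounding structure here is assumed, not the repository's actual implementation:

import fastobo

def term_callback(term_doc):
    # term_doc is assumed to be a fastobo.term.TermFrame; iterating it yields its clauses.
    for clause in term_doc:
        if isinstance(clause, fastobo.term.IsObsoleteClause) and clause.obsolete:
            # Skip obsolete terms even if they carry a SMILES string.
            return False
    # ... the existing extraction logic (id, name, SMILES, parents) would follow here
    return True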
Tox21MolNet
I've encountered an issue with the setup_processed method when working with the Tox21MolNet class and its data (the tox21.csv file). It appears that the file does not include a header or key named "group", which causes a KeyError in the line:
groups = np.array([d["group"] for d in data])
Additionally, the _load_data_from_file method does not seem to use any Reader to create or handle a "group" key in the data. As a result, the group key does not exist in the dictionaries produced by _load_data_from_file, leading to the observed error.
The _load_data_from_file method only yields three keys: features, labels, and ident:
yield dict(features=smiles, labels=labels, ident=row["mol_id"])
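One possible direction, sketched under the assumption that tox21.csv provides no grouping information at all, would be to yield an explicit group key and have setup_processed fall back to a non-grouped split:

# Sketch only, not a tested fix.
# In _load_data_from_file, make the key explicit:
yield dict(features=smiles, labels=labels, ident=row["mol_id"], group=None)

# In setup_processed, guard the grouped-split logic:
groups = np.array([d.get("group") for d in data])
if all(g is None for g in groups):
    ...  # fall back to a split that does not rely on groups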
Please let me know your suggestions on this issue.
As discussed, here are some additional test cases (I also added them at the top):
- ChEBIOverXPartial: should cover the one-label scenario from PR #54
- DynamicDataset: check whether the data splits are stratified
- setup_processed tests: should also check if the output has a structure that can be read by the collator (e.g., features should be tensor-able) -> expected to fail before #56 is resolved
- Readers: should also check if the "real" token order (as defined by tokens.txt) stays consistent
To ensure the token order in the "real" tokens.txt file remains consistent, we can maintain a corresponding duplicate tokens.txt file in the test directory. This duplicate file will serve as the reference for validating the order of tokens in the actual tokens.txt. During testing, we will compare the contents of the real file against this reference to check for consistency in both content and order.
Alternatively, we could verify the token order before and after any token insertion to ensure order consistency without the need for a duplicate file. However, this approach would be vulnerable to manual or direct changes in the tokens.txt file, which may not be detected.
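To make the first option concrete, here is a hedged sketch of such a test; both paths are hypothetical and would need to match the actual repository layout:

import unittest
from pathlib import Path

class TestTokenOrder(unittest.TestCase):
    # Hypothetical paths; adjust to the real locations in the repository.
    REAL_TOKENS = Path("chebai/preprocessing/bin/smiles_token/tokens.txt")
    REFERENCE_TOKENS = Path("tests/unit/mock_data/tokens.txt")

    def test_token_order_is_consistent(self):
        real = self.REAL_TOKENS.read_text().splitlines()
        reference = self.REFERENCE_TOKENS.read_text().splitlines()
        # New tokens may have been appended since the reference was created,
        # but the existing prefix must match in both content and order.
        self.assertEqual(real[: len(reference)], reference)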
Please let me know if you have any suggestions or alternative approaches to this method.
@sfluegel05, could you please provide your suggestions/input on the comment above?
I have added the test for protein pretraining. Now all the unit tests are working. Please review and merge.
Do you think it would be appropriate to include the unit tests related to Tox21MolNet in the same pull request or issue that addresses its rectification, specifically PR #56?
Thanks for finishing this. I removed the link to the unit test issue since we still have the toxicity-related unit tests which are not included in this PR.
I agree. I added a note for that in #56
Dependency: Issue #45

Unit Testing Checklist
reader.py
- to_data() with sample input values.
- _read_data() with sample SMILES strings.
- _read_data() with sample input values.
- _read_data() with sample SELFIES strings.
- _read_data() with sample protein sequences.

collate.py
- __call__() with sample data.
- __call__() with sample data.
- process_label_rows() with sample data.

datasets/base.py
- _filter_labels() with sample input values.
- get_test_split() with sample data.
- get_train_val_splits_given_test() with sample data.

datasets/chebi.py
- _extract_class_hierarchy() with mock data.
- _graph_to_raw_dataset() with mock data.
- _load_dict() with mock data.
- _setup_pruned_test_set() with mock data.
- select_classes() with sample data.
- extract_class_hierarchy() with mock data.
- term_callback() with sample data.

datasets/go_uniprot.py
- _extract_class_hierarchy() with mock data.
- term_callback() with sample data.
- _graph_to_raw_dataset() with mock data.
- _get_swiss_to_go_mapping() with mock data.
- _load_dict() with mock data.
- select_classes() with sample data.

datasets/tox21.py
- setup_processed() with mock data.
- _load_data_from_file() using mock file operations.
- _load_dict() with mock data.

datasets/protein_pretraining.py
- _parse_protein_data_for_pretraining() with mock data.

Note: Tests for Tox21MolNet will be added later in a separate PR/branch after completion of issue #53.