ChEB-AI / python-chebai

GNU Affero General Public License v3.0

PreProcessing unit tests #48

Closed aditya0by0 closed 2 weeks ago

aditya0by0 commented 2 months ago

Dependency:

Unit Testing Checklist

- [ ] reader.py
- [ ] collate.py
- [ ] datasets/base.py
- [ ] datasets/chebi.py
- [ ] datasets/go_uniprot.py
- [ ] datasets/tox21.py
- [ ] datasets/protein_pretraining.py

Note: Tests for Tox21MolNet will be added later in a separate PR/branch after completion of issue #53

Tox21MolNet:

- [ ] Write unit tests for setup_processed() with mock data.
  - [ ] Check that the output format is correct: the collator expects a dict with features, labels, and ident keys, and the features have to be convertible to a tensor.
- [ ] Write unit tests for _load_data_from_file() using mock file operations.
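A minimal sketch of how the mock-based loader test could look. The CSV columns, label names, and the standalone load_data_from_file helper below are assumptions for illustration, not the actual Tox21MolNet interface:

```python
import csv
import io


def load_data_from_file(handle):
    """Stand-in for Tox21MolNet._load_data_from_file: yields one dict per row."""
    for row in csv.DictReader(handle):
        # Empty label cells mean "unlabelled" and are kept as None, not 0.
        labels = [int(row[k]) if row[k] != "" else None for k in ("NR-AR", "NR-ER")]
        yield dict(features=row["smiles"], labels=labels, ident=row["mol_id"])


def test_output_format():
    mock_csv = "smiles,mol_id,NR-AR,NR-ER\nCCO,TOX1,1,0\nc1ccccc1,TOX2,,1\n"
    data = list(load_data_from_file(io.StringIO(mock_csv)))
    assert len(data) == 2
    # The collator expects exactly these three keys.
    assert all(set(d) == {"features", "labels", "ident"} for d in data)
    # A missing label cell should become None, not 0.
    assert data[1]["labels"] == [None, 1]


test_output_format()
```

Using io.StringIO (or unittest.mock's mock_open) avoids touching the real tox21.csv while still exercising the parsing logic.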
aditya0by0 commented 2 months ago

A Test for RaggedCollator is failing!

Issue Description

There is a potential misalignment issue in the RaggedCollator class when processing data where some labels are None. Currently, the code correctly omits None labels from the y list but does not simultaneously remove the corresponding features from the x list. This causes a misalignment between features and labels, leading to incorrect training or evaluation outcomes.

Failing Test Case

tests/unit/collators/testRaggedCollator.test_call_with_missing_entire_labels

Currently, this test fails because the feature corresponding to the None label is not omitted, causing a misalignment between result.x and result.y.

Please let me know if this test case is relevant and correctly aligned with the purpose of the RaggedCollator class. Additionally, confirm if the expected results in the test case are appropriate and consistent with the class's intended functionality.

Potential Solution

To fix the issue, the features (x) should also be filtered based on the non_null_labels index, ensuring that x and y remain aligned.

Here's the corrected portion of the code:

non_null_labels = [i for i, r in enumerate(y) if r is not None]
y = self.process_label_rows(
    tuple(ye for i, ye in enumerate(y) if i in non_null_labels)
)
x = [xe for i, xe in enumerate(x) if i in non_null_labels]  # Filter x based on non_null_labels
loss_kwargs["non_null_labels"] = non_null_labels

This ensures that both x and y contain only the valid (non-None) entries and that they remain properly aligned.
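As a small standalone illustration of the proposed filtering (plain lists stand in here for the actual feature and label rows, and process_label_rows is left out):

```python
def filter_aligned(x, y):
    """Keep only the entries whose label is not None, so x and y stay aligned."""
    non_null_labels = [i for i, r in enumerate(y) if r is not None]
    y = [ye for i, ye in enumerate(y) if i in non_null_labels]
    x = [xe for i, xe in enumerate(x) if i in non_null_labels]
    return x, y, non_null_labels


x = [[1, 2], [3, 4], [5, 6]]
y = [[0, 1], None, [1, 1]]
x, y, idx = filter_aligned(x, y)
assert x == [[1, 2], [5, 6]]  # feature for the None label is dropped
assert y == [[0, 1], [1, 1]]
assert idx == [0, 2]
```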

MGlauer commented 2 months ago

There is a potential misalignment issue in the RaggedCollator class when processing data where some labels are None. Currently, the code correctly omits None labels from the y list but does not simultaneously remove the corresponding features from the x list. This causes a misalignment between features and labels, leading to incorrect training or evaluation outcomes.

This is intended behaviour. In some training setups, we use a mixture of labelled and unlabelled data in combination with loss functions that allow for partially unlabelled data (e.g. fuzzy loss). In order to compute the usual metrics (F1, MSE, etc.), one needs to filter the predictions for unlabelled data and compute them only on labelled data. The indices of these data points are stored in the 'non_null_labels' field and used by our implementations of Electra and MixedLoss.

MGlauer commented 2 months ago

Therefore, the shape of y should only align with x modulo non_null_labels.
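To make the "modulo non_null_labels" relationship concrete, here is a sketch of how downstream code can restrict metric computation to labelled entries; the function name and data are illustrative, not the actual Electra/MixedLoss implementation:

```python
def metrics_subset(predictions, labels, non_null_labels):
    """Select only the labelled positions before computing F1, MSE, etc.

    predictions aligns with the full batch (x); labels only covers the
    labelled part, so predictions must be filtered via non_null_labels.
    """
    preds_labelled = [predictions[i] for i in non_null_labels]
    assert len(preds_labelled) == len(labels)
    return preds_labelled, labels


# Batch of 4 inputs, two of which had no label.
predictions = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0]  # only for positions 0 and 2
preds, labs = metrics_subset(predictions, labels, non_null_labels=[0, 2])
assert preds == [0.9, 0.7]
```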

aditya0by0 commented 2 months ago

Test Case Failing for term_callback

A test case for term_callback is failing because the function does not correctly ignore/skip obsolete ChEBI terms. As a result, the test cases for _extract_class_hierarchy and _graph_to_raw_dataset also fail, since they consume the output of term_callback.

Current Behavior:

Potential Future Issue:

Example of a Problematic Obsolete Term:

[Term]
id: CHEBI:77533
name: Compound G
is_a: CHEBI:99999
property_value: http://purl.obolibrary.org/obo/chebi/smiles "C1=C1Br" xsd:string
is_obsolete: true

If terms like this exist in future releases, the current approach could lead to errors because obsolete terms with SMILES strings might slip through the filters.

Proposed Solution: We can update the term_callback logic to explicitly ignore obsolete terms by checking for the is_obsolete clause:

if isinstance(clause, fastobo.term.IsObsoleteClause):
    if clause.obsolete:
        # The term document contains an "is_obsolete: true" clause; skip this term.
        return False

This solution would ensure that obsolete terms are skipped before they are processed, preventing potential future issues with the dataset.
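For illustration, the intended skip behaviour can be sketched without fastobo, on raw OBO text. This is a simplified stand-in for the real term_callback, which operates on fastobo clause objects rather than strings:

```python
def term_callback(term_text):
    """Return False (skip) for obsolete terms; otherwise extract id and SMILES."""
    smiles = None
    term_id = None
    for line in term_text.strip().splitlines():
        if line.startswith("is_obsolete:") and "true" in line:
            # Obsolete terms must be skipped even if they carry a SMILES string.
            return False
        if line.startswith("id:"):
            term_id = line.split("id:", 1)[1].strip()
        if "chebi/smiles" in line:
            smiles = line.split('"')[1]
    return {"id": term_id, "smiles": smiles}


obsolete_term = """[Term]
id: CHEBI:77533
name: Compound G
is_a: CHEBI:99999
property_value: http://purl.obolibrary.org/obo/chebi/smiles "C1=C1Br" xsd:string
is_obsolete: true
"""
assert term_callback(obsolete_term) is False  # the problematic term is filtered out
```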

aditya0by0 commented 2 months ago

Facing a Technical issue in Tox21MolNet:

I've encountered an issue with the setup_processed method when working with the Tox21MolNet and its data (tox21.csv file). It appears that the file does not include a header or key named "group", which is causing a KeyError in the line:

groups = np.array([d["group"] for d in data])

Additionally, the _load_data_from_file method does not seem to use any Reader to create or handle a "group" key in the data. As a result, the "group" key does not exist in the dictionaries produced by _load_data_from_file, leading to the observed error. The method only yields three keys: features, labels, and ident:

yield dict(features=smiles, labels=labels, ident=row["mol_id"])
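The failure can be reproduced with plain dictionaries, independently of the actual dataset class (the row values below are made up):

```python
# Rows shaped like the output of _load_data_from_file: no "group" key is ever set.
data = [
    dict(features="CCO", labels=[1, 0], ident="TOX1"),
    dict(features="c1ccccc1", labels=[0, 1], ident="TOX2"),
]

try:
    groups = [d["group"] for d in data]  # mirrors the failing line in setup_processed
except KeyError as e:
    missing = e.args[0]

assert missing == "group"  # KeyError: 'group'
```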

Please let me know your suggestions on this issue.

sfluegel05 commented 1 month ago

As discussed, here are some additional test cases (I also added them at the top):

aditya0by0 commented 1 month ago
  • Readers: Should also check if the "real" token order (as defined by tokens.txt) stays consistent

To ensure the token order in the "real" tokens.txt file remains consistent, we can maintain a corresponding duplicate tokens.txt file in the test directory. This duplicate file will serve as the reference for validating the order of tokens in the actual tokens.txt. During testing, we will compare the contents of the real file against this reference to check for consistency in both content and order.

Alternatively, we could verify the token order before and after any token insertion to ensure order consistency without the need for a duplicate file. However, this approach would be vulnerable to manual or direct changes in the tokens.txt file, which may not be detected.
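Both variants could be sketched as follows; the file names and helper functions are assumptions, not existing code:

```python
from pathlib import Path
import tempfile


def tokens_match_reference(tokens_path, reference_path):
    """Strict check: the real tokens.txt must equal the checked-in reference copy."""
    return (Path(tokens_path).read_text().splitlines()
            == Path(reference_path).read_text().splitlines())


def reference_is_prefix(tokens_path, reference_path):
    """Looser check: existing tokens may only be appended to, never reordered."""
    actual = Path(tokens_path).read_text().splitlines()
    expected = Path(reference_path).read_text().splitlines()
    return actual[: len(expected)] == expected


with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "tokens.txt"
    ref = Path(tmp) / "tokens_reference.txt"
    ref.write_text("C\nO\nN\n[Br]\n")

    real.write_text("C\nO\nN\n[Br]\n")        # unchanged file
    assert tokens_match_reference(real, ref)

    real.write_text("C\nO\nN\n[Br]\n[Se]\n")  # new token appended at the end
    assert not tokens_match_reference(real, ref)
    assert reference_is_prefix(real, ref)      # order of existing tokens intact

    real.write_text("O\nC\nN\n[Br]\n")        # reordered: both checks fail
    assert not reference_is_prefix(real, ref)
```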

Please let me know if you have any suggestions or alternative approaches to this method.

aditya0by0 commented 1 month ago

@sfluegel05, could you please provide your suggestions/input on the following comment?

  • Readers: Should also check if the "real" token order (as defined by tokens.txt) stays consistent


aditya0by0 commented 3 weeks ago

I have added the test for protein pretraining. Now all the unit tests are working. Please review and merge.

aditya0by0 commented 2 weeks ago

Do you think it would be appropriate to include the unit tests related to Tox21MolNet in the same pull request or issue that addresses its rectification, specifically PR #56?

Thanks for finishing this. I removed the link to the unit test issue since we still have the toxicity-related unit tests which are not included in this PR.

sfluegel05 commented 2 weeks ago

I agree. I added a note for that in #56