ChEB-AI / python-chebai

Data processing performance needs to be improved #32

Closed: sfluegel05 closed this issue 2 months ago

sfluegel05 commented 5 months ago

Data preprocessing is currently not as efficient as it could or should be. On my machine, creating a new dataset (for ChEBI version 231) took several minutes, a large share of which went into creating the data splits.

Most of this is not a problem, since datasets can be reused between training runs. However, with the dynamic data splits added in PR #29, the split creation is repeated at the start of each run.

aditya0by0 commented 5 months ago

Hi @sfluegel05, after profiling the preprocessing step, it is clear that the majority of the time is spent in the split function of MultilabelStratifiedShuffleSplit, which comes from an imported third-party library.
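
For context, the call in question looks roughly like this (a hedged sketch; the matrix shapes and parameters below are placeholders, not our actual configuration):

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

# Toy stand-ins for the real feature matrix and multi-hot label matrix.
X = np.random.rand(1000, 16)
y = (np.random.rand(1000, 500) > 0.95).astype(int)

msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
# According to the profile, nearly all of the preprocessing time is spent
# inside this split() call.
train_idx, test_idx = next(msss.split(X, y))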

I have attached the .pstat file generated by applying cProfile to our code for your reference: dynamic_split_cprofiler_stats.zip
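
For anyone who wants to reproduce the measurement, a profile like this can be generated and inspected with the standard cProfile and pstats modules (run_preprocessing is a placeholder for the actual preprocessing entry point):

import cProfile
import pstats

# "run_preprocessing" is a placeholder; substitute the real preprocessing call.
cProfile.run("run_preprocessing()", filename="dynamic_split_cprofiler.pstat")

stats = pstats.Stats("dynamic_split_cprofiler.pstat")
stats.sort_stats("tottime").print_stats(20)  # top 20 functions by internal time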

Given that the time-consuming part lies within a third-party library function, I am currently unable to think of a direct way to optimize it further.

I would appreciate any further insights or suggestions you might have regarding this performance issue.

Statistics, sorted by time required in decreasing order: iterative_cprofile_stats

Call Graph: iterative_cprofile_call_graph

sfluegel05 commented 5 months ago

In that case, I'm afraid there is not much we can do regarding the data splits. Could you have a look at the whole preprocessing pipeline as well? Maybe there are some other steps we can optimise.

sfluegel05 commented 5 months ago

I tried replacing the iterative_stratification package with scikit-multilearn. But apparently, that implementation is even less efficient than iterative_stratification (although the algorithm is the same).
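
For reference, the attempted swap looked roughly like this (a sketch with toy data; note that skmultilearn's helper returns the split data rather than index arrays):

import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

X = np.random.rand(1000, 16)                        # toy feature matrix
y = (np.random.rand(1000, 500) > 0.95).astype(int)  # toy multi-hot labels

# Same underlying algorithm (iterative stratification), different
# implementation; in practice this turned out to be the slower of the two.
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.15)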

aditya0by0 commented 2 months ago

Refactor _extract_class_hierarchy Method in ChEBIOverXPartial for Data Preprocessing

We can refactor the _extract_class_hierarchy method in the subclass to leverage the existing functionality of the superclass. This change will improve code reuse and maintainability.

Current Implementation

The current implementation in the subclass duplicates the logic for extracting the class hierarchy from the ChEBI ontology and then filters the graph to include only the subclasses of the top class ID.

Method Implementation in _ChEBIDataExtractor
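
(The method body was embedded here in the original comment; below is a minimal sketch of its general shape, not the actual code. parse_obo_terms is a hypothetical parsing helper.)

import networkx as nx

def _extract_class_hierarchy(self, chebi_path: str) -> nx.DiGraph:
    # Sketch only: build a superclass -> subclass digraph from parsed
    # (term, parents) pairs, then transitively close it so that
    # g.successors(x) yields *all* descendants of x, not just children.
    g = nx.DiGraph()
    for term_id, parent_ids in parse_obo_terms(chebi_path):  # hypothetical helper
        g.add_node(term_id)
        for parent_id in parent_ids:
            g.add_edge(parent_id, term_id)
    return nx.transitive_closure_dag(g)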

Method Implementation in ChEBIOverXPartial
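
(Again a sketch rather than the actual code: the subclass repeats the same parsing and closure as above and only then restricts the graph to the subtree below self.top_class_id.)

def _extract_class_hierarchy(self, chebi_path: str) -> nx.DiGraph:
    # Sketch only: duplicated parsing and closure logic ...
    g = nx.DiGraph()
    for term_id, parent_ids in parse_obo_terms(chebi_path):  # hypothetical helper
        g.add_node(term_id)
        for parent_id in parent_ids:
            g.add_edge(parent_id, term_id)
    g = nx.transitive_closure_dag(g)
    # ... plus the only subclass-specific step: filter to the top class's subtree.
    return g.subgraph(list(g.successors(self.top_class_id)) + [self.top_class_id])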

Proposed Change

We can simplify the subclass method by calling the superclass method to extract the full class hierarchy and then filter the graph to include only the descendants of self.top_class_id. Since the graph is already transitively closed, we can use g.successors directly instead of nx.descendants, which simplifies the filtering process. This approach ensures that we reuse the existing extraction logic and focus only on the specific filtering needed for the subclass.
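
As a toy illustration of why this works: after transitive closure, a node's direct successors coincide with its descendants in the original graph.

import networkx as nx

g = nx.DiGraph([("A", "B"), ("B", "C")])  # A -> B -> C
tc = nx.transitive_closure_dag(g)         # adds the shortcut edge A -> C
assert set(tc.successors("A")) == nx.descendants(g, "A") == {"B", "C"}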

Updated Method

Here's the proposed updated method for the subclass:

import networkx as nx  # required for the nx.DiGraph return type

def _extract_class_hierarchy(self, chebi_path: str) -> nx.DiGraph:
    """
    Extracts a subset of ChEBI based on subclasses of the top class ID.

    This method calls the superclass method to extract the full class hierarchy,
    then restricts the graph to the descendants of the top class ID.

    Args:
        chebi_path (str): The file path to the ChEBI ontology file.

    Returns:
        nx.DiGraph: The extracted class hierarchy as a directed graph, limited to
        the top class ID and its descendants.
    """
    g = super()._extract_class_hierarchy(chebi_path)
    # The graph is transitively closed, so successors() already yields
    # all descendants of the top class, not just its direct children.
    g = g.subgraph(list(g.successors(self.top_class_id)) + [self.top_class_id])
    return g

Please review the proposed change and let me know if you have any feedback or concerns.

sfluegel05 commented 2 months ago

I agree that this makes sense. However, I don't see how it is connected to this issue. Could you open a new issue for it and then implement it?

sfluegel05 commented 2 months ago

We decided that at this point no obvious performance improvements can be made. With the dynamic data split implementation, it is possible to reuse splits and datasets. We recommend doing that to avoid long loading times.
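
As a minimal sketch of what such reuse can look like (a generic pattern, not python-chebai's actual caching mechanism; the file name and helper are made up):

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

SPLIT_FILE = "splits.npz"  # hypothetical cache location

def get_or_create_splits(y: np.ndarray):
    """Load cached train/test indices if present; otherwise compute and cache them."""
    try:
        cached = np.load(SPLIT_FILE)
        return cached["train"], cached["test"]
    except FileNotFoundError:
        msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
        X = np.zeros((y.shape[0], 1))  # X is only needed for its row count here
        train_idx, test_idx = next(msss.split(X, y))
        np.savez(SPLIT_FILE, train=train_idx, test=test_idx)
        return train_idx, test_idx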