dice-group / Ontolearn

Ontolearn is an open-source software library for explainable structured machine learning in Python. It learns OWL class expressions from positive and negative examples.
https://ontolearn-docs-dice-group.netlify.app/index.html
MIT License

High memory consumption of data list #491

Closed: LckyLke closed this issue 1 week ago

LckyLke commented 1 week ago

I noticed that the data list uses far too much RAM when running the retrieval_eval script on medium-sized and larger datasets: at the end of a run on the suramin.owl dataset, which is not even that big, the list has grown to roughly 8.4 GB.

For bigger datasets the script therefore ends up relying on swap memory (unless one has enormous amounts of RAM), which makes it extremely slow. I suggest we stop saving the retrieval_y and retrieval_neural_y sets in the data list; they are largely redundant once the F1 score and Jaccard similarity have been computed.

This change reduces the size to only about 12 MB for the suramin.owl dataset.

https://github.com/dice-group/Ontolearn/blob/77fae24bfeeeda3f0a67749527863089e12aef69/examples/retrieval_eval.py#L228-L239
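
For concreteness, a minimal sketch of the suggested change, assuming the per-row dict mirrors the snippet posted later in this thread and that `data` is the list the script accumulates: keep only the scalar metrics and drop the two set-valued retrieval columns.

    # Sketch of the proposed fix: append only scalar metrics per expression
    # and drop the set-valued columns that dominate memory usage.
    data.append({
        "Expression": owl_expression_to_dl(expression),
        "Type": type(expression).__name__,
        "Jaccard Similarity": jaccard_sim,
        "F1": f1_sim,
        "Runtime Benefits": runtime_y - runtime_neural_y,
        "Runtime Neural": runtime_neural_y,
        # "Symbolic_Retrieval": retrieval_y,                # omitted: full instance set
        # "Symbolic_Retrieval_Neural": retrieval_neural_y,  # omitted: full instance set
    })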

LckyLke commented 1 week ago

I have a working fix on this branch/commit: 537f786755263d39782bd4f5808a04109db3554e

Demirrr commented 1 week ago

We could write each row directly into the CSV residing on disk. There should be a way to stream rows out of a pandas DataFrame.

LckyLke commented 1 week ago

Yes, also a good idea 👍🏻

LckyLke commented 1 week ago

> We could write each row directly into the CSV residing on disk. There should be a way to stream rows out of a pandas DataFrame.

This works, but it also makes the script very slow. I would prefer this: https://github.com/dice-group/Ontolearn/commit/537f786755263d39782bd4f5808a04109db3554e

LckyLke commented 1 week ago
        # Build a one-row DataFrame for the current expression and its metrics.
        df_row = pd.DataFrame(
            [{
                "Expression": owl_expression_to_dl(expression),
                "Type": type(expression).__name__,
                "Jaccard Similarity": jaccard_sim,
                "F1": f1_sim,
                "Runtime Benefits": runtime_y - runtime_neural_y,
                "Runtime Neural": runtime_neural_y,
                "Symbolic_Retrieval": retrieval_y,
                "Symbolic_Retrieval_Neural": retrieval_neural_y,
            }])
        # Append the row to the CSV; write the header only if the file is new.
        df_row.to_csv(args.path_report, mode='a', header=not file_exists, index=False)
        file_exists = True

This is how I wrote each row to disk directly, and it is very slow :(
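
The slowdown likely comes from constructing a fresh one-row DataFrame and reopening the CSV file on every iteration. A minimal sketch of a cheaper per-row write, assuming the same column names, keeps a single file handle open and uses the standard library's csv.DictWriter (the helper name is hypothetical):

    import csv

    def stream_rows_to_csv(path, rows):
        # Hypothetical helper: write result rows incrementally. Keeping one
        # open file handle avoids reopening the file and building a one-row
        # DataFrame on every iteration.
        fieldnames = ["Expression", "Type", "Jaccard Similarity", "F1",
                      "Runtime Benefits", "Runtime Neural"]
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
            writer.writeheader()
            for row in rows:  # rows can be any iterable of dicts, e.g. a generator
                writer.writerow(row)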

Demirrr commented 1 week ago

Let's solve one problem at a time :) Let's first reduce the memory usage :) Later we can write to disk batch-wise, or create a DB and send async writes, since the order doesn't matter.
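
A rough sketch of the batch-wise variant mentioned above (the class name and batch size are hypothetical): buffer rows in memory and append them to the CSV every N rows, so peak memory stays bounded while the file is opened far less often than once per row.

    import pandas as pd

    class BatchedCsvWriter:
        # Hypothetical helper: buffer rows and append them to a CSV in
        # batches, bounding memory while amortizing the cost of file opens.
        def __init__(self, path, batch_size=1000):
            self.path, self.batch_size = path, batch_size
            self.buffer, self.header_written = [], False

        def add(self, row: dict):
            self.buffer.append(row)
            if len(self.buffer) >= self.batch_size:
                self.flush()

        def flush(self):
            if self.buffer:
                pd.DataFrame(self.buffer).to_csv(
                    self.path, mode="a",
                    header=not self.header_written, index=False)
                self.header_written = True
                self.buffer.clear()

An async DB writer would follow the same pattern, with flush() replaced by a non-blocking insert, which is safe here precisely because the row order doesn't matter.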