Closed: LckyLke closed this issue 1 week ago
I have a working fix on this branch/commit: 537f786755263d39782bd4f5808a04109db3554e
We can write each row directly into the CSV on disk. pandas must offer some kind of streaming/append option for DataFrames.
Yes also a good idea 👍🏻
> We can write each row directly into the CSV on disk. pandas must offer some kind of streaming/append option for DataFrames.
This works, but it also makes the script very slow. I would prefer this: https://github.com/dice-group/Ontolearn/commit/537f786755263d39782bd4f5808a04109db3554e
```python
df_row = pd.DataFrame(
    [{
        "Expression": owl_expression_to_dl(expression),
        "Type": type(expression).__name__,
        "Jaccard Similarity": jaccard_sim,
        "F1": f1_sim,
        "Runtime Benefits": runtime_y - runtime_neural_y,
        "Runtime Neural": runtime_neural_y,
        "Symbolic_Retrieval": retrieval_y,
        "Symbolic_Retrieval_Neural": retrieval_neural_y,
    }])
# Append the row to the CSV file
df_row.to_csv(args.path_report, mode='a', header=not file_exists, index=False)
file_exists = True
```
This is how I did the direct write to disk, and it is very slow :(
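For reference, a minimal sketch of what a faster per-row append could look like, assuming the row dict is built the same way as above: it keeps a single file handle open and uses the stdlib `csv.DictWriter`, so no one-row DataFrame is constructed and the file is not reopened on every iteration. The `RowWriter` name and its interface are illustrative, not part of the repo.

```python
import csv
import os


class RowWriter:
    """Illustrative helper: keeps the report file open and appends one
    result dict per call, instead of building a one-row DataFrame and
    reopening the file on every iteration."""

    def __init__(self, path: str, fieldnames):
        # Write the header only if the file is new or still empty.
        write_header = not os.path.exists(path) or os.path.getsize(path) == 0
        self._file = open(path, "a", newline="")
        self._writer = csv.DictWriter(self._file, fieldnames=fieldnames)
        if write_header:
            self._writer.writeheader()

    def write(self, row: dict) -> None:
        # csv.DictWriter serialises the dict straight to the open handle.
        self._writer.writerow(row)

    def close(self) -> None:
        self._file.close()
```

Whether this is actually faster for retrieval_eval would still need to be measured; the point is that the per-row DataFrame construction and the repeated open/close inside `to_csv(mode='a')` are plausible bottlenecks.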
Let's solve one problem at a time :) Let's first reduce the memory usage; later we can write to disk batch-wise, or create a DB and send async writes, since the order doesn't matter.
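A rough sketch of the batch-wise/async idea, under the assumption that the evaluation loop only has to hand each row dict to a queue; the writer thread, batch size and sentinel handling below are illustrative, not an existing Ontolearn API.

```python
import os
import queue
import threading

import pandas as pd


def start_async_writer(path: str, batch_size: int = 256):
    """Illustrative background writer: drains result-row dicts from a queue
    and appends them to the CSV in batches, so the evaluation loop neither
    blocks on I/O nor keeps all rows in memory. Row order is not preserved,
    which is fine since the report is only aggregated afterwards."""
    q: queue.Queue = queue.Queue()

    def worker():
        write_header = not os.path.exists(path)
        buffer = []

        def flush():
            nonlocal write_header, buffer
            if buffer:
                pd.DataFrame(buffer).to_csv(path, mode="a",
                                            header=write_header, index=False)
                write_header, buffer = False, []

        while True:
            row = q.get()
            if row is None:          # sentinel: final flush, then stop
                flush()
                break
            buffer.append(row)
            if len(buffer) >= batch_size:
                flush()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return q, thread
```

The loop would then only call `q.put(row_dict)`; a final `q.put(None)` followed by `thread.join()` flushes the remaining buffer.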
I noticed that the `data` list uses far too much RAM when running the retrieval_eval script on medium-sized or bigger datasets. → At the end of a run with the suramin.owl dataset, which is not even that big, it is ~8.4 GB large.
Thus, for bigger datasets this script tends to fall back on swap memory (unless one has insane amounts of RAM), which makes it extremely slow. I suggest we don't save the retrieval_y and retrieval_neural_y sets in `data`, since they are unnecessary anyway once the F1 score and Jaccard similarity have been computed (see the sketch below the permalink).
→ This change reduces the size to only ~12 MB for the suramin.owl dataset.
https://github.com/dice-group/Ontolearn/blob/77fae24bfeeeda3f0a67749527863089e12aef69/examples/retrieval_eval.py#L228-L239
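For concreteness, a sketch of the proposed slimmed-down row, assuming the same variables and imports (e.g. `owl_expression_to_dl`) as in the snippet linked above; the helper name is made up:

```python
def make_report_row(expression, jaccard_sim, f1_sim,
                    runtime_y, runtime_neural_y) -> dict:
    # Hypothetical helper: keep only scalar metrics in the in-memory `data`
    # list; the retrieval sets themselves are no longer stored, since the
    # Jaccard similarity and F1 score already summarise them.
    return {
        "Expression": owl_expression_to_dl(expression),
        "Type": type(expression).__name__,
        "Jaccard Similarity": jaccard_sim,
        "F1": f1_sim,
        "Runtime Benefits": runtime_y - runtime_neural_y,
        "Runtime Neural": runtime_neural_y,
    }
```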