Open sshivam95 opened 2 weeks ago
Use the `iterator=True` option in `pandas.read_csv`.
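A minimal sketch of what chunked reading with `iterator=True` could look like; the in-memory buffer stands in for the real triples file, and the column names and chunk size are placeholders:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the big triples file; in practice this
# would be a path like "triples.tsv".
data = io.StringIO("s1\tp1\to1\ns2\tp2\to2\ns3\tp3\to3\n")

reader = pd.read_csv(
    data,
    sep="\t",
    names=["subject", "predicate", "object"],
    iterator=True,   # return a TextFileReader instead of one big DataFrame
    chunksize=2,     # rows per chunk; millions in practice
)

total_rows = 0
for chunk in reader:
    # each chunk is a small DataFrame that fits in memory
    total_rows += len(chunk)

print(total_rows)  # 3
```

This keeps memory bounded by the chunk size rather than the file size.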
Won't do. Instead, create files and use those files to train the models.
Update: Partition the dataset by domain (the namespace in XML, or the authority part of the base URL). The subject and object of a triple should share the same domain. If a subject is connected to a blank node, the blank node is assigned to the domain of that subject.
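The partitioning rule above could be sketched as follows; the triple tuples and URLs are illustrative, and `domain_of` is a hypothetical helper using the authority part of the IRI:

```python
from urllib.parse import urlparse
from collections import defaultdict

def domain_of(term):
    """Authority part of an IRI, or None for blank nodes."""
    if term.startswith("_:"):
        return None  # blank node: no domain of its own
    return urlparse(term).netloc or None

# Toy triples (subject, predicate, object); placeholders, not real data.
triples = [
    ("http://dbpedia.org/resource/A", "p", "http://dbpedia.org/resource/B"),
    ("http://dbpedia.org/resource/A", "p", "_:b0"),  # blank-node object
    ("http://example.org/x", "p", "http://example.org/y"),
]

partitions = defaultdict(list)
for s, p, o in triples:
    dom = domain_of(s)
    # Keep the triple if the object shares the subject's domain,
    # or is a blank node (which inherits the subject's domain).
    if domain_of(o) in (None, dom):
        partitions[dom].append((s, p, o))

print(sorted(partitions))  # ['dbpedia.org', 'example.org']
```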
Update: For blank nodes connected to other blank nodes, we have to take care of the CBD (Concise Bounded Description).
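A rough sketch of collecting the blank-node closure in the spirit of a CBD: starting from a subject, follow blank-node objects transitively and keep every triple whose subject is a reached blank node. The triple representation and data are placeholders:

```python
from collections import defaultdict

def is_bnode(term):
    return term.startswith("_:")

def bnode_closure(triples, start_subject):
    """Triples about start_subject plus, recursively, all triples whose
    subject is a blank node reachable from it (CBD-style)."""
    by_subject = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)

    result, stack, seen = [], [start_subject], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in by_subject[node]:
            result.append((s, p, o))
            if is_bnode(o):
                stack.append(o)  # follow blank nodes transitively
    return result

triples = [
    ("http://ex.org/a", "p", "_:b1"),
    ("_:b1", "q", "_:b2"),  # blank node pointing at another blank node
    ("_:b2", "r", "http://ex.org/c"),
    ("http://ex.org/z", "p", "o"),  # unrelated triple, not collected
]
closure = bnode_closure(triples, "http://ex.org/a")
print(len(closure))  # 3
```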
Update: Domain-specific datasets have been created for the following datasets. Datasets materialized:
Link them with Wikidata using LIMES, then clean them by removing literals and materializing the blank nodes.
Update: Discussed with Sherif how to minimize the number of KGs by merging the smaller KGs into bigger ones when the subject already exists in the bigger one. A threshold needs to be found for the following datasets. Datasets materialized:
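One way the merge criterion could look, sketched as a subject-overlap check; the threshold value, function name, and sets are placeholders, not a decided design:

```python
def should_merge(small_subjects, big_subjects, threshold=0.5):
    """Merge the smaller KG into the bigger one if enough of its
    subjects already exist in the bigger KG (threshold is a placeholder)."""
    if not small_subjects:
        return False
    overlap = len(small_subjects & big_subjects) / len(small_subjects)
    return overlap >= threshold

big = {"s1", "s2", "s3", "s4"}
small = {"s1", "s2", "x"}

print(should_merge(small, big))  # 2/3 of small's subjects are in big -> True
```

Sweeping the threshold over the candidate datasets would show where the merge stops collapsing genuinely distinct KGs.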
The original idea of using the `skiprows` parameter together with `nrows` in `pandas.read_csv` was a bad one. Pandas implements `skiprows` in a bafflingly memory-intensive way: on `skiprows=12_000_000_000`, it is basically doing `skiprows = set(list(range(skiprows)))`. It builds a giant list and a set, each containing 12 billion integers!
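A sketch of a memory-safe alternative: consume the leading lines lazily from the file handle and only then hand it to `pandas.read_csv` with `nrows`, so no range-sized list or set is ever built. The in-memory buffer and offsets are illustrative:

```python
import io
import itertools
import pandas as pd

# Stand-in for the huge triples file.
data = io.StringIO("\n".join(f"s{i}\tp\to{i}" for i in range(10)))

# Skip the first `offset` lines one at a time instead of passing an
# integer skiprows to pandas.
offset, nrows = 6, 3
for _ in itertools.islice(data, offset):
    pass  # discard each skipped line immediately

# pandas now starts reading from the current file position.
df = pd.read_csv(data, sep="\t", names=["s", "p", "o"], nrows=nrows)
print(df["s"].tolist())  # ['s6', 's7', 's8']
```

The same effect is available without manual skipping via `chunksize`/`iterator=True`, advancing chunk by chunk to the desired window.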