dice-group / WHALE


Create base training model using a chunksize #9

Open sshivam95 opened 2 weeks ago

sshivam95 commented 2 weeks ago

Using skiprows together with the nrows parameter in pandas.read_csv turned out to be a bad idea.

Pandas implements skiprows in a bafflingly memory-intensive way. With skiprows=12_000_000_000, it basically does skiprows = set(list(range(skiprows))), building a giant list and a set, each containing 12 billion row indices!

sshivam95 commented 2 weeks ago

Use the iterator=True option in pandas.read_csv instead.
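A minimal sketch of chunked reading, assuming a space-separated triples dump; the path, the separator, and the per-chunk training hook train_on_chunk are hypothetical:

```python
import pandas as pd

# Read the dump in bounded-memory chunks instead of skiprows/nrows.
reader = pd.read_csv(
    "dump.nt",                      # hypothetical path to the triples dump
    sep=" ",                        # assumed separator; depends on the dump format
    names=["subject", "predicate", "object"],
    header=None,
    iterator=True,                  # return a TextFileReader instead of one DataFrame
    chunksize=1_000_000,            # rows per chunk; memory stays bounded
)

for chunk in reader:                # each chunk is a DataFrame of up to chunksize rows
    train_on_chunk(chunk)           # hypothetical per-chunk training step
```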

sshivam95 commented 2 weeks ago

Won't do. Create files instead and use these files to train the models.

sshivam95 commented 2 weeks ago

Update: Partition the dataset by domain (the namespace in XML, or the authority part of the base URL). Basically, the domain of the subject and the object should be the same. If the subject is connected to a blank node, the blank node stays in the subject's domain.
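A minimal sketch of this partitioning, assuming an N-Triples-style dump with one triple per line; the file name and the naive line splitting are simplifications:

```python
from urllib.parse import urlparse

def authority(term):
    """Return the host part of an IRI term like <http://dbpedia.org/...>, else None."""
    if term.startswith("<") and term.endswith(">"):
        return urlparse(term[1:-1]).netloc
    return None  # blank node ("_:b0") or literal

partitions = {}  # domain -> list of N-Triples lines

with open("dump.nt") as f:  # hypothetical input file
    for line in f:
        s, p, o = line.rstrip(" .\n").split(" ", 2)  # naive split; real N-Triples parsing is harder
        s_dom = authority(s)
        if s_dom is None:
            continue  # blank-node subjects are handled via CBD (next comment)
        o_dom = authority(o)
        # Keep the triple if subject and object share a domain, or if the object
        # is a blank node / literal (those stay in the subject's domain).
        if o_dom is None or o_dom == s_dom:
            partitions.setdefault(s_dom, []).append(line)
```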

sshivam95 commented 2 weeks ago

Update: For blank nodes connected to other blank nodes, we have to take care of the CBD (Concise Bounded Description).
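A minimal sketch of following blank-node chains in the spirit of a CBD, using a plain adjacency dict as a stand-in for the graph (the data structure and terms are hypothetical):

```python
def cbd(graph, start, seen=None):
    """Collect (s, p, o) triples reachable from `start` through blank nodes."""
    seen = set() if seen is None else seen
    if start in seen:
        return []  # guard against blank-node cycles
    seen.add(start)
    triples = []
    for p, o in graph.get(start, []):
        triples.append((start, p, o))
        if o.startswith("_:"):  # object is a blank node: recurse into it
            triples.extend(cbd(graph, o, seen))
    return triples

# Toy usage: a resource pointing to a blank node that points to another blank node.
g = {
    "<http://example.org/s>": [("<p>", "_:b0")],
    "_:b0": [("<q>", "_:b1")],
    "_:b1": [("<r>", "<http://example.org/o>")],
}
print(cbd(g, "<http://example.org/s>"))  # all three triples
```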

sshivam95 commented 1 week ago

Update: Domain-specific datasets have been created for the following datasets. Datasets materialized:

Link them with Wikidata using LIMES, and then clean them by removing literals and materializing the blank nodes.
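A minimal sketch of the cleanup step with rdflib (the issue does not name a library, so rdflib is an assumption; the LIMES linking itself is configured separately):

```python
from rdflib import Graph, Literal

g = Graph().parse("domain_partition.nt", format="nt")  # hypothetical input file

# Drop every triple whose object is a literal.
for s, p, o in list(g):
    if isinstance(o, Literal):
        g.remove((s, p, o))

# "Materialize" blank nodes by replacing them with skolem IRIs.
g = g.skolemize()
g.serialize("domain_partition_clean.nt", format="nt")
```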

sshivam95 commented 3 days ago

Update: Discussion with Sherif on minimizing the number of KGs by merging the smaller KGs into bigger ones if the subject exists in the bigger one. A threshold needs to be found for the following datasets. Datasets materialized:
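A minimal sketch of one possible merge criterion (the overlap measure and the threshold value are assumptions; finding the actual threshold is the open task):

```python
def subject_overlap(small_subjects, big_subjects):
    """Fraction of the small KG's subjects that also appear in the big KG."""
    if not small_subjects:
        return 0.0
    return len(small_subjects & big_subjects) / len(small_subjects)

# Toy subject sets; real sets would come from the partitioned dumps.
small = {"http://example.org/a", "http://example.org/b"}
big = {"http://example.org/a", "http://example.org/c", "http://example.org/d"}

THRESHOLD = 0.5  # placeholder value; the actual threshold is to be determined

if subject_overlap(small, big) >= THRESHOLD:
    big |= small  # fold the smaller KG into the bigger one
```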