dice-group / WHALE


Create base training model using a chunksize #9

Open sshivam95 opened 2 weeks ago

sshivam95 commented 2 weeks ago

Using skiprows together with the nrows parameter in pandas.read_csv turned out to be a bad idea.

Pandas implements skiprows in a bafflingly memory-intensive way. With skiprows=12_000_000_000, it basically does skiprows = set(list(range(skiprows))), building a giant list and a set, each containing 12 billion row indices!

sshivam95 commented 2 weeks ago

Use the iterator=True option in pandas.read_csv instead.
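A minimal sketch of chunked reading, assuming a space-separated triples dump; the path, the separator, and the per-chunk training hook train_on_chunk are hypothetical:

```python
import pandas as pd

# Read the dump in bounded-memory chunks instead of skiprows/nrows.
reader = pd.read_csv(
    "dump.nt",                      # hypothetical path to the triples dump
    sep=" ",                        # assumed separator; depends on the dump format
    names=["subject", "predicate", "object"],
    header=None,
    iterator=True,                  # return a TextFileReader instead of one DataFrame
    chunksize=1_000_000,            # rows per chunk; memory stays bounded
)

for chunk in reader:                # each chunk is a DataFrame of up to chunksize rows
    train_on_chunk(chunk)           # hypothetical per-chunk training step
```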

sshivam95 commented 2 weeks ago

Won't do. Create files instead and use these files to train the models.

sshivam95 commented 2 weeks ago

Update: Partition the dataset by domain (the namespace in XML, or the authority part of the base URL). Basically, the domain of the subject and the object should be the same. If the subject is connected to a blank node, the blank node stays in the subject's domain.
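A minimal sketch of this partitioning, assuming an N-Triples-style dump with one triple per line; the file name and the naive line splitting are simplifications:

```python
from urllib.parse import urlparse

def authority(term):
    """Return the host part of an IRI term like <http://dbpedia.org/...>, else None."""
    if term.startswith("<") and term.endswith(">"):
        return urlparse(term[1:-1]).netloc
    return None  # blank node ("_:b0") or literal

partitions = {}  # domain -> list of N-Triples lines

with open("dump.nt") as f:  # hypothetical input file
    for line in f:
        s, p, o = line.rstrip(" .\n").split(" ", 2)  # naive split; real N-Triples parsing is harder
        s_dom = authority(s)
        if s_dom is None:
            continue  # blank-node subjects are handled via CBD (next comment)
        o_dom = authority(o)
        # Keep the triple if subject and object share a domain, or if the object
        # is a blank node / literal (those stay in the subject's domain).
        if o_dom is None or o_dom == s_dom:
            partitions.setdefault(s_dom, []).append(line)
```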

sshivam95 commented 2 weeks ago

Update: For blank nodes connected to other blank nodes, we have to take care of the CBD (Concise Bounded Description).
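A minimal sketch of following blank-node chains in the spirit of a CBD, using a plain adjacency dict as a stand-in for the graph (the data structure and terms are hypothetical):

```python
def cbd(graph, start, seen=None):
    """Collect (s, p, o) triples reachable from `start` through blank nodes."""
    seen = set() if seen is None else seen
    if start in seen:
        return []  # guard against blank-node cycles
    seen.add(start)
    triples = []
    for p, o in graph.get(start, []):
        triples.append((start, p, o))
        if o.startswith("_:"):  # object is a blank node: recurse into it
            triples.extend(cbd(graph, o, seen))
    return triples

# Toy usage: a resource pointing to a blank node that points to another blank node.
g = {
    "<http://example.org/s>": [("<p>", "_:b0")],
    "_:b0": [("<q>", "_:b1")],
    "_:b1": [("<r>", "<http://example.org/o>")],
}
print(cbd(g, "<http://example.org/s>"))  # all three triples
```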

sshivam95 commented 1 week ago

Update: Domain-specific datasets have been created for the following datasets. Datasets materialized:

Link them with Wikidata using LIMES, and then clean them by removing literals and materializing the blank nodes.
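A minimal sketch of the cleanup step with rdflib (the issue does not name a library, so rdflib is an assumption; the LIMES linking itself is configured separately):

```python
from rdflib import Graph, Literal

g = Graph().parse("domain_partition.nt", format="nt")  # hypothetical input file

# Drop every triple whose object is a literal.
for s, p, o in list(g):
    if isinstance(o, Literal):
        g.remove((s, p, o))

# "Materialize" blank nodes by replacing them with skolem IRIs.
g = g.skolemize()
g.serialize("domain_partition_clean.nt", format="nt")
```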

sshivam95 commented 3 days ago

Update: Discussion with Sherif on minimizing the number of KGs by merging the smaller KGs into bigger ones if the subject exists in the bigger one. A threshold needs to be found for the following datasets. Datasets materialized:
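A minimal sketch of one possible merge criterion (the overlap measure and the threshold value are assumptions; finding the actual threshold is the open task):

```python
def subject_overlap(small_subjects, big_subjects):
    """Fraction of the small KG's subjects that also appear in the big KG."""
    if not small_subjects:
        return 0.0
    return len(small_subjects & big_subjects) / len(small_subjects)

# Toy subject sets; real sets would come from the partitioned dumps.
small = {"http://example.org/a", "http://example.org/b"}
big = {"http://example.org/a", "http://example.org/c", "http://example.org/d"}

THRESHOLD = 0.5  # placeholder value; the actual threshold is to be determined

if subject_overlap(small, big) >= THRESHOLD:
    big |= small  # fold the smaller KG into the bigger one
```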