facebookresearch / PyTorch-BigGraph

Generate embeddings from large-scale graph-structured data.
https://torchbiggraph.readthedocs.io/

A pbg helper: efficiently convert big datasets to pbg hdf5 format using spark #137

Open rom1504 opened 4 years ago

rom1504 commented 4 years ago

Hello, I'm not sure if this really fits as an issue, but I wanted to let you know that we (@vinyesm and I) created https://github.com/graph-embeddings/pbg-helper, which makes it possible to efficiently create HDF5 files for PBG from big datasets using Spark. It's mostly composed of two parts:

  1. a Spark job that creates parquet files in edge format (already bucketized)
  2. a Python script that does a very light conversion from this preprocessed parquet to HDF5

I thought it might be of interest to people.

We also included a graph kNN viewer in this repo, which might be of interest for visualization.

lw commented 4 years ago

That's awesome! Thanks for letting us know! I'll make sure to point anyone who may be interested in this your way!

adamlerer commented 4 years ago

Thanks @rom1504, that's great! Can you comment on how (2) compares to the existing parquet importer?

https://github.com/facebookresearch/PyTorch-BigGraph/blob/master/torchbiggraph/converters/import_from_parquet.py

rom1504 commented 4 years ago

Sure. The current parquet importer applies the same logic as the CSV importer and hence has the same limitations: it processes each entity and each edge sequentially, in Python, which is pretty slow. It would take many hours to process the big Freebase dataset (we initially tried that and it was indeed too slow).

What we coded and included in this pbg-helper repo is a Spark version of what the importer does, which is mostly converting string entities and relations to integer IDs and applying the partitioning. Since it uses Spark, it can run these operations in parallel and is quite fast (10 minutes in our case, depending on how many executors are used).
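For illustration only, here is a minimal sketch of that kind of Spark job. This is not the actual pbg-helper code: the input columns `src`, `rel`, `dst`, the global `row_number` ID assignment, and the modulo bucketing are simplifications and assumptions.

```python
# Minimal PySpark sketch: map string entities/relations to integer IDs and
# assign each edge to a (lhs_part, rhs_part) bucket. This is an assumed,
# simplified version of such a job, not pbg-helper's implementation.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

NUM_PARTITIONS = 8  # hypothetical number of entity partitions

spark = SparkSession.builder.appName("edges-to-pbg").getOrCreate()

# Hypothetical input: one row per edge with string columns src, rel, dst.
edges = spark.read.parquet("raw_edges.parquet")

# Entity and relation dictionaries: distinct names -> dense integer IDs.
# A global window forces a single task; a real job would use something
# more scalable (e.g. zipWithIndex) and would also renumber entities
# per partition, since PBG expects within-partition offsets in edge files.
entities = (
    edges.select(F.col("src").alias("name"))
    .union(edges.select(F.col("dst").alias("name")))
    .distinct()
    .withColumn("entity_id", F.row_number().over(Window.orderBy("name")) - 1)
)
relations = (
    edges.select(F.col("rel").alias("rel_name"))
    .distinct()
    .withColumn("rel_id", F.row_number().over(Window.orderBy("rel_name")) - 1)
)

# Replace strings by IDs and derive each edge's bucket from its endpoints.
int_edges = (
    edges.join(entities.withColumnRenamed("name", "src"), on="src")
    .withColumnRenamed("entity_id", "lhs")
    .join(entities.withColumnRenamed("name", "dst"), on="dst")
    .withColumnRenamed("entity_id", "rhs")
    .join(relations.withColumnRenamed("rel_name", "rel"), on="rel")
    .withColumn("lhs_part", F.col("lhs") % NUM_PARTITIONS)
    .withColumn("rhs_part", F.col("rhs") % NUM_PARTITIONS)
)

# One directory per bucket, ready for a light parquet-to-hdf5 conversion.
int_edges.select("lhs", "rhs", "rel_id", "lhs_part", "rhs_part") \
    .write.partitionBy("lhs_part", "rhs_part").parquet("bucketized_edges")
```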

Transforming this into HDF5 in Python is then much faster (around 10 minutes again), because the integer-encoded dataset is much smaller and because there is no processing left to do, only IO. (I looked around, and using Python is basically the only way to write HDF5.)
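That per-bucket conversion could look roughly like the sketch below, assuming pyarrow and h5py. The `lhs`/`rhs`/`rel` dataset names and the `format_version` attribute follow PBG's edge file layout as I understand it, but the paths, column names, and `convert_bucket` helper are hypothetical.

```python
# Sketch of the "light conversion" step: read one bucket of integer edges
# from parquet and dump it to the hdf5 layout PBG reads (datasets lhs/rhs/rel).
# Paths, column names, and the format_version value are assumptions to check
# against torchbiggraph's edgelist reader.
import h5py
import pyarrow.parquet as pq

FORMAT_VERSION = 1  # assumed value of PBG's edge-file format_version attribute

def convert_bucket(parquet_dir: str, out_path: str) -> None:
    # Pure IO: no per-edge Python loop, just columnar reads and array writes.
    table = pq.read_table(parquet_dir, columns=["lhs", "rhs", "rel_id"])
    with h5py.File(out_path, "w") as hf:
        hf.attrs["format_version"] = FORMAT_VERSION
        hf.create_dataset("lhs", data=table.column("lhs").to_numpy())
        hf.create_dataset("rhs", data=table.column("rhs").to_numpy())
        hf.create_dataset("rel", data=table.column("rel_id").to_numpy())

# e.g. one file per bucket, named the way PBG expects:
# convert_bucket("bucketized_edges/lhs_part=0/rhs_part=0", "edges_0_0.h5")
```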

So the benefit of this Spark version shows up when you have a big dataset (billions of edges).

Note that we chose Spark because we have access to a big YARN cluster, but the same technique could probably be applied with other, equivalent distributed computing technologies.