awslabs / graphstorm

Enterprise graph machine learning framework for billion-scale graphs for ML scientists and data scientists.
Apache License 2.0

[GSProcessing] Custom Data Split disorder #976

Open jalencato opened 1 month ago

jalencato commented 1 month ago

If the masks generated for a custom split are stored in multiple files, the final generated graph ends up out of order. As a temporary fix, we can restrict the number of mask files to 1, but we should look for a solution that scales better.

datarsingh007 commented 3 weeks ago

- Concatenate Masks: Combine all masks into a single file before generating the graph.
- Indexing: Use a metadata file to maintain the order of masks stored in separate files.
- Sequential Naming: Save masks with sequential filenames (e.g., mask_001, mask_002) to ensure correct processing order.
- Merge After Parallel Processing: Generate masks in parallel, then reorder them using sequence numbers before merging.
- Central Controller: Implement a centralized system to manage mask generation and order.
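A rough sketch of the "Sequential Naming" plus "Merge After Parallel Processing" idea, in plain Python rather than GSProcessing code. Everything here is an assumption for illustration: the `mask_<N>.txt` naming, the one-value-per-line text format, and the helper names `sequence_number` and `merge_masks` are hypothetical and do not come from GraphStorm. The point it tries to show is that parts should be ordered by their numeric sequence number, since a plain lexicographic sort would put `mask_10` before `mask_2`.

```python
import re
import tempfile
from pathlib import Path


def sequence_number(path: Path) -> int:
    """Return the trailing integer of a file stem, e.g. 10 for mask_10.txt."""
    match = re.search(r"(\d+)$", path.stem)
    if match is None:
        raise ValueError(f"no sequence number in {path.name}")
    return int(match.group(1))


def merge_masks(mask_dir: Path) -> list[int]:
    """Concatenate mask part files in numeric sequence order.

    Hypothetical layout: one 0/1 mask value per line in each mask_<N>.txt.
    """
    parts = sorted(mask_dir.glob("mask_*.txt"), key=sequence_number)
    merged: list[int] = []
    for part in parts:
        merged.extend(int(value) for value in part.read_text().split())
    return merged


def demo() -> list[int]:
    """Write three parts out of order, then merge them back in order."""
    with tempfile.TemporaryDirectory() as tmp:
        mask_dir = Path(tmp)
        # Written in arbitrary order, as parallel writers might do.
        (mask_dir / "mask_10.txt").write_text("0\n0\n")
        (mask_dir / "mask_2.txt").write_text("1\n0\n")
        (mask_dir / "mask_1.txt").write_text("1\n1\n")
        return merge_masks(mask_dir)


if __name__ == "__main__":
    print(demo())
```

This only restores order among the mask files themselves. It assumes that the sequence numbers already match the row order of the node or edge data the masks belong to, which is the part that would still need to be guaranteed on the GSProcessing side.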