NYU-MLDA / OpenABC

OpenABC-D is a large-scale labeled dataset generated by synthesizing open source hardware IPs. This dataset can be used for various graph level prediction problems in chip design.
BSD 3-Clause "New" or "Revised" License

OPENABC2_DATASET contains only step0 files? #16

Open udaymallappa opened 2 months ago

udaymallappa commented 2 months ago

After downloading and unzipping OPENABC2_DATASET.zip (Torch tensor format), I see only step0 files in the processed/ directory. Don't we have AIGs for all 20 steps?
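One way to confirm which steps an archive actually contains, without unzipping it, is to list its members and group the .pt files by step. This is a minimal sketch only: it assumes each file name contains a `step<N>` token, and the helper name `count_pt_files_per_step` and the file names in the demo are made up for illustration, not taken from the dataset.

```python
import io
import re
import zipfile
from collections import Counter

# Assumption: each .pt file name contains a "step<N>" token (hypothetical naming).
STEP_PATTERN = re.compile(r"step(\d+)")


def count_pt_files_per_step(archive):
    """Count the .pt members of a zip archive, grouped by the step index in their names."""
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.endswith(".pt"):
                continue  # skip non-tensor files
            match = STEP_PATTERN.search(name)
            if match:
                counts[int(match.group(1))] += 1
    return counts


# Self-contained demo on an in-memory archive with made-up file names.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("processed/designA_syn0_step0.pt", b"")
    zf.writestr("processed/designA_syn1_step0.pt", b"")
    zf.writestr("processed/README.txt", b"")
demo.seek(0)

demo_counts = count_pt_files_per_step(demo)
print(sorted(demo_counts.items()))
```

For the real archive, passing the path to the downloaded zip file instead of `demo` should work the same way, provided the naming assumption holds.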

animeshbchowdhury commented 2 months ago

Hi @udaymallappa,

Sorry, Zenodo only allowed datasets up to 50 GB at the time I uploaded this one. The original dataset does have the processed AIGs for every step. This dataset was kept short because its purpose is to predict the QoR from the synthesis recipe and the starting AIG only.

If you run into issues downloading and unzipping the original dataset, let me know.

udaymallappa commented 2 months ago

Thanks for your quick response. I am particularly interested in the ML-ready torch dataset; the original dataset has a lot of other information that I do not need. In any case, I was trying to load the .pt files corresponding to step0, but ran into data loading issues caused by the torch version. Also, does "step0" correspond to the unoptimized starting AIG circuits? If so, the step0 AIGs for all 1500 synthesis recipes of a given design would be identical, right?

animeshbchowdhury commented 2 months ago

Yes, you're correct.

Each step0 pt file holds the same original, unoptimized AIG; however, the labels differ, since they capture the post-synthesis results.

udaymallappa commented 2 months ago

Okay. Is there a way to host the 50 GB torch-only dataset with all 20 steps?

animeshbchowdhury commented 2 months ago

Let me see what can be done. Can you explain what your requirement is?

The one already hosted on Zenodo takes up 18 GB. For each step-n AIG, there are 1500 recipes of length (20-n) across all designs, so hosting all the steps with all the recipe labels will be even more difficult.

udaymallappa commented 2 months ago

We are looking to learn representations for AIG circuits. Because all step0 files for a given design correspond to the same circuit, all we have is 29 unique circuits. If you could generate the AIGs for all 20 steps of each circuit, across the 1500 synthesis recipes, that would give us many more AIG circuits for training.

animeshbchowdhury commented 2 months ago

Hi @udaymallappa, the data you're asking for is the processed pt-format dataset of all 870k AIGs. It's part of the original dataset; however, the entire dataset is too large to upload to Zenodo, since even the compressed version needs more than 500 GB.

Let me figure out a way to host the original dataset so that nobody has to download the entire data, only the processed dataset.
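As a quick sanity check on the 870k figure, the counts mentioned in this thread (29 designs, 1500 synthesis recipes per design, 20 steps per recipe) can be multiplied out. The variable names are illustrative only, and the step0 count assumes one step0 sample per design and recipe.

```python
# Counts taken from the discussion above.
NUM_DESIGNS = 29           # unique circuits
RECIPES_PER_DESIGN = 1500  # synthesis recipes per circuit
STEPS_PER_RECIPE = 20      # synthesis steps per recipe

# One intermediate AIG per (design, recipe, step).
total_aigs = NUM_DESIGNS * RECIPES_PER_DESIGN * STEPS_PER_RECIPE

# Assumption: the step0-only dataset has one sample per (design, recipe).
step0_samples = NUM_DESIGNS * RECIPES_PER_DESIGN

print(total_aigs)     # 870000
print(step0_samples)  # 43500
```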

animeshbchowdhury commented 2 months ago

@udaymallappa, give me some time, until the end of this week. I need to coordinate with the NYU IT department to manage this. Their requirement is that the entire dataset be split into even chunks for hosting. That was the main reason the entire dataset was zipped and chunked.

I will try to find a way to group only the processed pt files together, with the rest zipped together and chunked. This will take some time on my side.

Also, I believe the PyG version used to dump the dataset was an older one. For compatibility with newer versions, please follow this thread:

https://github.com/pyg-team/pytorch_geometric/discussions/5528

I plan to migrate the entire dataset to the new PyTorch version, but it will take some time. Thank you for your patience.