getao / icae

The repo for In-context Autoencoder
Creative Commons Zero v1.0 Universal

Pretraining dataset for icae #13

Open Frankstein73 opened 2 months ago

Frankstein73 commented 2 months ago

Hello, thanks for the awesome work you did! Could you please clarify whether ICAE uses the entire Pile dataset during its pretraining phase, or if it only utilizes a subset of it?

getao commented 2 months ago

We used the entire dataset. However, we only trained ICAE on tens of billions of tokens.
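The answer above describes streaming over the full corpus while stopping once a fixed token budget is consumed. A minimal sketch of that pattern is below; `docs`, `tokenize`, and `take_token_budget` are illustrative names, not from the ICAE codebase.

```python
def take_token_budget(docs, tokenize, budget):
    """Yield tokenized documents from a (possibly huge) stream until a
    total token budget is reached. The full corpus can be iterated, but
    training stops early once `budget` tokens have been consumed.

    docs:     iterable of raw text documents (e.g. a streamed corpus)
    tokenize: callable mapping a document to a list of token ids
    budget:   total number of tokens to consume before stopping
    """
    total = 0
    for doc in docs:
        if total >= budget:
            break
        toks = tokenize(doc)
        total += len(toks)
        yield toks
```

With a real dataset this would wrap a streaming loader, so the whole corpus is available but only a bounded prefix of tokens is ever used for training.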