Training dataset - Githubissues

PrithivirajDamodaran / Gramformer

A framework for detecting, highlighting and correcting grammatical errors on natural language text. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

MIT License

1.5k stars, 175 forks

Hi Prithiviraj,

Is there any chance you'd be able to release the training dataset you used to train the Gramformer Hugging Face model? I see that there are some details in the README on the slices of data that you brought together, but it would be useful to be able to use the same data that you used.

The main reason I'm asking is I'd like to create a model that can take correct text and add grammatical errors to it. So I was thinking I could take the dataset you used to train Gramformer and swap the source and target of each pair to train a model that does the inverse. I can go through the data prep process as you did, but it would definitely be easier if I were able to reuse yours, and it might be useful for reproducibility for others as well.
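To make the idea concrete, here is a minimal sketch of the inversion, assuming the data comes as (erroneous, corrected) sentence pairs. The function name `invert_pairs` and the sample sentences are made up for illustration; nothing here reflects the actual format of the Gramformer training data.

```python
from typing import Iterable, List, Tuple

# Assumed layout: (source, target) sentence pairs. Hypothetical, not the
# real Gramformer data format.
Pair = Tuple[str, str]


def invert_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Turn (erroneous, corrected) pairs into (corrected, erroneous) pairs."""
    inverted = []
    for erroneous, corrected in pairs:
        # Pairs with no actual edit teach an error-injection model nothing.
        if erroneous.strip() == corrected.strip():
            continue
        inverted.append((corrected, erroneous))
    return inverted


if __name__ == "__main__":
    # Made-up examples for illustration only.
    sample = [
        ("He go to school.", "He goes to school."),
        ("I like apples.", "I like apples."),
    ]
    for source, target in invert_pairs(sample):
        print(f"{source!r} -> {target!r}")
```

The inverted pairs could then be fed to the same kind of sequence-to-sequence training setup, with the correct sentence as input and the erroneous one as the target.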

Hi Alex - sure, I will be happy to share.

The "data prep-process" you lightly mentioned in passing is compute-heavy and involved the following:

Harvest WikiEdits, convert the WikiText edits to <orig, edit> pairs, and keep only the grammatical pairs. This took ~6 days in a VM. As an approximate ratio, processing 500K edits yields ~75-100K grammatical edits.
The C4 raw data alone is ~800GB, so this needed a VM with disk space >> 800GB. I had to download it and preprocess it to generate pairs. The entire preprocessing and pair generation ran for ~5 days.
Then generate the grammar error distribution to ensure we have enough samples of each error type.
The VM used had 8 vCPUs, 32GB memory, and a 2TB standard persistent disk.
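
As a rough illustration of two of the steps above, the sketch below works out the expected yield from the stated ratio (75-100K grammatical edits per 500K processed is about 15-20%) and tallies error types to spot underrepresented ones. The function names, the error-type labels, and the `min_count` threshold are hypothetical; this is not the actual pipeline code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def estimate_grammatical_yield(
    n_edits: int, low: float = 0.15, high: float = 0.20
) -> Tuple[int, int]:
    """Estimate how many grammatical edits survive filtering.

    The 15-20% default range follows from the ratio quoted above
    (500K edits -> ~75-100K grammatical edits).
    """
    return round(n_edits * low), round(n_edits * high)


def underrepresented_types(
    labels: Iterable[str], min_count: int
) -> Dict[str, int]:
    """Return the error types that have fewer than min_count samples."""
    counts = Counter(labels)
    return {etype: n for etype, n in counts.items() if n < min_count}


if __name__ == "__main__":
    print(estimate_grammatical_yield(500_000))

    # Hypothetical error-type labels, one per <orig, edit> pair.
    labels = ["verb_agreement", "verb_agreement", "article", "preposition"]
    print(underrepresented_types(labels, min_count=2))
```

Any error type reported by `underrepresented_types` would need more pairs harvested before the distribution is balanced enough to train on.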

So, sure, I will be happy to share it for a price, to compensate for the compute and storage costs that I incurred on the cloud VM.

I love the spirit of open source and contributing back to the community, but sorry, I cannot do compute charity.

Note: This isn't an issue. Please keep this space clear for creating issues with the library.

Best, Prithivi

Training dataset #14