PrithivirajDamodaran / Gramformer

A framework for detecting, highlighting and correcting grammatical errors on natural language text. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.
MIT License

Training dataset #14

Closed by d4buss 3 years ago

d4buss commented 3 years ago

Hi Prithiviraj,

Is there any chance you'd be able to release the dataset you used to train the Gramformer Hugging Face model? I see that the README has some details on the slices of data you brought together, but it would be useful to be able to use exactly the same data that you used.

The main reason I'm asking is that I'd like to create a model that can take correct text and add grammatical errors to it. So I was thinking I could take the dataset you used to train Gramformer and swap the input and target of each example to train a model that does the inverse. I can go through the data prep process as you did, but it would definitely be easier if I were able to reuse yours, and it might be useful for reproducibility for others as well.
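To make the idea concrete, here is a minimal sketch of the inversion, assuming the data can be read as (erroneous, corrected) sentence pairs. The pair layout, the `invert_pairs` name and the example sentences are all hypothetical and are not Gramformer's actual data format:

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical layout: (erroneous sentence, corrected sentence)
Pair = Tuple[str, str]


def invert_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Turn correction pairs into corruption pairs by swapping source and target."""
    for erroneous, corrected in pairs:
        # Pairs with no edit carry no signal for an error-generating model, so skip them.
        if erroneous.strip() == corrected.strip():
            continue
        yield corrected, erroneous


# Made-up examples, for illustration only.
correction_pairs = [
    ("He go to school yesterday.", "He went to school yesterday."),
    ("She is happy.", "She is happy."),
]

corruption_pairs = list(invert_pairs(correction_pairs))
```

Each resulting pair has the clean sentence as the input and the erroneous one as the target, which is the direction an error-adding model would be trained on.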

PrithivirajDamodaran commented 3 years ago

Hi Alex - sure, I will be happy to share:

The "data prep-process" you lightly mentioned in passing is compute-heavy and involved the following:

So, sure, I will be happy to share it for a price, to compensate for the compute and storage costs that I incurred on the cloud VM.

I love the spirit of open source and contributing back to the community. But sorry, I cannot do compute charity.

Note: This isn't an issue. Please keep this space clear for issues with the library.

Best, Prithivi