About the pt dataset

nicosouth commented 1 year ago

Hello! I downloaded the pt dataset.

I read through transgpt-pt.txt and found three problems: i) Many sentences are forcibly broken into multiple fragments, and the fragment length does not appear to be fixed. ii) Some special symbols (such as mathematical formulas) appear in a garbled format, without any special handling. iii) Some meaningless number-only strings appear as separate text entries.

I'd like to ask whether this low-quality data will affect model training. Thank you!

DUOMO commented 1 year ago

Hello! Yes, this low-quality data may affect model training. Broken sentences and garbled special symbols can cause the model to learn incorrect patterns and generate inaccurate sentences.

We are working on another round of cleaning and preprocessing before training the model, to ensure better-quality results. Thank you for your valuable suggestions ✨ We welcome your contributions to the clean data construction project.
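
For illustration only, a cleaning pass for two of the reported problems might look like the sketch below. This is not the project's actual pipeline. It assumes that transgpt-pt.txt holds one text fragment per line and that a complete sentence ends with one of a small set of terminators. The names `clean_lines`, `SENTENCE_END`, `PURE_NUMBER` and the `joiner` parameter are all hypothetical. Garbled formulas (problem ii) are not handled here, since they would need format-specific rules.

```python
import re

# Hypothetical heuristics, not the TransGPT project's actual preprocessing.
SENTENCE_END = "。！？.!?"  # assumed sentence terminators (Chinese and English)
# Lines made up only of digits, whitespace, and simple punctuation.
PURE_NUMBER = re.compile(r"[\d\s.,%\-]+")


def clean_lines(lines, joiner=""):
    """Drop number-only lines and rejoin fragments of broken sentences.

    `joiner` is placed between rejoined fragments: "" suits Chinese text,
    " " suits English text.
    """
    cleaned = []
    buffer = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Problem iii: skip fragments that carry no words at all.
        if PURE_NUMBER.fullmatch(line):
            continue
        # Problem i: accumulate fragments until a sentence terminator appears.
        buffer = buffer + joiner + line if buffer else line
        if buffer[-1] in SENTENCE_END:
            cleaned.append(buffer)
            buffer = ""
    # Keep any trailing fragment that never reached a terminator.
    if buffer:
        cleaned.append(buffer)
    return cleaned
```

A heuristic like this will also merge lines that were intentionally separate (titles, list items), so its output would need spot-checking against the raw file.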

nicosouth commented 1 year ago

Thank you! Previously, I thought this low-quality data had been added deliberately to improve the robustness of the model.

DUOMO / TransGPT

About the pt dataset #1