IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] Implement functionality to check for grammar, punctuation, spelling errors in a given text #384

Open Bytes-Explorer opened 2 months ago

Bytes-Explorer commented 2 months ago

Component

Transforms/Other

Feature

Implement a new feature to detect and correct grammar, punctuation, and spelling errors in a given text. This functionality should work on every row of a parquet file, where every row contains one document. The output should include a True/False column indicating whether errors were detected, along with the spans of text where the errors occur and the corrected text.

This can be added as a new transform for text/NLP data. The code quality module can serve as a reference for how filters have been applied to code data.
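To make the requested output shape concrete, here is a minimal sketch of the per-row logic. It is not the project's implementation: the toy regex rules stand in for a real grammar/spelling checker, the column names (`contents`, `has_errors`, `error_spans`, `corrected_text`) are assumptions, and rows are plain dictionaries rather than an actual parquet table.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the original text

# Hypothetical toy rules; a real transform would call a proper checker instead.
_RULES = [
    (r"\bteh\b", "the"),          # common misspelling
    (r"\brecieve\b", "receive"),  # common misspelling
    (r" +(?=[,.!?])", ""),        # stray space before punctuation
    (r"(?<=\S) {2,}", " "),       # repeated spaces between words
]
# One combined pattern so every span refers to offsets in the original text.
_PATTERN = re.compile("|".join(f"(?P<r{i}>{p})" for i, (p, _) in enumerate(_RULES)))


def check_text(text: str) -> Tuple[bool, List[Span], str]:
    """Return (has_errors, error_spans, corrected_text) for one document."""
    spans: List[Span] = []

    def fix(match: "re.Match[str]") -> str:
        spans.append((match.start(), match.end()))
        rule_index = int(match.lastgroup[1:])  # group name "r<i>" -> i
        return _RULES[rule_index][1]

    corrected = _PATTERN.sub(fix, text)
    return bool(spans), spans, corrected


def annotate_rows(rows: List[Dict], column: str = "contents") -> List[Dict]:
    """Add the three output columns to every row (one document per row)."""
    annotated = []
    for row in rows:
        has_errors, spans, corrected = check_text(row[column])
        annotated.append({
            **row,
            "has_errors": has_errors,
            "error_spans": spans,
            "corrected_text": corrected,
        })
    return annotated


if __name__ == "__main__":
    for row in annotate_rows([{"contents": "I recieve teh mail ."}]):
        print(row)
```

In an actual transform, the same three values would presumably be appended as new columns to the table that the framework passes in, with `check_text` replaced by whichever checker is chosen.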

SowmyaLR commented 1 month ago

@Bytes-Explorer I would like to work on this card. It should take the usual two weeks.

Bytes-Explorer commented 1 month ago

Sounds good!

touma-I commented 1 month ago

@SowmyaLR Please don't hesitate to reach out if you run into any issues related to the framework as you build this transform. How familiar are you with the PDF2Parquet transform? It creates a row in a parquet file from the Markdown sections of the document.

SowmyaLR commented 1 month ago

Hi @touma-I, I have done some research on grammar correction and need to start on this transform this week. Thank you for the information about the PDF2Parquet transform. I will check on it and get back to you with any further questions.

SowmyaLR commented 1 month ago

Hi @Bytes-Explorer @touma-I, I have a few questions about this task:

  1. Can the input for this transform contain emojis, tables, and other non-alphabetical characters?
  2. Can the input be in any language (for example, French, Hindi, Tamil)?
  3. Will each doc in the parquet table maintain the structure of the original document? (For example, will each doc have metadata saying that it belongs to paragraph 1, the next paragraph, and so on?)

SowmyaLR commented 1 month ago

@Bytes-Explorer @touma-I any updates on the above questions?

Bytes-Explorer commented 1 month ago

Sorry, I missed this @SowmyaLR

  1. Yes, we can clean for those issues too.
  2. We need the solution to support English at a minimum, but multilingual support would also be nice.
  3. We have two ways of ingesting documents right now, PDF and HTML. You can try out both of them with sample files to get an understanding of what the structure of the parquet file will be.