dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Text Classification: auto-concat multiple text fields #6452

Open torronen opened 1 year ago

torronen commented 1 year ago

Is your feature request related to a problem? Please describe.

At the moment, Text Classification accepts only one text field.

Many of my datasets include multiple columns, each containing some text. My understanding is that dataset creators typically put each part in a separate column so the data can be used in databases, so I believe others run into this too. A typical example might be: Title, Contents, Tags, Author.

I would like to use at least Title and Contents as input for Text Classification. Title might be the most valuable field, but the contents are sometimes very important, too. News articles are one example: sometimes the title is informative, but sometimes it is a "click-bait" title and the contents matter more.

Describe the solution you'd like

A simple solution might be to concatenate all the text fields the user has selected.

Q: If this were implemented, would the order of concatenation matter?

Describe alternatives you've considered

The only other alternative I know of is to re-create the dataset with all the text in one field. I could also ask the data team to add an "AllText" column, but that would almost double the size of the file.
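As a rough illustration of that workaround, here is a minimal Python sketch that builds the combined field on the fly instead of storing it in the file. The helper name `add_combined_text`, the single-space separator, and the sample rows are all assumptions made for this example; none of it is ML.NET API. The fields are joined in the order given, so the caller decides the concatenation order.

```python
import csv
import io


def add_combined_text(rows, fields, target="AllText", separator=" "):
    """Yield a copy of each row with the chosen text fields joined into `target`.

    Hypothetical helper: fields are joined in the order listed, and empty or
    missing fields are skipped so no stray separators appear.
    """
    for row in rows:
        parts = [(row.get(name) or "").strip() for name in fields]
        combined = dict(row)  # keep the original columns untouched
        combined[target] = separator.join(p for p in parts if p)
        yield combined


# Made-up sample data standing in for a Title/Contents/Tags dataset.
sample = io.StringIO(
    "Title,Contents,Tags\n"
    "Budget vote delayed,Parliament postponed the vote to next week.,politics\n"
    "You won't believe this,,misc\n"
)

rows = list(add_combined_text(csv.DictReader(sample), ["Title", "Contents"]))
for row in rows:
    print(row["AllText"])
```

Because the combined column is produced while reading, the source file keeps its original size; only the rows handed to training carry the extra field.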

beccamc commented 1 year ago

@luisquintanilla Is this a framework ask?

luisquintanilla commented 1 year ago

@beccamc It is a framework ask.

michaelgsharp commented 1 year ago

@luisquintanilla @beccamc I think this is more of a Model Builder/AutoML task. In the framework, we can accept 2 columns out of the gate, and if you want the columns concatenated you can add that as a step in the pipeline itself. Only with AutoML/Model Builder do you not have that option if it's not done for you.

Thoughts?

beccamc commented 1 year ago

That feels reasonable to me. Sounds like something we should just create a sample for.