JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

[SPARKNLP-1092] Adding support to read HTML files #14449

Open danilojsl opened 2 weeks ago

danilojsl commented 2 weeks ago

Description

This pull request adds support for reading HTML files and parsing them into a structured Spark DataFrame. The parsed content can then be passed directly to Spark NLP pipelines for downstream natural language processing tasks.
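To illustrate what "structured" could mean here, the sketch below turns an HTML string into rows of (element type, text). It is not the reader added by this pull request: it uses only Python's standard `html.parser`, and the names `ElementCollector`, `parse_html`, and the labels `Title` and `NarrativeText` are assumptions made up for the example.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects (element_type, text) rows from headings and paragraphs.

    The tag-to-label mapping is hypothetical and only for illustration.
    """

    TRACKED = {"h1": "Title", "h2": "Title", "h3": "Title", "p": "NarrativeText"}

    def __init__(self):
        super().__init__()
        self.rows = []        # accumulated (element_type, text) tuples
        self._current = None  # tracked tag currently open, if any
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <b>) is kept as well.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.rows.append((self.TRACKED[tag], text))
            self._current = None


def parse_html(html):
    """Return a list of (element_type, text) rows for the given HTML string."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


sample = "<html><body><h1>Spark NLP</h1><p>Reads <b>HTML</b> files.</p></body></html>"
rows = parse_html(sample)
```

Given an active `SparkSession`, rows of this shape could be loaded with `spark.createDataFrame(rows, ["elementType", "content"])`; the column names are likewise only illustrative.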

Key Changes

Motivation and Context

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Checklist: