JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Improve CoNLL reader to read large datasets #6331

Closed. ethnhll closed this issue 2 years ago

ethnhll commented 3 years ago

Is your feature request related to a problem? Please describe. We are attempting to load a directory of 9000+ CoNLL files totaling 400+ MB using CoNLL().readDataset(spark, dataset); however, this fails with OOM exceptions even on a driver with 24 GB of RAM available. The proposed workaround for loading this dataset all at once for training is the following:

[R]ead the files one by one, and then write as parquet, and then read [all] at once

Describe the solution you'd like I would like to avoid reading individual files into DataFrames, writing those DataFrames to Parquet, and then re-reading the Parquet back into a single DataFrame. This approach feels like a workaround that goes against the spirit of what the CoNLL reader should be able to do. It would be wonderful for the reader to have some extra params, or methods on the CoNLL class, that allow larger CoNLL training sets to be loaded efficiently.
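For reference, a minimal sketch of the quoted workaround (the names convert_and_reload, conll_files, and parquet_dir are hypothetical): each file is parsed on its own, persisted as Parquet, and the Parquet output is read back as one DataFrame at the end.

import os
from sparknlp.training import CoNLL

def convert_and_reload(spark, conll_files, parquet_dir):
    # Parse one CoNLL file at a time so the driver only holds one file's
    # worth of parsed rows, and persist each result as Parquet.
    for path in conll_files:
        df = CoNLL().readDataset(spark, path)
        df.write.mode("overwrite").parquet(
            os.path.join(parquet_dir, os.path.basename(path)))
    # Read every per-file Parquet directory back as a single DataFrame.
    return spark.read.parquet(os.path.join(parquet_dir, "*"))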

maziyarpanahi commented 3 years ago

Thanks @ethnhll for the feature request. We clearly need to optimize the CoNLL() class for larger files and make it more efficient at loading multiple files (if that's already possible).

I will add this to our list to make sure the helpers in the training module can handle large file(s).

(Probably a param to enable on-disk reading: read the files, save a checkpoint on disk, and at the end read them all back.)
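As an illustration of the on-disk checkpoint idea, plain Spark already exposes DataFrame checkpointing, which materializes intermediate results under a checkpoint directory instead of keeping everything in memory. This is only a sketch with stock Spark APIs, not how the CoNLL() reader is implemented; the paths and file name are hypothetical.

from sparknlp.training import CoNLL

spark.sparkContext.setCheckpointDir("/tmp/conll_checkpoints")
df = CoNLL().readDataset(spark, "train_part_01.conll")
df = df.checkpoint(eager=True)  # materializes the rows on disk and truncates the lineage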

ethnhll commented 2 years ago

Is there any path forward for reading in lots of CoNLL text files and converting them to Parquet quickly while a change to the CoNLL reader is being made?

So far I just have something like this:

import os
from sparknlp.training import CoNLL

def conll_to_parquet(spark, filename, conll_parquet):
    # Parse a single CoNLL file and persist it as Parquet under conll_parquet.
    training_data = CoNLL().readDataset(spark, filename)
    output = os.path.join(conll_parquet, os.path.basename(filename))
    training_data.write.mode("overwrite").parquet(output)

# `files` is a DataFrame whose "value" column holds the CoNLL file paths.
for row in files.collect():
    conll_to_parquet(spark, row['value'], conll_parquet)

but this is unbelievably slow, taking multiple hours to convert a few hundred megabytes...
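One way to reduce the wall-clock time of that loop, sketched below under the assumption that the conll_to_parquet helper, files, and conll_parquet from the snippet above are in scope (this is not a Spark NLP feature), is to submit several per-file conversions concurrently from the driver with a thread pool. Submitting jobs from multiple threads against one SparkSession is supported, but driver memory still limits how many files can be parsed at once.

from concurrent.futures import ThreadPoolExecutor

paths = [row["value"] for row in files.collect()]

# Keep max_workers small: every in-flight file is parsed on the driver.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(conll_to_parquet, spark, p, conll_parquet) for p in paths]
    for future in futures:
        future.result()  # re-raise any per-file conversion error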

maziyarpanahi commented 2 years ago

Hi @ethnhll

At the moment this is the only way to load multiple CoNLL files through the CoNLL() class. We are working on optimizing that class to be faster and use less memory for larger files, and also looking into whether we can add the ability to read multiple files from a directory instead of one at a time.

albertoandreottiATgmail commented 2 years ago

@ethnhll, @maziyarpanahi we have a candidate implementation in the healthcare library; we will share it with Ethan soon.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

amir1m commented 2 years ago

@albertoandreottiATgmail

@ethnhll, @maziyarpanahi we have a candidate implementation in the healthcare library; we will share it with Ethan soon.

Could you please share the implementation build/version details on how we can use it?

maziyarpanahi commented 2 years ago

@amir1m it has already been released: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/3.3.3

PR with an example: https://github.com/JohnSnowLabs/spark-nlp/pull/6482

amir1m commented 2 years ago

@amir1m it has already been released: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/3.3.3

PR with an example: #6482

Hi @maziyarpanahi, my query is more about a large CoNLL file (one or many) that results in an OOM exception. I am trying to load an 800 MB CoNLL file with 16 GB of Spark memory from a Jupyter notebook, and it throws an OOM exception.

maziyarpanahi commented 2 years ago

I don't know if that's possible; at some point the memory has to be enough for the file being processed. The changes here were made to support multiple CoNLL files and to speed things up by caching and processing in parallel.

I can only recommend increasing the memory and, at the same time, breaking your large CoNLL file into smaller ones so they can be processed (at this point there is no other way).
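For the suggested split, the cut points have to fall on sentence boundaries (blank lines in CoNLL format) so that each chunk remains a valid file for CoNLL().readDataset. A minimal sketch follows; the function name, chunk size, and output naming are arbitrary, and if your file uses -DOCSTART- markers you may prefer to split only at those.

import os

def split_conll(path, out_dir, sentences_per_chunk=50_000):
    # Split a CoNLL file on blank lines (sentence boundaries) so every chunk
    # is itself a well-formed CoNLL file.
    os.makedirs(out_dir, exist_ok=True)

    def flush(lines, idx):
        out = os.path.join(out_dir, f"part_{idx:04d}.conll")
        with open(out, "w", encoding="utf-8") as dst:
            dst.writelines(lines)

    chunk, sentences, idx = [], 0, 0
    with open(path, encoding="utf-8") as src:
        for line in src:
            chunk.append(line)
            if not line.strip():  # blank line closes a sentence
                sentences += 1
                if sentences >= sentences_per_chunk:
                    flush(chunk, idx)
                    chunk, sentences, idx = [], 0, idx + 1
    if chunk:
        flush(chunk, idx)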