Thanks @ethnhll for the feature request. We clearly need to optimize the CoNLL() class for larger files and make it more efficient at loading multiple files (where that is already possible).
I will add this to our list to make sure the helpers in the training module can handle large files (probably a parameter to enable on-disk reading: save checkpoints to disk and, at the end, read them all back).
Is there any path forward for reading in lots of CoNLL text files and converting to parquet quickly, while a change to the CoNLL reader is made?
so far I just have something like this:

```python
import os
from sparknlp.training import CoNLL

def conll_to_parquet(spark, filename, conll_parquet):
    # Read one CoNLL file and write it out as Parquet, named after the source file
    training_data = CoNLL().readDataset(spark, filename)
    output = os.path.join(conll_parquet, os.path.basename(filename))
    training_data.write.mode("overwrite").parquet(output)

# `files` is a DataFrame of CoNLL file paths, one per row in the `value` column
for row in files.collect():
    conll_to_parquet(spark, row['value'], conll_parquet)
```
but this is unbelievably slow, taking multiple hours for converting a few hundred megabytes...
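For completeness, once the conversion finishes I read everything back into a single DataFrame for training with something like the snippet below (just a sketch; it assumes the `spark` session and the per-file output directories written by `conll_to_parquet` above):

```python
import os

# Read every per-file Parquet directory under conll_parquet back as one DataFrame
training_data = spark.read.parquet(os.path.join(conll_parquet, "*"))
```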
Hi @ethnhll
At the moment this is the only way to use multiple CoNLL files with the CoNLL() class. We are working on optimizing that class to be faster and use less memory on larger files, and on adding the ability to read multiple files from a directory instead of one at a time.
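In other words, for now you read each file separately and combine the results yourself, either via the Parquet round trip above or directly in memory if the data fits. A minimal sketch of the in-memory variant (`conll_files` here is just a placeholder for a Python list of paths, and `spark` is an existing session):

```python
from functools import reduce
from pyspark.sql import DataFrame
from sparknlp.training import CoNLL

# Read each CoNLL file into its own DataFrame, then union them all into one
dfs = [CoNLL().readDataset(spark, path) for path in conll_files]
training_data = reduce(DataFrame.unionByName, dfs)
```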
@ethnhll , @maziyarpanahi we have a candidate implementation in healthcare library, will share with Ethan soon.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days
@albertoandreottiATgmail
> @ethnhll , @maziyarpanahi we have a candidate implementation in healthcare library, will share with Ethan soon.

Could you please share the implementation build/version details on how we can use it?
@amir1m it's already been released: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/3.3.3
PR with an example: https://github.com/JohnSnowLabs/spark-nlp/pull/6482
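If I read the linked PR correctly, from 3.3.3 the reader can be pointed at several files at once. The exact call below is an assumption on my part (a glob over a directory), so please check PR #6482 for the precise usage:

```python
from sparknlp.training import CoNLL

# Assumption: with Spark NLP >= 3.3.3 the path can cover multiple CoNLL files,
# e.g. a glob over a directory; see PR #6482 for the exact supported forms
training_data = CoNLL().readDataset(spark, "path/to/conll_dir/*.conll")
```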
Hi @maziyarpanahi, my query is more about large CoNLL files (one or many) that result in an OOM exception. I am trying to load an ~800 MB CoNLL file with 16 GB of Spark memory from a Jupyter notebook and it's throwing an OOM exception.
I don't know if that's possible; at some point the memory has to be enough for the file being processed. The changes here were to support multiple CoNLL files and to speed things up by caching and processing in parallel.
I can only recommend increasing the memory and, at the same time, breaking your large CoNLL file into smaller ones so they can be processed (at this point there is no other way).
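If it helps, the splitting can be done with a small script like the sketch below. The paths, the chunk size, and the choice to split on blank lines (sentence boundaries) are all assumptions you may want to adjust, e.g. split on -DOCSTART- lines instead to keep documents together:

```python
import os

def split_conll(path, out_dir, sentences_per_chunk=50000):
    """Split a large CoNLL file into smaller files on blank-line (sentence)
    boundaries so each piece can be read with CoNLL().readDataset on its own."""
    os.makedirs(out_dir, exist_ok=True)

    def flush(lines, part):
        out_path = os.path.join(out_dir, f"part_{part:04d}.conll")
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.writelines(lines)

    part, count, lines = 0, 0, []
    with open(path, encoding="utf-8") as src:
        for line in src:
            lines.append(line)
            if line.strip() == "":        # blank line marks the end of a sentence
                count += 1
                if count >= sentences_per_chunk:
                    flush(lines, part)
                    part, count, lines = part + 1, 0, []
    if lines:                             # write whatever is left over
        flush(lines, part)
```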
Is your feature request related to a problem? Please describe.
We are attempting to load a directory of CoNLL files totalling 9000+ files and 400+ MB using CoNLL().readDataset(spark, dataset), however this fails with OOM exceptions even on a driver with 24 GB of RAM available. The proposed workaround for loading this dataset all at once for training is the conll_to_parquet snippet shown in the comments above.

Describe the solution you'd like
I would like to be able to avoid reading individual files into a DataFrame, writing the DataFrames to Parquet, and then re-reading those Parquet files back into a DataFrame. This approach feels like a workaround against the spirit of what the CoNLL reader should be able to do. It would be wonderful for the reader to have some extra params, or methods on the CoNLL class, that allow for efficiently loading larger CoNLL training sets.