aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
https://aphp.github.io/edsnlp/
BSD 3-Clause "New" or "Revised" License

write_parquet filesystem error #298

Open Eliseliuaphp opened 2 months ago

Eliseliuaphp commented 2 months ago

Description

write_parquet cannot connect to HDFS: the call fails with a filesystem error. The problem was solved by passing the filesystem argument to the write_parquet function.

How to reproduce the bug

import edsnlp
import pandas as pd

# Small dummy DataFrame: seven identical rows
df = pd.DataFrame(data=[('a', 1, 'test'),
                        ('a', 1, 'test'),
                        ('a', 1, 'test'),
                        ('a', 1, 'test'),
                        ('a', 1, 'test'),
                        ('a', 1, 'test'),
                        ('a', 1, 'test')],
                  columns=['name', 'id', 'test'])

docs = edsnlp.data.from_pandas(df)
# This call fails with the filesystem error
edsnlp.data.write_parquet(docs, 'hdfs://bbsedsi/user/<PATH_TO_PARQUET>')

What solved the problem

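import fsspec

# Instantiate the HDFS filesystem explicitly through fsspec
hdfs = fsspec.get_filesystem_class("hdfs")()

docs = edsnlp.data.from_pandas(df)
# Pass the filesystem object explicitly instead of relying on the URL alone
edsnlp.data.write_parquet(docs, 'hdfs://user/<PATH_TO_PARQUET>', filesystem=hdfs)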
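Side note, in case it helps whoever looks into this: the two calls in this report do not use the same URL (the first has bbsedsi right after hdfs://, the second does not). The sketch below only shows how Python's standard urlsplit breaks each one into scheme, host and path. describe_hdfs_url is a hypothetical helper written for illustration, and whether edsnlp or fsspec parse the URL the same way is an assumption I have not checked.

```python
from urllib.parse import urlsplit


def describe_hdfs_url(url):
    """Return the scheme, host and path that a generic URL parser sees.

    Hypothetical helper for illustration; not part of edsnlp or fsspec.
    """
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


# The two URLs from this report, with the placeholder kept as-is.
first_url = describe_hdfs_url("hdfs://bbsedsi/user/<PATH_TO_PARQUET>")
second_url = describe_hdfs_url("hdfs://user/<PATH_TO_PARQUET>")

print(first_url)
print(second_url)
```

With a generic parser, the first path component after hdfs:// is read as the host, so the two URLs do not necessarily point to the same location. That may or may not matter for the original error, but it is worth keeping in mind when comparing the two calls.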