etalab-ia / piaf-ml

PIAF v2.0 repo for ML development. The main purpose of this repo is to automatically find the best configuration for a partner organisation's QA pipeline.
MIT License

No more cleaning in JSON text imports #1

Closed psorianom closed 3 years ago

psorianom commented 4 years ago

Here https://github.com/etalab-ia/piaf-ml/blob/c4f8457c8b8e0be7c6daed46ddfca0a058e57c39/src/util/convert_json_files_to_dicts.py#L26 we are no longer cleaning the text. Should we add at least the same cleaning as Haystack's (wiki_clean_text)?

guillim commented 4 years ago

We can; however, it mainly adds or removes line breaks. I don't know what impact it may have. Any opinion on that?

psorianom commented 4 years ago

Indeed. Not really sure, but I believe it's better to have it than not. I will add it!
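
For illustration, the kind of line-break cleanup discussed in this thread could look roughly like the sketch below. This is not Haystack's implementation and not the code in this repo: `clean_text_sketch` is a hypothetical name, and the two rules it applies (stripping whitespace around each line, collapsing long runs of blank lines) are assumptions about what a minimal cleaner might do.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for a wiki-style text cleaner (assumed rules)."""
    # Strip leading and trailing whitespace on every line.
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Collapse runs of three or more newlines into one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop leading and trailing blank lines from the whole document.
    return text.strip()


if __name__ == "__main__":
    raw = "  Titre  \n\n\n\nUn paragraphe.  \n"
    print(repr(clean_text_sketch(raw)))
```

A function like this could be applied to each document's text when the JSON files are converted to dicts; whether it helps or hurts retrieval quality would have to be measured, as noted above.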