biolab / orange3-text

🍊 Text Mining add-on for Orange3

Save the result of "Preprocessing Text" #590

Open PrimozGodec opened 3 years ago

PrimozGodec commented 3 years ago

@Rezabagheriloye reported in https://github.com/biolab/orange3/issues/5035:

I want to save the result of "Preprocessing Text" of the Corpus in CSV or TXT format. I used "Save Data", but the result was the same as the original corpus. Could anyone show me how to save the cleaned data that results from "Preprocessing Text"?

I wanted to suggest pickling the Corpus as a solution and discovered two new issues:

ajdapretnar commented 3 years ago

Should we disable the option to save a corpus to CSV, TAB, ... and allow only .pkl, as is done for sparse data? Users are confused when they save a corpus to CSV and then discover that the preprocessing is not stored together with the corpus.

This would disable saving the downloaded corpus from Twitter, Wikipedia and other similar widgets to csv. Not in favour of removing.

While I agree it is slightly confusing, I think it is common practice (in NLTK for example) to have a separate tokens object. I'd rather give a warning or describe this better in the docs.
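
To illustrate the confusion, here is a minimal sketch using only the Python standard library. It does not use the real Corpus class: the corpus is a plain dict, and `preprocess`, `save_csv` and `load_csv` are hypothetical helpers invented for this example. The assumption is simply that tokens live in a separate attribute next to the raw documents, so a tabular export never sees them, while a pickle keeps the whole object.

```python
import csv
import io
import pickle


def preprocess(corpus):
    # Toy preprocessing (an assumption for this sketch): lowercase and split on whitespace.
    corpus["tokens"] = [doc.lower().split() for doc in corpus["documents"]]
    return corpus


def save_csv(corpus):
    # A tabular export writes only the document column; tokens are not part of the table.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text"])
    for doc in corpus["documents"]:
        writer.writerow([doc])
    return buffer.getvalue()


def load_csv(text):
    # Reading the table back gives the documents only, with no tokens.
    rows = list(csv.reader(io.StringIO(text)))
    return {"documents": [row[0] for row in rows[1:]], "tokens": []}


corpus = preprocess({"documents": ["Hello World", "Text Mining in Orange"], "tokens": []})

from_csv = load_csv(save_csv(corpus))
from_pickle = pickle.loads(pickle.dumps(corpus))

print(from_csv["tokens"])     # tokens are lost in the CSV round trip
print(from_pickle["tokens"])  # tokens survive the pickle round trip
```

If the cleaned text itself is needed in CSV, one option under the same assumptions would be to join each token list into a string and write it as an extra column.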

PrimozGodec commented 3 years ago

Agree with you @ajdapretnar, I would definitely add a warning to the widget.

ajdapretnar commented 2 years ago

An idea: if a Corpus is on the input of Save Data, the widget raises a warning saying "To keep preprocessing, save as pickle (.pkl)". This should be implemented in orange3.
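
A rough sketch of the check such a warning could be based on, in plain Python. This is not the Save Data widget's code: `preprocessing_warning`, its `has_tokens` flag and the `PICKLE_EXTENSIONS` set are all hypothetical names chosen for illustration, and the real widget would need to detect a preprocessed Corpus in its own way.

```python
from pathlib import Path

# Assumed set of extensions that are treated as pickle files in this sketch.
PICKLE_EXTENSIONS = {".pkl", ".pickle"}


def preprocessing_warning(has_tokens, filename):
    """Return a warning message when preprocessed data would be saved
    to a non-pickle file, otherwise None."""
    suffix = Path(filename).suffix.lower()
    if has_tokens and suffix not in PICKLE_EXTENSIONS:
        return "To keep preprocessing, save as pickle (.pkl)."
    return None


print(preprocessing_warning(True, "corpus.csv"))
print(preprocessing_warning(True, "corpus.pkl"))
```

The warning only fires when both conditions hold, so saving a freshly downloaded corpus without preprocessing to CSV stays silent.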