Use CSV as the input and output format of both classifier and lda modeling

dnaaun commented 4 years ago

Right now, classifier.py uses CSV, while lda.py uses XLSX for output.

While we can write code elsewhere (i.e., in the Flask-related files) to handle that, it would make more sense to use a consistent file format for all modeling and handle file type conversion uniformly at the Flask level (as opposed to having different code to process LDA-related files and classifier-related files). This scales better if we (or others) want to add more modeling features in the future, since the Flask code will be more decoupled from how modeling does its thing.

I propose we use only CSV as input and output for all modeling-related code (specifically, Corpus, LDAModeler, ClassificationDataset, and ClassifierModel will have to take in, and spit out, only CSV).
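
A minimal sketch of the "CSV in, CSV out" contract proposed above, using only the standard library. The function name `run_modeling_step`, the `text` input column, and the `prediction` output column are made up for illustration; they are not the actual signatures of Corpus, LDAModeler, ClassificationDataset, or ClassifierModel.

```python
import csv
import io
from typing import Callable, TextIO


def run_modeling_step(
    csv_in: TextIO, csv_out: TextIO, label_fn: Callable[[str], str]
) -> int:
    """Read rows from a CSV stream, append a 'prediction' column, write CSV.

    Hypothetical stand-in for a modeling entry point: it only ever sees CSV,
    so any XLS/XLSX handling has to happen before and after this call.
    Returns the number of data rows processed.
    """
    reader = csv.DictReader(csv_in)
    # Keep the input columns and add one output column (name is an assumption).
    fieldnames = list(reader.fieldnames or []) + ["prediction"]
    writer = csv.DictWriter(csv_out, fieldnames=fieldnames)
    writer.writeheader()

    count = 0
    for row in reader:
        row["prediction"] = label_fn(row["text"])
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    source = io.StringIO("text\nhello\nworld\n")
    result = io.StringIO()
    run_modeling_step(source, result, str.upper)
    print(result.getvalue())
```

Because the function takes file-like objects rather than paths, the Flask side could hand it an open file or an in-memory buffer without the modeling code knowing the difference.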

@lentil-soup, do you think you can take this on (and if so, how soon)? It's self-contained; you won't have to touch Flask-related code.

If you can indeed take this, can I ask that you branch off of your lda-metrics branch? And, before you branch off, can you be sure to delete this piece of (what I believe is) testing code that you were using on your machine to make sure things work?

https://github.com/davidatbu/openFraming/blob/7a2d89a29c4575821f3407702f13fef8255fdc7d/backend/flask_app/modeling/lda.py#L382-L397

monajalal commented 4 years ago

Essentially, you should be able to read the input as CSV, XLS, or XLSX. In the frontend, we only allow the user to upload a file that has a csv, xls, or xlsx extension.

In the backend, you should check which of these three extensions the file has and use the appropriate file opener; once it is in DataFrame (df) form, the original format doesn't really matter.
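
The extension check described above could look roughly like the sketch below. The names `READERS`, `reader_name_for`, and `load_dataframe` are hypothetical and do not come from the openFraming codebase; `pandas.read_csv` and `pandas.read_excel` are the real pandas readers, and pandas is imported lazily so the dispatch logic can be exercised on its own.

```python
from pathlib import Path

# Maps each extension the frontend accepts to the pandas reader that handles it.
READERS = {
    ".csv": "read_csv",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
}


def reader_name_for(filename: str) -> str:
    """Return the name of the pandas reader for this file's extension."""
    extension = Path(filename).suffix.lower()
    if extension not in READERS:
        raise ValueError(f"Unsupported file extension: {extension!r}")
    return READERS[extension]


def load_dataframe(filename: str):
    """Open a CSV/XLS/XLSX file as a DataFrame, whichever format it is."""
    # Lazy import: only needed when a file is actually read.
    # read_excel also needs an Excel engine installed (e.g. openpyxl for .xlsx).
    import pandas as pd

    reader = getattr(pd, reader_name_for(filename))
    return reader(filename)
```

Rejecting unknown extensions with an explicit error keeps the backend check consistent with what the frontend already allows.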

dnaaun commented 4 years ago

Hi @monajalal, this doesn't affect the frontend. This is only about how the modeling part interacts with the backend.

The interaction between the backend and the frontend already works as you described (i.e., it's already implemented like that).

To be clearer: we can look at our codebase as divided into four parts: modeling (code performing classification and topic modeling), scheduling (code that handles spawning workers and queuing background jobs), backend (code that serves API requests), and frontend.

This issue is about the interaction between the modeling and the backend.

dnaaun commented 4 years ago

@lentil-soup, any chance you could take this on? It would really help with ironing out the last few wrinkles of the API.

asmithh commented 4 years ago

If we want this to be human-readable and usable by people who aren't computer scientists, we probably don't want to be outputting .csv files. It would probably be better to accept a variety of inputs and output only Excel files, as those are more human-readable.

dnaaun commented 4 years ago

Hi @lentil-soup, yes, the API specifies that the user can download files (prediction results, LDA results, etc.) in any format they want (we will default to XLSX). In addition, the user can upload files in any format they want.

This is about what we do internally, i.e., how we store and pass around the files. For that, CSV is the best choice because it's lightweight and easy to manipulate. Whatever format the user uploads, the backend will convert it to CSV for internal processing. That is why I asked that all the modeling code take in and output CSV files.

dnaaun / openFraming

Use CSV as the input and output format of both classifier and lda modeling #195