digitalmethodsinitiative / 4cat

The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

Media upload datasource! #419

Closed dale-wahl closed 3 months ago

dale-wahl commented 6 months ago

I created a datasource, with minor edits to the frontend, that allows bulk file uploads. The user chooses the type of files they are uploading: "video", "image", or "audio". I have updated the relevant processors (hopefully all of them) to identify and work on their appropriate datasets.

Open questions:

1) I left room to validate file types. Right now you can upload any file type and even mix media types (though most processors will then fail, since not all of them check file types themselves). We could validate either before archiving the files (frontend) or afterwards (backend), but I would like to be as inclusive as possible, so I am not sure which method is best (I did find this cool package, though).

2) How should we best use is_compatible_with to ensure a DataSet is compatible? Normally we do something like check the DataSet type, which is copied from the processor, but that fails us in this instance (and can be clumsy when a variety of processors produce similar outputs). I wanted to add a media_type attribute to DataSet instead, and ultimately did so via the DataSet's parameters. There were some oddities in behaviour.

The DataSet class has getattr and setattr methods, and I thought to use them. E.g., the image-downloader can do something like self.dataset.media_type = 'image' in the process/get_items method. That goes through setattr, and self.media_type works for that instance, but not when the dataset is instantiated again later. The setattr does add media_type to the parameters, and that is saved to the database. This means later instantiations can use dataset.parameters.get('media_type'). It feels weird to set an attribute and have it not be loaded again later.

I am currently using it this way by adding a get_media_type method to DataSet, which checks self.media_type and, if that does not exist, falls back to self.parameters.get("media_type"). This method could also detect media types in other ways (e.g., infer them from the extension, actually sniff the files, check the Processor that created the DataSet, etc.). Alternatively, I could update the getattr method to check parameters so that dataset.media_type always returns a value, but that could have weird consequences depending on the query parameters provided per Processor. Perhaps this would be a good time to divide parameters into something like "query parameters" and "4cat parameters" (those odd parameters assigned and used by 4CAT that are hidden manually in the frontend).
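To make the fallback order concrete, here is a minimal sketch. It assumes a simplified stand-in class (`DataSetSketch`, with a hypothetical `set_media_type` helper) rather than 4CAT's actual DataSet, and a plain dict in place of the parameters saved to the database; the "text" default mirrors the behaviour described in this thread.

```python
class DataSetSketch:
    """Hypothetical stand-in for 4CAT's DataSet; not the real class."""

    def __init__(self, parameters=None):
        # 'parameters' mimics the dict that is persisted to the database
        self.parameters = dict(parameters or {})

    def set_media_type(self, value):
        # Mimics the behaviour described above: the attribute lives only on
        # this instance, while the parameters entry is what gets persisted
        self.media_type = value
        self.parameters["media_type"] = value

    def get_media_type(self):
        # 1. Attribute set on this instance during processing
        media_type = getattr(self, "media_type", None)
        if media_type:
            return media_type
        # 2. Value persisted in the parameters, 3. otherwise default to text
        return self.parameters.get("media_type", "text")


# A freshly processed dataset knows its media type via the attribute...
original = DataSetSketch()
original.set_media_type("image")

# ...while a later instantiation only has the persisted parameters
reloaded = DataSetSketch(original.parameters)
print(hasattr(reloaded, "media_type"), reloaded.get_media_type())
```

Other detection strategies (extension lookup, file sniffing, checking the creating Processor) could slot in as further steps before the default.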

dale-wahl commented 5 months ago

Merged master into this branch and ran some tests. All seems well.

Essentially, the PR adds a new user input type, OPTION_FILES = "files", handles that type in datasource-options.html, and adds the datasource/import_media.py file. Rather simple. The changes to processors basically add is_compatible_with to processors that did not have it (and thus defaulted to being compatible with any top_dataset) and ensure that processors do not run on this new data type unless they ought to, via dataset.get_media_type() (which defaults to text for everything except the download processors and this new datasource).
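As a rough illustration of the compatibility check, here is a sketch of how a processor might gate on the media type. The class names, the `accepted_media_types` attribute, and the exact `is_compatible_with` signature are assumptions made for this example, not the code in the PR.

```python
class StubDataset:
    """Hypothetical dataset stub exposing only get_media_type()."""

    def __init__(self, media_type="text"):
        self._media_type = media_type

    def get_media_type(self):
        return self._media_type


class ImageProcessorSketch:
    """Hypothetical processor that should only run on image datasets."""

    # Assumed attribute for this sketch: media types the processor accepts
    accepted_media_types = ("image",)

    @classmethod
    def is_compatible_with(cls, module=None):
        # Without a dataset to inspect, report incompatibility
        if module is None:
            return False
        return module.get_media_type() in cls.accepted_media_types


print(ImageProcessorSketch.is_compatible_with(StubDataset("image")))
print(ImageProcessorSketch.is_compatible_with(StubDataset()))
```

Checking the media type rather than the dataset type means several processors with similar outputs (or an upload datasource) can feed the same downstream processor without listing each type by name.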

Should we limit upload types, either by introducing that mime type package or by manually listing extensions? Or, I guess, leave it entirely up to the user?

dale-wahl commented 5 months ago

OK, simplified the user options. I am now using mimetypes.guess_type(filename) to identify media_type (I could not sniff the files in validate_query, as we do not have the full files at that point). Zip archives can also be uploaded (several of them, and alongside ordinary files if desired). I cannot guess the mime types of zipped files, however (same issue: we do not have the full zip archive). We could check after they have been saved/uploaded, but I am not sure of the benefit.
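For reference, filename-based detection with the standard library looks roughly like the sketch below. The helper name `guess_media_type` and the set of accepted top-level types are assumptions for illustration; only `mimetypes.guess_type` itself comes from the comment above.

```python
import mimetypes

# Assumed set of media types relevant to this datasource
MEDIA_TYPES = ("image", "video", "audio")


def guess_media_type(filename):
    """Return 'image', 'video' or 'audio' based on the filename alone,
    or None when the extension is unknown or not a media type."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return None
    # A mime type looks like 'image/jpeg'; keep the part before the slash
    top_level = mime_type.split("/")[0]
    return top_level if top_level in MEDIA_TYPES else None


for name in ("photo.jpg", "clip.mp4", "talk.mp3", "bundle.zip", "README"):
    print(name, guess_media_type(name))
```

Because this only looks at the extension, a zip archive is recognised as an archive but its contents stay unknown until the file has actually been saved and opened.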

dale-wahl commented 4 months ago

Added a "preset" for Audio to Text.