archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: `concurrent_instances` is underused #1356

Open sevein opened 6 years ago

sevein commented 6 years ago

https://github.com/artefactual/archivematica/issues/938 introduces a new architecture where workers process file-level tasks in batches. When an instance of MCPClient is allocated one of these batches, it checks whether the client module wants to be run as multiple processes, so that the batch is processed in smaller chunks. A module opts in to this behaviour by defining a `concurrent_instances` function that returns the number of instances required.
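To make the lookup described above concrete, here is a minimal sketch of how a worker could decide how to chunk a batch. It is not the actual MCPClient code: `plan_batch`, `split_batch`, and the stand-in module objects are hypothetical names used only for illustration, and the assumption is that `concurrent_instances` takes no arguments.

```python
from types import SimpleNamespace


def split_batch(jobs, instances):
    """Split jobs into at most `instances` contiguous, roughly equal chunks."""
    # Never create more chunks than there are jobs, and always at least one.
    instances = max(1, min(instances, len(jobs)))
    size, extra = divmod(len(jobs), instances)
    chunks = []
    start = 0
    for i in range(instances):
        # The first `extra` chunks take one additional job each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(jobs[start:end])
        start = end
    return chunks


def plan_batch(module, jobs):
    """Return the chunks a batch would be split into for `module`.

    Hypothetical helper: a module that does not define a callable
    `concurrent_instances` gets the whole batch as a single chunk.
    """
    jobs = list(jobs)
    get_instances = getattr(module, "concurrent_instances", None)
    if not callable(get_instances) or not jobs:
        return [jobs]
    return split_batch(jobs, get_instances())


# Stand-ins for client modules (assumed shapes, not real clientScripts).
parallel_module = SimpleNamespace(concurrent_instances=lambda: 3)
serial_module = SimpleNamespace()

parallel_plan = plan_batch(parallel_module, range(7))
serial_plan = plan_batch(serial_module, range(7))
```

In this sketch each chunk would then be handed to its own process; how the real worker spawns and supervises those processes is outside what the example covers.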

Currently, this is only used in the following cases:

src/MCPClient/lib/clientScripts/archivematica_clamscan.py
src/MCPClient/lib/clientScripts/characterize_file.py
src/MCPClient/lib/clientScripts/examine_contents.py
src/MCPClient/lib/clientScripts/identify_file_format.py
src/MCPClient/lib/clientScripts/transcribe_file.py
src/MCPClient/lib/clientScripts/validate_file.py

In all of the cases above, the integer returned comes from `multiprocessing.cpu_count()`.
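For reference, the pattern described above would look roughly like this inside a client script. This is a sketch based only on the description in this issue; the zero-argument signature is an assumption, and the scripts listed may differ in detail.

```python
import multiprocessing


def concurrent_instances():
    """Ask for one instance per CPU reported by the host.

    Assumed signature: no arguments, returns an int.
    """
    return multiprocessing.cpu_count()


requested = concurrent_instances()
```

Because the value is taken directly from the CPU count, every script that follows this pattern asks for the same number of instances on a given host, regardless of how heavy its work is.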

We may want to revise this in order to:

sevein commented 6 years ago

In artefactual/archivematica#1255, `validate_file` stops using `concurrent_instances` to address an issue with MediaConch (see https://github.com/archivematica/Issues/issues/44 for more details).