sunu opened this issue 2 years ago
Currently, while uploading a large number of files, or a large archive containing many files, through alephclient or the UI, the overall user experience is not great. Here's a list of potential issues the user might face:

Some ideas on how to improve the experience:
A message like "there were x problems while processing the contents of this collection" could be shown at the collection level. A folder and the problematic file should show similar messages, with an option to expand that message to see the details of the error.

+1 about both the UI and alephclient/API - I think from a journo's perspective, the UI should give a clear indication of state at any given level and of what lies underneath, but alephclient should also provide a way to use that info to re-ingest all or only the failed documents, or to pipe errors to other CLI tools to analyze or further process the info:
alephclient stream-entities --failed -f <foreign_id> | jq '.' ...
or something similar.
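In the meantime, something close is already possible by filtering the stream with jq alone. A minimal sketch, assuming the streamed documents are FollowTheMoney Document entities that carry a processingError property when ingest failed (check the schema of your Aleph version before relying on this):

alephclient stream-entities -f <foreign_id> \
  | jq -c 'select(.properties.processingError != null)'

This prints one JSON line per failed document, which can then be piped into further tooling or used to build a re-ingest list.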
We could also think of a way to manually exclude files or folders when using alephclient crawldir. Similar to a gitignore file?
I have added that issue now @jlstro (https://github.com/alephdata/alephclient/issues/39)
The ability to include and exclude files like rsync would be sweet, either via a switch or via an include/exclude file that accepts some basic regex, like gitignore files or rsync.
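Until such a flag exists, one workaround is to let rsync do the filtering and crawl a staged copy of the tree. A rough sketch (the ignore file, its patterns, and the foreign id are just placeholders, and the trailing slashes matter to rsync):

# aleph-ignore uses rsync's exclude syntax, e.g.:
#   *.tmp
#   .git/
#   node_modules/
rsync -a --exclude-from=aleph-ignore /data/leak/ /tmp/leak-staged/
alephclient crawldir -f my_collection /tmp/leak-staged

The staging copy costs disk space, but it keeps the original tree untouched and gives you an exact record of what was actually sent to Aleph.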
Hi to everyone! Any news on this topic? Sometimes uploading thousands of files to an investigation is painful. Thanks!
Hi, some of my users are complaining that they feel uncomfortable not knowing for sure whether all of their files were uploaded. I feel that until the whole UX is improved, we could at least notify the user about which documents failed to upload. That way the admins could process those files manually and analyze the reason why they failed.
If you like the idea, I can contribute an implementation.
Until the issue is solved, you can be notified whenever there is an error or warning in the ingest docker logs by setting up Loki and Promtail with the json-file Docker logging driver and the following scrape configuration:
- job_name: docker
  docker_sd_configs:
    - host: unix:///var/run/docker.sock
      refresh_interval: 5s
  relabel_configs:
    - source_labels: ['__meta_docker_container_name']
      regex: '/(.*)'
      target_label: 'container'
  pipeline_stages:
    - static_labels:
        job: docker
And the following alert:
groups:
  - name: should_fire
    rules:
      - alert: AlephIngestError
        expr: |
          sum by (container) (count_over_time({job="docker", container="aleph_ingest-file_1"} | json | __error__=`` | severity =~ `WARNING|ERROR`[5m])) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Errors found in the {{ $labels.container }} docker log"
          message: "Error in {{ $labels.container }}: {{ $labels.message }}"
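If you want to sanity-check that expression by hand before wiring up the alert, the same query can be run from the command line. A sketch assuming Grafana's logcli is installed and pointed at your Loki instance via LOKI_ADDR:

logcli query '{job="docker", container="aleph_ingest-file_1"} | json | severity =~ `WARNING|ERROR`'

If this returns the warnings and errors you expect, the alert rule above should fire on the same lines.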
Until the issue is solved, and assuming that you have Loki configured, you can follow these guidelines to resolve some of the ingest errors:
Cannot open image data using Pillow: broken data stream when reading image files: the log trace that carries this message also contains a trace_id field, which identifies the ingestion process. With that trace_id you can find the first log trace with the field logger = "ingestors.manager", which will contain the file path in its message field. It looks something like Ingestor [<E('9972oiwobhwefoiwefjsldkfwefa45cf5cb585dc4f1471','path_to_the_file_to_ingest.pdf')>]
Failed to process: Could not extract PDF file: FileDataError('cannot open broken document'): this log trace has the file path directly in its message field, something like [<E('9972oiwobhwefoiwefjsldkfwefa45cf5cb585dc4f1471','path_to_the_file_to_ingest.pdf')>] Failed to process: Could not extract PDF file: FileDataError('cannot open broken document')
Once you have the files that triggered the errors, the best way to handle them is to delete them from your investigation and ingest them again.
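The lookup can also be scripted against Loki's HTTP API instead of clicking through Grafana. A rough sketch using curl and jq, assuming Loki answers on localhost:3100, the promtail labels from the config above, and Aleph's JSON log format with severity, trace_id, logger and message fields (query_range defaults to the last hour, so pass start/end parameters to widen the range):

LOKI=http://localhost:3100/loki/api/v1/query_range
SEL='{job="docker", container="aleph_ingest-file_1"}'

# 1. Collect the trace_id of every WARNING/ERROR line.
curl -sG "$LOKI" --data-urlencode "query=$SEL | json | severity =~ \"WARNING|ERROR\"" \
  | jq -r '.data.result[].values[][1] | fromjson | .trace_id // empty' | sort -u \
  | while read -r tid; do
      # 2. The earliest ingestors.manager line for that trace_id names the file.
      curl -sG "$LOKI" \
        --data-urlencode "query=$SEL | json | logger = \"ingestors.manager\" | trace_id = \"$tid\"" \
        --data-urlencode direction=forward --data-urlencode limit=1 \
        | jq -r --arg tid "$tid" '.data.result[].values[][1] | fromjson | "\($tid) \(.message)"'
    done

Each output line pairs a trace_id with the Ingestor [...] message containing the file path, which gives you the list of files to delete and re-ingest.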