alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org

Improve the UX for bulk uploading and processing of large numbers of files #2124

Open · sunu opened this issue 2 years ago

sunu commented 2 years ago

Currently, when uploading a large number of files, or a large archive containing many files, through alephclient or the UI, the overall user experience is not great. Here's a list of potential issues the user might face:

sunu commented 2 years ago

Some ideas on how to improve the experience:

brrttwrks commented 2 years ago

+1 on both the UI and alephclient/API. From a journo's perspective, the UI should give a clear indication of state at any given level and of what lies underneath, but alephclient should also provide a way to use that info to re-ingest all (or only the failed) documents, or to pipe errors to other CLI tools for further analysis or processing:

alephclient stream-entities --failed -f <foreign_id> | jq '.' ...

or something similar.
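A `--failed` flag doesn't exist yet, but something close is already possible by streaming a collection and filtering with jq on the Document schema's processingStatus property. A rough sketch, assuming your documents carry that property and that successfully processed ones report the value "success":

# Stream all entities of a collection and keep only documents whose
# processing did not succeed, then store them for later follow-up.
# Treating anything other than "success" as a failure is an assumption;
# adjust the filter to whatever your instance actually reports.
alephclient stream-entities -f <foreign_id> \
  | jq -c 'select(.properties.processingStatus and .properties.processingStatus[0] != "success")' \
  > failed-documents.json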

jlstro commented 2 years ago

We could also think of a way to manually exclude files or folders when using alephclient crawldir? Similar to a gitignore file?

sunu commented 2 years ago

> We could also think of a way to manually exclude files or folders when using alephclient crawldir? Similar to a gitignore file?

I have filed an issue for that now, @jlstro: https://github.com/alephdata/alephclient/issues/39

brrttwrks commented 2 years ago

The ability to include and exclude files like rsync does would be sweet, either via a switch or via an include/exclude file that accepts some basic patterns, like gitignore files or rsync.
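Until crawldir grows native include/exclude support, one workaround is to stage a filtered copy of the source tree with rsync's existing --exclude-from option and point crawldir at the copy. A rough sketch; the ignore file, paths, and foreign id are placeholders:

# aleph-ignore.txt holds one pattern per line, similar to a gitignore.
rsync -a --exclude-from=aleph-ignore.txt /data/source/ /tmp/aleph-staging/

# Ingest only the filtered copy.
alephclient crawldir -f <foreign_id> /tmp/aleph-staging/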

akrymets commented 1 year ago

Hi everyone! Any news on this topic? Sometimes uploading thousands of files to an investigation is painful. Thanks!

lyz-code commented 11 months ago

Hi, some of my users are complaining that they feel uncomfortable not knowing for sure whether all their files were uploaded. I feel that until the whole UX is improved, we could at least notify the user which documents failed to upload. That way the admins could process those files manually and analyze why they failed.

If you like the idea, I can contribute an implementation.
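In the meantime, a crude completeness check is to compare the number of files on disk with the number of file entities that made it into the collection. A rough sketch, assuming the uploaded documents carry a fileName property; note that archives expand into extra entities, so the counts are only indicative:

# Count the files that were supposed to be uploaded...
find /data/source/ -type f | wc -l

# ...and the document entities that actually arrived in the collection.
alephclient stream-entities -f <foreign_id> \
  | jq -c 'select(.properties.fileName)' \
  | wc -l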

lyz-code commented 7 months ago

Until the issue is solved, you can be notified whenever there is an error or warning in the ingest Docker logs. Set up Loki and Promtail (with the Docker json-file logging driver) using the following Promtail scrape configuration:

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
    pipeline_stages:
      - static_labels:
          job: docker

And the following Loki alerting rule:

groups:
  - name: should_fire
    rules:
      - alert: AlephIngestError
        expr: |
          sum by (container) (count_over_time({job="docker", container="aleph_ingest-file_1"} | json | __error__=`` | severity =~ `WARNING|ERROR`[5m])) > 0
        for: 10m
        labels:
            severity: critical
        annotations:
            summary: "Errors found in the {{ $labels.container }} docker log"
            message: "Error in {{ $labels.container }}: {{ $labels.message }}"

lyz-code commented 3 months ago

Until the issue is solved, and assuming you have Loki configured, you can follow the guidelines below to resolve some of the ingest errors:

Once you have the files that triggered the errors, the best way to handle them is to delete them from your investigation and ingest them again.
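To find those files in the first place, you can query Loki directly for the same log lines the alert above fires on. A rough sketch using Grafana's logcli; the container name matches the alert, and it assumes the offending file is mentioned in the ingest-file log messages:

# Pull the last 24 hours of WARNING/ERROR lines from the ingest-file container.
logcli query --since=24h --limit=1000 \
  '{container="aleph_ingest-file_1"} | json | severity=~"WARNING|ERROR"'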