sunu opened this issue 2 years ago
Currently, while uploading a large number of files, or a large archive containing many files, through alephclient or the UI, the overall user experience is not great. Here's a list of potential issues the user might face:

Some ideas on how to improve the experience:
A message like "there were x problems while processing the contents of this collection" could be shown at the collection level. A folder and the problematic file should show similar messages, with an option to expand that message to see the details of the error.

+1 about both the UI and alephclient/API - I think from a journo's perspective, the UI should give a clear indication of state at any given level and of what lies underneath, but alephclient should also provide a way to use that info to re-ingest all or only the failed documents, or to pipe errors to other CLI tools to analyze or further process the info:
alephclient stream-entities --failed -f <foreign_id> | jq '.' ...
or something similar.
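In the meantime, something close is already possible by filtering the stream with jq alone. A minimal sketch, assuming the streamed documents are FollowTheMoney Document entities that carry a processingError property when ingest failed (check the schema of your Aleph version before relying on this):

alephclient stream-entities -f <foreign_id> \
  | jq -c 'select(.properties.processingError != null)'

This prints one JSON line per failed document, which can then be piped into further tooling or used to build a re-ingest list.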
We could also think of a way to manually exclude files or folders when using alephclient crawldir. Similar to a gitignore file?
I have added that issue now @jlstro (https://github.com/alephdata/alephclient/issues/39)
The ability to include and exclude files like rsync would be sweet, either via a switch or via an include/exclude file that accepts some basic regex, like gitignore files or rsync.
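Until such a flag exists, one workaround is to let rsync do the filtering and crawl a staged copy of the tree. A rough sketch (the ignore file, its patterns, and the foreign id are just placeholders, and the trailing slashes matter to rsync):

# aleph-ignore uses rsync's exclude syntax, e.g.:
#   *.tmp
#   .git/
#   node_modules/
rsync -a --exclude-from=aleph-ignore /data/leak/ /tmp/leak-staged/
alephclient crawldir -f my_collection /tmp/leak-staged

The staging copy costs disk space, but it keeps the original tree untouched and gives you an exact record of what was actually sent to Aleph.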
Hi to everyone! Any news on this topic? Sometimes uploading thousands of files to an investigation is painful. Thanks!
Hi, some of my users are complaining that they feel uncomfortable not knowing for sure whether all of their files were uploaded. I feel that until the whole UX is improved, we could at least notify the user about which documents failed to upload. That way the admins could process those files manually and analyze the reason why they failed.
If you like the idea, I can contribute an implementation.
Until the issue is solved, you can be notified whenever there is an error or warning in the ingest docker logs by setting up Loki and Promtail with the json-file Docker logging driver and the following scrape configuration:
- job_name: docker
  docker_sd_configs:
    - host: unix:///var/run/docker.sock
      refresh_interval: 5s
  relabel_configs:
    - source_labels: ['__meta_docker_container_name']
      regex: '/(.*)'
      target_label: 'container'
  pipeline_stages:
    - static_labels:
        job: docker
And the following alert:
groups:
  - name: should_fire
    rules:
      - alert: AlephIngestError
        expr: |
          sum by (container) (count_over_time({job="docker", container="aleph_ingest-file_1"} | json | __error__=`` | severity =~ `WARNING|ERROR`[5m])) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Errors found in the {{ $labels.container }} docker log"
          message: "Error in {{ $labels.container }}: {{ $labels.message }}"
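If you want to sanity-check that expression by hand before wiring up the alert, the same query can be run from the command line. A sketch assuming Grafana's logcli is installed and pointed at your Loki instance via LOKI_ADDR:

logcli query '{job="docker", container="aleph_ingest-file_1"} | json | severity =~ `WARNING|ERROR`'

If this returns the warnings and errors you expect, the alert rule above should fire on the same lines.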
Until the issue is solved, and assuming that you have Loki configured, you can follow these guidelines to resolve some of the ingest errors:
Cannot open image data using Pillow: broken data stream when reading image files: the log trace that carries this message also contains a trace_id field, which identifies the ingestion process. With that trace_id you can find the first log trace with the field logger = "ingestors.manager", which will contain the file path in its message field. It looks something like Ingestor [<E('9972oiwobhwefoiwefjsldkfwefa45cf5cb585dc4f1471','path_to_the_file_to_ingest.pdf')>]
Failed to process: Could not extract PDF file: FileDataError('cannot open broken document'): this log trace has the file path directly in its message field, something like [<E('9972oiwobhwefoiwefjsldkfwefa45cf5cb585dc4f1471','path_to_the_file_to_ingest.pdf')>] Failed to process: Could not extract PDF file: FileDataError('cannot open broken document')
Once you have the files that triggered the errors, the best way to handle them is to delete them from your investigation and ingest them again.
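The lookup can also be scripted against Loki's HTTP API instead of clicking through Grafana. A rough sketch using curl and jq, assuming Loki answers on localhost:3100, the promtail labels from the config above, and Aleph's JSON log format with severity, trace_id, logger and message fields (query_range defaults to the last hour, so pass start/end parameters to widen the range):

LOKI=http://localhost:3100/loki/api/v1/query_range
SEL='{job="docker", container="aleph_ingest-file_1"}'

# 1. Collect the trace_id of every WARNING/ERROR line.
curl -sG "$LOKI" --data-urlencode "query=$SEL | json | severity =~ \"WARNING|ERROR\"" \
  | jq -r '.data.result[].values[][1] | fromjson | .trace_id // empty' | sort -u \
  | while read -r tid; do
      # 2. The earliest ingestors.manager line for that trace_id names the file.
      curl -sG "$LOKI" \
        --data-urlencode "query=$SEL | json | logger = \"ingestors.manager\" | trace_id = \"$tid\"" \
        --data-urlencode direction=forward --data-urlencode limit=1 \
        | jq -r --arg tid "$tid" '.data.result[].values[][1] | fromjson | "\($tid) \(.message)"'
    done

Each output line pairs a trace_id with the Ingestor [...] message containing the file path, which gives you the list of files to delete and re-ingest.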