Programie / Telegram2Elastic

A simple Telegram client writing chat messages to an Elasticsearch instance in realtime
MIT License

What about Media files? #3

Closed benborges closed 1 year ago

benborges commented 1 year ago

This is a very interesting project. I was just wondering whether there is a way to configure the output so that it also downloads media files?

It would be quite an improvement to be able to specify a volume to host the media files and have it all tied to the ES database, perhaps with a Kibana front end to search the content, including media files, outside of Telegram.

Programie commented 1 year ago

Downloading the media files in addition to the messages sounds like a nice feature to implement.

I could implement downloading the files, putting them in a directory, and writing the path to each file into Elasticsearch.
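
A minimal sketch of that idea, not the project's actual code. It assumes Telethon's `Message.download_media(file=...)`, which returns the saved path or `None` when a message has no media. `FakeMessage`, `build_document`, and the `media_path` field name are hypothetical and only serve to keep the example runnable without a Telegram session.

```python
import asyncio
import tempfile
from pathlib import Path


class FakeMessage:
    """Hypothetical stand-in for a Telethon message (only the parts used below)."""

    def __init__(self, message_id, text, payload=None):
        self.id = message_id
        self.text = text
        self._payload = payload  # raw bytes of the attached file, or None

    async def download_media(self, file):
        # Mimics Telethon: return the saved path, or None if there is no media.
        if self._payload is None:
            return None
        target = Path(file) / f"{self.id}.bin"
        target.write_bytes(self._payload)
        return str(target)


async def build_document(message, download_dir):
    """Download any attached media and return a dict that could be indexed."""
    media_path = await message.download_media(file=download_dir)
    # "media_path" is an assumed field name, not taken from the project.
    return {"id": message.id, "text": message.text, "media_path": media_path}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        with_media = asyncio.run(build_document(FakeMessage(1, "photo", b"\x89PNG"), tmp))
        without_media = asyncio.run(build_document(FakeMessage(2, "hello"), tmp))
        # Check existence while the temporary directory is still present.
        existed = Path(with_media["media_path"]).exists()
        return with_media, without_media, existed
```

Storing only the path keeps the Elasticsearch document small, and the files themselves can live on any locally mounted volume.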

Programie commented 1 year ago

I've implemented downloading media files in 0053bee, but it needs some more testing before I release a new version.

benborges commented 1 year ago

Oh, this is great news. I will test it and report back. It would be totally amazing to be able to use a remote volume storage as a destination to the media files to avoid filling the host system.

I'll investigate how I can get this working in my environment!

Programie commented 1 year ago

"It would be totally amazing to be able to use a remote volume storage as a destination to the media files to avoid filling the host system."

I guess you mean something like S3?

Maybe I could implement something to allow configuring different types of media outputs just like the current implementation of output writers.

In addition, I could imagine adding options for better control over media downloads: for example, downloading media only from specific chats, or limiting downloads by maximum size or media/file type (e.g. only images but no videos).

benborges commented 1 year ago

Yes, S3/MinIO or, in my use case, a Hetzner Storage Box that is mounted locally on my host and available under /mnt/volumes/.

Regarding media files, I'm interested in a data-hoarding approach, but I know that Telethon allows specifying which kinds of media files to get (video, images, documents, audio), and this is generally a neat idea that would cover more use cases.

Programie commented 1 year ago

If the volume is mounted locally, you can simply specify the path to it in the config.yml file (media > download_path).

benborges commented 1 year ago

Thanks a lot, I will test and try this out!

benborges commented 1 year ago

BTW, regarding the different types of "multi" media posts on Telegram that I refer to here: it's worth having a look at how Telethon handles this and whether it does so properly for posts with multiple embedded media files. In much of the Python code using Telethon that I have tried, downloading single-media posts worked fine, but the moment there were many media files in the same post, often only the first one would be downloaded. I'm also not sure how to store multiple media files that belong to the same archived post inside ES, but it's worth thinking about. I would be happy to test this out.

Programie commented 1 year ago

I just implemented limiting the file size for media downloads. For example, you can skip files larger than 10 MB or only download photos.

While testing that, I've also tested what happens if there is a message with multiple photos grouped together in a single message. Telethon simply generates a separate message for each photo in that group. So if 3 photos are sent together, Telethon receives those photos as 3 separate messages. Therefore, there hopefully shouldn't be any issue with multiple media files in the same post.
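
On the open question of how to tie those separate messages back to one post: Telethon messages expose a `grouped_id` attribute that is shared by all messages of the same album and is `None` otherwise. The sketch below shows how that value could be used to group media paths. It is only an illustration, not something the project is known to do; `FakeMessage`, `group_albums`, and the `single-<id>` key are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a Telethon message."""
    id: int
    grouped_id: Optional[int]  # same value for every message of one album
    media_path: Optional[str]


def group_albums(messages):
    """Collect media paths per album; ungrouped messages get their own key."""
    albums = defaultdict(list)
    for message in messages:
        if message.grouped_id is not None:
            key = message.grouped_id
        else:
            key = f"single-{message.id}"  # assumed key format for single posts
        albums[key].append(message.media_path)
    return dict(albums)


messages = [
    FakeMessage(1, 500, "a.jpg"),
    FakeMessage(2, 500, "b.jpg"),
    FakeMessage(3, None, "c.jpg"),
]
albums = group_albums(messages)
```

Writing the `grouped_id` into each Elasticsearch document would allow the same grouping to be done at query time instead.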

Programie commented 1 year ago

I've improved the media download a lot over the last few days. For example, there is now a rule-based configuration that lets you define what should be downloaded.

For example, you might configure it to download files matching any image mime type (image/*, e.g. image/jpeg) and limit video downloads to 10 MB per file by using the following configuration (the _re suffix means the mime type is matched with a regular expression instead of an exact string):

media:
  download_path: /path/where/to/put/media-files
  rules:
    - mime_type_re: image/
    - mime_type_re: video/
      max_size: 10M
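
To make the rule semantics concrete, here is a rough sketch of how such rules could be evaluated. It is not the project's implementation and rests on several assumptions: a file is downloaded if any rule matches, `mime_type_re` is matched from the start of the string with `re.match`, a plain `mime_type` key (inferred from the description of the `_re` suffix) means exact comparison, and `10M` means 10 * 1024 * 1024 bytes. `parse_size` and `should_download` are hypothetical helper names.

```python
import re

# Assumption: size suffixes are binary units (K = 1024 bytes, and so on).
SIZE_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value):
    """Turn a value such as "10M" or 2048 into a number of bytes."""
    text = str(value).strip().upper()
    if text and text[-1] in SIZE_UNITS:
        return int(text[:-1]) * SIZE_UNITS[text[-1]]
    return int(text)


def should_download(rules, mime_type, size):
    """Return True if at least one rule accepts the file (assumed semantics)."""
    for rule in rules:
        if "mime_type" in rule and rule["mime_type"] != mime_type:
            continue  # exact mime type required, but it differs
        if "mime_type_re" in rule and not re.match(rule["mime_type_re"], mime_type):
            continue  # regular expression does not match
        if "max_size" in rule and size > parse_size(rule["max_size"]):
            continue  # file is larger than this rule allows
        return True
    return False


# The configuration above, written as the list a YAML parser would produce.
RULES = [
    {"mime_type_re": "image/"},
    {"mime_type_re": "video/", "max_size": "10M"},
]
```

Under these assumptions an image of any size is accepted, a 5 MiB video is accepted, an 11 MiB video is skipped, and an audio file is skipped because no rule mentions it.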