Develop Telegram stream source for Spring Cloud Dataflow streams - 500$ #2

Closed evgenydmitriev closed 3 years ago

evgenydmitriev commented 3 years ago

In GitLab by @penpyt on Jul 19, 2018, 15:25

Develop a Telegram stream source component that can be easily integrated into Spring Cloud Dataflow streams.

Toolset: Java, Spring, Spring Cloud Dataflow, Docker

  1. If you want to lock this issue to make sure no one else is working on it, please comment below and send us your resume at careers@incasec.com. After reviewing your resume, we'll add the "in progress" tag and assign the issue to you. Upon request, we can also create an escrow job on one of the freelancer websites (Upwork, fl.ru, etc). All of this is optional: you can skip this step if you just want to show us the result.
  2. You need to create a separate personal git project and give the issue creator access to your repository for code review.
  3. Upon completion of the project, please add a "release" branch, create a merge request from master to release, assign the issue creator to it, and leave a comment here.
  4. After all our comments on the merge request are resolved, we'll release the payment and move the project into our repository.

Component should

following commands should be provided:

Definition of done

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 13, 2018, 13:31

moved from IncaSec/nterminal/cdc/cdc-grabber#77

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 13, 2018, 13:33

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 13, 2018, 13:49

@evgenydmitriev just as an experiment for the bounty, I created a project on the fl.ru platform. I got 2 strange, irrelevant responses from guys with PHP and MySQL as their main skills (whyyy?), and 2 guys provided me with time, price and code snippets.

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 20, 2018, 18:34

changed title from {-telegram-} to {+Develop telegramstream source for Spring Cloud Dataflow streams - 500$+}

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 20, 2018, 18:34

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 20, 2018, 18:37

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 20, 2018, 18:37

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 20, 2018, 18:43

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 20, 2018, 20:03

changed title from Develop {-telegram-}stream source for Spring Cloud Dataflow streams - 500$ to Develop {+Telegram +}stream source for Spring Cloud Dataflow streams - 500$

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 21, 2018, 15:33

@kemi here is a clickable schema for Telegram API objects. Please help me formulate the expected output data schema for the Telegram source.

Just like in the case of Twitter, describe what you need in the form of a parameter mapping: {sender, text, timestamp, etc}.

You can put it in a comment or update my issue directly.

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 21, 2018, 15:38

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 26, 2018, 12:39

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 26, 2018, 13:17

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 26, 2018, 13:19

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @myrmecophagous on Sep 27, 2018, 19:43

@durm here is the schema for Telegram:

{
    "type": "object",
    "properties": {
        "id": {
            "type": "string",
            "description": "Message ID"
        },
        "date": {
            "type": "string",
            "description": "Message timestamp, ISO date in UTC"
        },
        "source": {
            "type": "string",
            "description": "Predefined string, the same for all messages",
            "default": "Telegram"
        },
        "category": {
            "type": "string",
            "description": "Channel title; we should be able to pass it as parameter from configs"
        },
        "channel_id": {
            "type": "string",
            "description": "Channel id"
        },
        "author": {
            "type": "string",
            "description": "Message sender id"
        },
        "reciever": {
            "type": "string",
            "description": "Message receiver id"
        },
        "content": {
            "type": "string",
            "description": "Message text"
        },
        "related_documents": {
            "type": "array",
            "description": "messageMediaDocument",
            "items": {
                "type": "object",
                "properties": {
                    "content": {
                        "type": "string",
                        "description": "Document text"
                    },
                    "date": {
                        "type": "string",
                        "description": "Document _creation_ date, if available; ISO date in UTC"
                    },
                    "size": {
                        "type": "integer",
                        "description": "File size in bytes"
                    },
                    "file_name": {
                        "type": "string"
                    }
                }
            }
        },
        "media": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "content": {
                        "type": "string",
                        "description": "messageMediaPhoto binary content or messageMediaVideo.thumb"
                    },
                    "description": {
                        "type": "string",
                        "description": "Media caption"
                    },
                    "date": {
                        "type": "string",
                        "description": "Media _creation_ date, if available; ISO date in UTC"
                    },
                    "type": {
                        "type": "string",
                        "description": "Media type: [image|video]"
                    },
                    "size": {
                        "type": "integer",
                        "description": "File size in bytes"
                    }
                }
            }
        }
    },
    "required": [
        "id",
        "date",
        "source",
        "category",
        "author",
        "language",
        "content"
    ]
}

I'm not sure reciever is a relevant piece of information if we're going to listen to groups.

As for the media and document date, it should be the date when the media was created / updated, not the timestamp when it was posted to the chat, since we already have that information; something similar to the timestamps we extract from PDF meta.
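
A minimal sketch of how the scalar fields of this schema could be filled, in plain Java with no JSON library. The class name, method names, and argument list are hypothetical, and it assumes the incoming message carries numeric ids and a Unix timestamp in seconds. It covers only id, date, source, category, channel_id, author and content; related_documents and media are left out.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; not part of any Telegram or Spring library.
class TelegramMessageSketch {
    // "source" is the same predefined string for all messages.
    static final String SOURCE = "Telegram";

    // Maps raw message values to the string fields named in the schema.
    static Map<String, String> toOutput(long messageId, long unixSeconds, String channelTitle,
                                        long channelId, long senderId, String text) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("id", Long.toString(messageId));
        // Instant.toString() yields an ISO-8601 timestamp in UTC.
        out.put("date", Instant.ofEpochSecond(unixSeconds).toString());
        out.put("source", SOURCE);
        out.put("category", channelTitle);
        out.put("channel_id", Long.toString(channelId));
        out.put("author", Long.toString(senderId));
        out.put("content", text);
        return out;
    }

    // Naive serializer for illustration: escapes only backslashes and double quotes.
    // A real implementation would use a JSON library.
    static String toJson(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> quote(e.getKey()) + ": " + quote(e.getValue()))
                .collect(Collectors.joining(", ", "{", "}"));
    }

    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toJson(toOutput(42L, 1538077380L, "Example channel", 1001L, 7L, "hello")));
    }
}
```

The media and document dates are not handled here, since they would have to come from file metadata rather than from the message timestamp.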

evgenydmitriev commented 3 years ago

In GitLab by @durm on Sep 28, 2018, 11:01

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @aturok on Oct 12, 2018, 19:22

As discussed with @durm , I'd like to claim this task.

In regards to Telegram integration there are certain concerns:

  1. Telegram bots are allowed to get messages from a public channel only if the channel admin adds the bot to the subscribers list. As far as I understand, this is not feasible within the scope of the project, so we will have to use the Telegram API/TDLib, which is provided for creating client apps, instead of the Telegram Bot API.
  2. The client API requires logging in as a Telegram user, which means that for our source app to work we will need an active (verified) Telegram user and a way to configure the app accordingly. This in turn may imply some auth-related mess, and it will most certainly require changing the set of configuration parameters for the app.

I am currently investigating what can be done to implement the requested scenario properly - will keep you posted.

In the meantime, I've got a couple questions in regards to the desired outcoming message structure:

  1. The source field: should it be hardcoded, or do we want a configuration parameter for it?
  2. As you have noted, the receiver field will have to be empty in most cases and doesn't seem relevant in the Telegram context (especially if we speak of channels, which are a one-directional means of communication).
  3. The language parameter: do you expect it to be configured on a per-channel basis or to be deduced from the message content? If it's the second option, do we want to build the language-detection mechanism into the tg source app, or is it more reasonable to craft a separate app for this purpose?

evgenydmitriev commented 3 years ago

In GitLab by @durm on Oct 12, 2018, 20:05

@myrmecophagous please, assist

evgenydmitriev commented 3 years ago

In GitLab by @myrmecophagous on Oct 12, 2018, 20:17

@aturok

  1. The source field can perfectly well be hardcoded.
  2. I think an empty receiver is okay; let's keep it in the schema though.
  3. We wouldn't need a language identification service; this parameter should be defined for each channel and default to "en".

evgenydmitriev commented 3 years ago

In GitLab by @durm on Oct 12, 2018, 22:37

@myrmecophagous there are two possible ways:

for me first one looks good.

evgenydmitriev commented 3 years ago

In GitLab by @durm on Oct 12, 2018, 23:42

@myrmecophagous on the other hand, the source logic doesn't need the language at all, so it shouldn't be provided as an input param.

Why would we return a message with a language that we just provided as an input argument? Let's provide it directly to the component that will consume this data.

evgenydmitriev commented 3 years ago

In GitLab by @myrmecophagous on Oct 13, 2018, 01:37

@durm In this case, instead of the language code, we'd need the channel id (along with its name) in the output.
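
A rough sketch of what that could look like on the consuming side: the source emits only channel_id, and the consumer resolves the language itself, falling back to "en" as suggested earlier in the thread. The class name and the map-based configuration are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical consumer-side lookup; the source app would not need to know about languages.
class ChannelLanguageLookup {
    static final String DEFAULT_LANGUAGE = "en";

    private final Map<String, String> languageByChannelId;

    ChannelLanguageLookup(Map<String, String> languageByChannelId) {
        // Defensive copy so later changes to the caller's map do not leak in.
        this.languageByChannelId = new HashMap<>(languageByChannelId);
    }

    // Returns the configured language for a channel, or "en" if none is configured.
    String languageFor(String channelId) {
        return languageByChannelId.getOrDefault(channelId, DEFAULT_LANGUAGE);
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("1001", "ru");
        ChannelLanguageLookup lookup = new ChannelLanguageLookup(config);
        System.out.println(lookup.languageFor("1001")); // configured channel
        System.out.println(lookup.languageFor("2002")); // unknown channel, falls back to the default
    }
}
```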

evgenydmitriev commented 3 years ago

In GitLab by @durm on Oct 13, 2018, 08:57

assigned to @aturok

evgenydmitriev commented 3 years ago

In GitLab by @myrmecophagous on Oct 16, 2018, 11:00

Schema updated.

evgenydmitriev commented 3 years ago

In GitLab by @aturok on Oct 25, 2018, 01:35

@durm @myrmecophagous please find a brief status report below.

I was able to make TDLib, the official client library for the Telegram API, work locally. It allows receiving messages (unlike the Telegram Bot API) and should work for our task. There are three problems with it, though:

  1. It works with Java through JNI, which may cause some problems when wrapping it into a Spring Boot / SCDF application. I wasn't able to find solid confirmation that JNI works OK with Spring apps, but it looks like it does; I will verify.
  2. The native part of the library is architecture-dependent and a bit of a mess build-wise. Most likely we can work around this by wrapping prebuilt binaries in a Docker container.
  3. The most severe problem so far: the Telegram API requires an authorization procedure to be performed before the client can receive any messages. The procedure involves submitting a phone number, getting a confirmation code via SMS or a Telegram message, and submitting the code back to Telegram. To make things worse, the resulting authorization expires, and there are no official statements as to when, which means it can happen at any time. The key problem is that we can't gracefully build the authorization workflow into the normal automated life cycle of an SCDF source application (read parameters, spin up, work, get stopped), because the manually obtained confirmation code has to be provided to the app after it has been started. Do you guys have any ideas/vision as to how we should properly handle this issue?
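
To make the problem in point 3 concrete, the authorization steps can be modeled as a small state machine. The sketch below is plain Java and does not call TDLib; the class, the state names, and the codeSource supplier are hypothetical. It only illustrates that the flow stalls in the "wait for code" state until something outside the app supplies the confirmation code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical model of the login flow; none of these names come from TDLib.
class AuthFlowSketch {
    enum State { WAIT_PHONE_NUMBER, WAIT_CODE, READY }

    private State state = State.WAIT_PHONE_NUMBER;
    private final String phoneNumber;
    // Stands in for whatever external channel delivers the manually obtained code.
    private final Supplier<Optional<String>> codeSource;
    // Records what a real client would have sent to Telegram.
    private final List<String> submitted = new ArrayList<>();

    AuthFlowSketch(String phoneNumber, Supplier<Optional<String>> codeSource) {
        this.phoneNumber = phoneNumber;
        this.codeSource = codeSource;
    }

    State state() { return state; }

    List<String> submitted() { return submitted; }

    // Advances the flow by at most one step; returns true if the state changed.
    boolean step() {
        switch (state) {
            case WAIT_PHONE_NUMBER: {
                submitted.add("phone:" + phoneNumber);
                state = State.WAIT_CODE;
                return true;
            }
            case WAIT_CODE: {
                Optional<String> code = codeSource.get();
                if (!code.isPresent()) {
                    return false; // still blocked: nobody has provided the code yet
                }
                submitted.add("code:" + code.get());
                state = State.READY;
                return true;
            }
            default:
                return false; // READY: nothing left to do
        }
    }

    public static void main(String[] args) {
        AtomicReference<String> code = new AtomicReference<>();
        AuthFlowSketch flow = new AuthFlowSketch("+10000000000", () -> Optional.ofNullable(code.get()));
        flow.step();                                          // phone number submitted
        System.out.println(flow.step() + " " + flow.state()); // no code available yet
        code.set("12345");                                    // code arrives from outside
        System.out.println(flow.step() + " " + flow.state());
    }
}
```

In this model the source cannot emit anything before READY, so whatever design is chosen needs some out-of-band way to feed the code to a running app, and a way to repeat the flow when the authorization expires.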

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Oct 25, 2018, 06:49

@aturok Actually, I do not fully understand why we'd use TDLib instead of the Bot API. Isn't it excessive, and aren't we making this too complicated?

@evgenydmitriev If we still choose TDLib, then I see no unsolvable problems in points 1 and 2.

evgenydmitriev commented 3 years ago

In GitLab by @aturok on Oct 25, 2018, 13:58

@evgenydmitriev not sure about Google Voice, but I believe they should work - will check with the number that you have provided.

@anshlykov you're right, it gets too complicated, but the issue with the Bot API is that bots can be subscribed to Telegram channels only by channel administrators, meaning you would have to ask the owner of every channel you're interested in to add your bot to the channel. And if they refuse, you can't do anything. Should this be acceptable for our solution, I would definitely go with the Bot API, as it's way-way simpler.

evgenydmitriev commented 3 years ago

In GitLab by @aturok on Nov 6, 2018, 12:02

Current experiments are in this repo: https://github.com/aturok/tgsourcecheck/commits/tgsource (took the TDLib repo as a starting point). Next steps: separate from the TDLib source, wrap into the SCDF harness.

This week I will also share a repo with the IRC source app (in the appropriate issue).

evgenydmitriev commented 3 years ago

In GitLab by @evgenydmitriev on Apr 29, 2019, 22:46

closed