Medium Source Integration - $250 #27

Closed evgenydmitriev closed 3 years ago

In GitLab by @ngans20 on Jul 24, 2019, 20:31

Bounty Description

This task is for creating a CDC (Content Delivery Chain) Source that regularly collects, normalizes, and forwards the data to Nakamoto Terminal’s pipeline.
You should start by forking the "medium-task" project that was created specifically for this bounty issue.
Following this guide, you should modify the forked project to include the required functionality described below.
When done, you submit your work by setting a deployment::ready tag in your merge request.

Important Information/resources

This source module will be used to improve agent data collection for Yupana (see internal source-integration issue)
Starting list of Medium accounts and agent tags
Other Agent Sources: reddit-agents, messari, icoholder
Current NLP module documentation
Medium API
Please make sure your scraping method is robust and can withstand slight changes in the page layout.

Functionality

Agent Metadata

Bio (with any linked social media accounts)
Number following/Followed by
Lists of accounts following (we don't need the followers)
Number of Articles
- Number of claps and number of comments if possible
Publications (that the user can edit)
Any Top Writer badges
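
The metadata fields listed above could be collected into a single record along these lines. This is only a minimal sketch: the class name `MediumAgentMetadata` and every field name are hypothetical, chosen for illustration, and are not taken from the medium-task project or from the Medium API.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MediumAgentMetadata:
    """Hypothetical record for the agent metadata requested in this issue."""

    username: str  # assumed identifier, not part of the list above
    bio: str = ""
    social_links: List[str] = field(default_factory=list)  # accounts linked from the bio
    following_count: int = 0
    followers_count: int = 0
    following: List[str] = field(default_factory=list)  # followers list is not needed
    article_count: int = 0
    total_claps: Optional[int] = None  # "if possible", so may stay unset
    total_comments: Optional[int] = None  # "if possible", so may stay unset
    publications: List[str] = field(default_factory=list)  # publications the user can edit
    top_writer_badges: List[str] = field(default_factory=list)


# Made-up example values, for illustration only.
record = MediumAgentMetadata(
    username="example_user",
    bio="Writes about crypto markets.",
    following_count=2,
    following=["alice", "bob"],
)
payload = asdict(record)  # plain dict, ready to be serialized
```

Keeping the optional counts as `None` rather than `0` lets downstream consumers tell "not collected" apart from "zero claps".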

General rules

Anyone can participate in getting this bounty. You do not need our approval to start working or to submit your results.
When you start working on the issue, please comment below to let everyone know that there might be potential competition.
When you are ready to submit your work, leave a comment in this issue with the link to your document and we'll get in touch with you regarding the bounty release or to help you make necessary changes to your submission.
We will pay the bounty as soon as we get a good quality submission that fulfills all of the requirements listed here.
By completing this project you agree to let Inca use any and all work submitted for any internal or external purposes. Inca reserves the right to use or not use any work submitted via this project.
For additional information about the Bounty program, please refer to our wiki page.

Background

Inca often uses "bounty projects" as introductory projects to vet potential employees or interns. These projects give interested individuals a chance to prove themselves, learn a bit about our company & products, and produce a useful result in the process. These projects are extremely independent and will require you to manage your own time and work process.

NTerminal is a data aggregation and analytics platform used for navigating the crypto-financial ecosystem. NTerminal's many data streams can be categorized into three general segments:

Financial data - Trade and order book data from exchanges and aggregation entities (price, size, trade-pair, volumes, etc.)
Natural Language data - Text-based data streams with keyword & sentiment analysis (social media posts, news articles, regulatory meeting minutes, etc.)
Technical data from Blockchains, mining pools, code fuzzing, github repositories, etc.

Resources

Normalization Guidelines
Project to fork and work on
Guide on creating a CDC (Content Delivery Chain) Source
Bounty Program Wiki
Inca Digital Securities website
NTerminal website

Don't hesitate to ask us questions by commenting in this issue or emailing us at bounty@incasec.com.

In GitLab by @ngans20 on Jul 24, 2019, 20:31

changed the description

In GitLab by @ngans20 on Jul 24, 2019, 20:34

changed the description

In GitLab by @anshlykov on Jul 24, 2019, 22:37

@ngans20

Any Top Writer badges

Can you give an example?

NLP events (to go through our nlp module)

Publications (with number of claps and number of comments)

comments

We cannot integrate this data through the new NLP module right now. I think we can do it in another task later.

In GitLab by @ngans20 on Jul 24, 2019, 22:42

Any Top Writer badges

@zfinzi told me to include this. Zach can you share thoughts?

We cannot integrate this data through the new NLP module right now. I think we can do it in another task later.

Okay, sounds good. I'll slim this one down to just Agent metadata then!

In GitLab by @ngans20 on Jul 24, 2019, 22:42

changed the description

In GitLab by @ngans20 on Jul 24, 2019, 22:45

changed the description

In GitLab by @zfinzi on Jul 25, 2019, 14:23

changed the description

In GitLab by @zfinzi on Jul 25, 2019, 14:25

changed the description

In GitLab by @anshlykov on Jul 29, 2019, 10:48

Zach can you share thoughts?

@zfinzi

In GitLab by @zfinzi on Jul 29, 2019, 13:22

@anshlykov @ngans20 sorry for the delayed response. It is a designation by Medium for the most influential writers within a given topic, and it should be found on a Medium account page. Here is a description.

I don't think it is the most important data point, but I see no harm in integrating this data if it can easily be pulled.

In GitLab by @anshlykov on Jul 29, 2019, 19:31

Now everything is clear, thank you. I will send this issue to my buddy.

In GitLab by @anshlykov on Jul 29, 2019, 19:32

changed title from Medium Source Integration to Medium Source Integration{+ - $250+}

In GitLab by @dima.sazhin on Aug 6, 2019, 13:21

@ngans20 @zfinzi @anshlykov

1.

Lists of accounts following & followed by

Do you mean the list of ids?

2.

Publications (that the user can edit)

Are you going to do it in another issue or not? https://gitlab.com/IncaOutsourcing/bounty/issues/33#note_195496593

Do you need the people from a blog, or the blog itself? https://medium.com/bitstamp-blog

In GitLab by @dima.sazhin on Aug 6, 2019, 13:35

assigned to @dima.sazhin

In GitLab by @zfinzi on Aug 6, 2019, 18:12

@dima.sazhin Thank you for claiming this bounty! Here are some answers to your questions:

We're looking for a list of usernames.
For publications, (this is related to the 3rd question) it would be great to know if a given user is an editor or writer for a publication. The comment you referred to is in relation to article posts and comments (that will require a separate module).
We need both the people from the blog and the blog, specifically it would be good to have information on the editors and writers of a blog (or publication).

In GitLab by @evgenydmitriev on Aug 6, 2019, 18:58

To clarify

We're looking for a list of usernames.

Unique IDs that cannot be changed are the priority in any data source; display usernames are useful but less important. Not sure how it works in Medium though.

In GitLab by @dima.sazhin on Aug 8, 2019, 14:20

I will collect ids if you don't mind.

I want to clarify one thing. Are you sure you want to store all relations to other objects in the same object?

In GitLab by @ngans20 on Aug 8, 2019, 15:49

@zfinzi & @evgenydmitriev - you guys know better than me here

In GitLab by @evgenydmitriev on Aug 8, 2019, 15:57

I will collect ids if you don't mind

Yes, IDs are the priority. We still need readable names as an extra field though, especially if the user can change them over time.

Are you sure you want to store all relations to other objects in the same object?

I don't see any other options. Those relationships describe the object and are unique to it. In terms of additional processing and potential splitting into other objects, this will be done by other modules down the CDC pipeline. Feel free to suggest other approaches though.

The main purpose of a source component is to periodically collect and normalize all external information so other modules don't need to talk to the outside world.

In GitLab by @evgenydmitriev on Aug 12, 2019, 18:08

changed the description

In GitLab by @evgenydmitriev on Aug 12, 2019, 18:10

I modified the requirements to make things easier:

Lists of accounts following (we don't need the followers)

Also, not a hard requirement, but I agree with @dima.sazhin that sending agent connections in separate messages is a more scalable way of doing things.
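
Sending agent connections as separate messages could look roughly like the sketch below: one small event per followed account instead of one large list inside the agent event. The helper name `build_relation_events` is hypothetical, and the event layout (a `sourcetype` key plus `content.type`, `content.source.username`, and `content.target.username`) is an assumption modeled on the field names used later in this thread, not the actual schema.

```python
from typing import Any, Dict, Iterable, List


def build_relation_events(
    source_username: str, following: Iterable[str]
) -> List[Dict[str, Any]]:
    """Build one hypothetical relation event per followed account."""
    return [
        {
            "sourcetype": "relation",  # assumed to travel with the event
            "content": {
                "type": "following",
                # The source account follows the target account.
                "source": {"username": source_username},
                "target": {"username": target_username},
            },
        }
        for target_username in following
    ]


# Made-up usernames, for illustration only.
events = build_relation_events("example_user", ["alice", "bob"])
```

Each event is independent, so a long following list can be forwarded in batches rather than as one oversized message.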

In GitLab by @anshlykov on Aug 26, 2019, 18:05

@ngans20 @zfinzi Test run of @dima.sazhin's work. Please check:

https://app.nterminal.com/en-US/app/NterminalApp/search?q=search%20index%3Dagents&display.page.search.mode=smart&dispatch.sample_ratio=1&workload_pool=&earliest=1566831600&latest=1566835200&sid=1566834213.393960

In GitLab by @ngans20 on Aug 26, 2019, 18:48

The sourcetype=agent events look great to me!

I will defer to @zfinzi for sourcetype=relation - but the structure looks fine to me. A couple questions associated with the content.type field:

Is the plan to regularly pull from all monitored accounts with all following relations or only when there is an update?
- In an example where person1 used to follow person2 and just un-followed them: to get the current status, will there be a new event with the value for content.type ="un-followed" or "not following", or will there just stop being events with this "following" relation?
Zach - Are there any other types of relations besides following that could also be used in these events?

In GitLab by @zfinzi on Aug 26, 2019, 22:27

So far the event structure looks good to me.

@dima.sazhin Are you also planning on generating events from blog (publication) profiles? It would be good to know which Medium agents are editors for relevant Medium blogs.

Zach - Are there any other types of relations besides following that could also be used in these events?

In terms of events related to two medium accounts, I can only think of the content.type as following. If we have data on blogs or publications then there could be an additional content.type such as editor or subscriber connecting a publication and a medium account.

In GitLab by @zfinzi on Aug 27, 2019, 04:33

I quickly threw the event schema for this together to standardize across all events storing follower data. It can be found here.

Events of sourcetype=relation can be differentiated by the information_source object.

In GitLab by @zfinzi on Aug 27, 2019, 04:35

@dima.sazhin one quick clarification. In the sourcetype=relation event, the content.source.username is following the content.target.username. Is this correct?

In GitLab by @zfinzi on Aug 27, 2019, 04:59

@dima.sazhin can you restructure your sourcetype=agent events to follow the agent standard listed in Stoplight? You will need to follow the ABM Agent schema.

You do not need to include fields that do not apply to your source

Example: you will not have a revenues field and therefore it does not need to be added.

For fields that do overlap with the schema you will need to rename them according to the ABM Agent structure

Example: authorTags will need to be called tags
Also username can go under aliases
Make sure to add the field agent_type with a uniform value of person. If you are adding blogs/publications, they will also need an agent_type; setting it to organisation is fine for now.

For all fields that are specific to Medium I have created a specific object called ABM Medium Account which will be passed into the agent_specific_attributes field.

Important note: the field you have as id needs to go under medium_id
Also it is fine to have username stored in aliases and under agent_specific_attributes.username

Sorry for not providing this formatting sooner, let me know if you have any questions.
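
The renames requested above can be sketched as a small mapping function. This is only an illustration under stated assumptions: the function name `to_abm_agent` and the shape of the raw input record (`id`, `username`, `authorTags`) are hypothetical, and the real ABM Agent and ABM Medium Account schemas may contain more fields than shown here.

```python
from typing import Any, Dict


def to_abm_agent(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a hypothetical raw Medium record onto the renames described above."""
    username = raw.get("username")
    return {
        "agent_type": "person",  # uniform value for personal accounts
        "tags": raw.get("authorTags", []),  # authorTags is renamed to tags
        "aliases": [username] if username else [],  # username goes under aliases
        "agent_specific_attributes": {
            "medium_id": raw.get("id"),  # the former id field
            "username": username,  # also kept with the Medium-specific fields
        },
    }


# Made-up input record, for illustration only.
raw_record = {"id": "abc123", "username": "example_user", "authorTags": ["bitcoin"]}
agent_event = to_abm_agent(raw_record)
```

Fields that do not apply to Medium, such as revenues, are simply left out of the result.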

In GitLab by @anshlykov on Aug 30, 2019, 16:41

At the moment we pull all following relations. If you need changes, we can implement them in the future, just describe what you need somewhere. For example, create a discussion somewhere in Yupana.
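
Because every run pulls the full set of following relations, un-follows could in principle be derived downstream by comparing two consecutive pulls. The sketch below shows that idea only; it is not something the source currently does, and the function name `diff_following` is hypothetical.

```python
from typing import Set, Tuple


def diff_following(previous: Set[str], current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (newly_followed, unfollowed) between two full pulls."""
    newly_followed = current - previous
    unfollowed = previous - current
    return newly_followed, unfollowed


# Made-up snapshots of one account's following list, for illustration only.
earlier_pull = {"alice", "bob"}
later_pull = {"alice", "carol"}
newly_followed, unfollowed = diff_following(earlier_pull, later_pull)
```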

In GitLab by @anshlykov on Aug 30, 2019, 16:42

Correct.

In GitLab by @anshlykov on Aug 30, 2019, 16:59

@zfinzi

Why are you creating an additional level of data nesting? I think the object agent_specific_attributes only complicates the work with the data.
Why rename the id to medium_id?

https://gitlab.com/IncaSec/nterminal/cdc/sources/github-source/merge_requests/30#note_210165294

You can create id as you have done so in the past, the concatenation of information_source.name and full_name.

If you want the id field to be calculated this way, we will have a few problems.

People can change their full name on their profile
Two different accounts can have the same full name
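
Both problems are easy to reproduce with a name-based id. In this sketch the helper `name_based_id` and the `:` separator are assumptions made for illustration; only the idea of concatenating information_source.name and full_name comes from the quoted comment.

```python
def name_based_id(source_name: str, full_name: str) -> str:
    """Build an id by concatenating the source name and the profile's full name."""
    return f"{source_name}:{full_name}"


# Made-up names, for illustration only.
id_a = name_based_id("medium", "Alex Smith")
id_b = name_based_id("medium", "Alex Smith")  # a different account, same full name
id_a_renamed = name_based_id("medium", "Alexander Smith")  # first account after a rename

print(id_a == id_b)          # True: two accounts collide on one id
print(id_a == id_a_renamed)  # False: one account gets a new id after renaming
```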

In GitLab by @zfinzi on Aug 30, 2019, 17:13

Why are you creating an additional level of data nesting? I think the object agent_specific_attributes only complicates the work with the data.

To apply a standard for agent data. It makes it far easier for documenting and comparing events across sources.

Why rename the id to medium_id?

id and medium_id are different but could have the same names if medium_id is nested in agent_specific_attributes.

They are different values as well, id is internally generated while medium_id is created by medium. Both fields should be in the event.

If you want the id field to be calculated this way, we will have a few problems.

I agree that this is a bad method for doing this. I referred to previous documentation on this one to give Yulia clarity because I had not thought up an alternative yet.

In GitLab by @anshlykov on Aug 30, 2019, 17:47

You use data composition, and there is a second method - data inheritance. The documentation in swagger supports both methods. I don't fully know how you're going to use the data, but at this point, I'd rather inherit data. Maybe I'm wrong.

I propose these changes:

create a separate sourcetype in Splunk for each source (agent-medium, agent-twitter, agent-whatever ... agent-yupana). All of these types are based on one common type.
id is the user ID from the social media that we monitor.
in agent-yupana you will aggregate data from different sources as you want and will assign them an id unique to our system

In GitLab by @zfinzi on Aug 30, 2019, 18:11

I agree with this breakdown, and this was the original idea. One agent event would only contain a single agent_specific_attribute, such as medium, twitter, github.

Common or inherited fields are all that exist outside of the agent_specific_attributes object. These include industry, tags, full_name, etc.

Then these can be brought together using the inherited values or aggregators such as Messari & ICOholder to create a unified agent-yupana event with all unique child fields.

If you think there is a better way to format this on Stoplight, I will rework the event structures.

In GitLab by @anshlykov on Jan 7, 2020, 16:10

closed
