1712n / challenge

Challenge Program

Medium Source Integration - $250 #27

Closed evgenydmitriev closed 3 years ago

evgenydmitriev commented 3 years ago

In GitLab by @ngans20 on Jul 24, 2019, 20:31

Bounty Description

Important Information/resources

Functionality

Agent Metadata

General rules

Background

Inca often uses "bounty projects" as introductory projects to vet potential employees or interns. These projects give interested individuals a chance to prove themselves, learn a bit about our company & products, and produce a useful result in the process. These projects are extremely independent and will require you to manage your own time and work process.

NTerminal is a data aggregation and analytics platform used for navigating the crypto-financial ecosystem. NTerminal's many data streams can be categorized into three general segments:

  1. Financial data - Trade and order book data from exchanges and aggregation entities (price, size, trade-pair, volumes, etc.)
  2. Natural Language data - Text-based data streams with keyword & sentiment analysis (social media posts, news articles, regulatory meeting minutes, etc.)
  3. Technical data from Blockchains, mining pools, code fuzzing, github repositories, etc.

Resources

Don't hesitate to ask us questions by commenting in this issue or emailing us at bounty@incasec.com.

evgenydmitriev commented 3 years ago

In GitLab by @ngans20 on Jul 24, 2019, 20:31

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @ngans20 on Jul 24, 2019, 20:34

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Jul 24, 2019, 22:37

@ngans20

  • Any Top Writer badges

Can you give an example?

NLP events (to go through our nlp module)

  • Publications (with number of claps and number of comments)
  • comments

We cannot integrate this data through the new NLP module right now. I think we can do it in another task later.

evgenydmitriev commented 3 years ago

In GitLab by @ngans20 on Jul 24, 2019, 22:42

  • Any Top Writer badges

@zfinzi told me to include this. Zach can you share thoughts?

We cannot integrate this data through the new NLP module right now. I think we can do it in another task later.

Okay, sounds good. I'll slim this one down to just Agent metadata then!

evgenydmitriev commented 3 years ago

In GitLab by @ngans20 on Jul 24, 2019, 22:42

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @ngans20 on Jul 24, 2019, 22:45

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Jul 25, 2019, 14:23

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Jul 25, 2019, 14:25

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Jul 29, 2019, 10:48

Zach can you share thoughts?

@zfinzi

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Jul 29, 2019, 13:22

@anshlykov @ngans20 sorry for the delayed response. It is a designation by Medium for the most influential writers within a given topic, and it should be found on a Medium account page. Here is a description.

I don't think it is the most important data point, but I see no harm in integrating this data if it can easily be pulled.

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Jul 29, 2019, 19:31

Now everything is clear, thank you. I will send this issue to my buddy.

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Jul 29, 2019, 19:32

changed title from Medium Source Integration to Medium Source Integration - $250

evgenydmitriev commented 3 years ago

In GitLab by @dima.sazhin on Aug 6, 2019, 13:21

@ngans20 @zfinzi @anshlykov

1.

Lists of accounts following & followed by

Do you mean the list of ids?

2.

Publications (that the user can edit)

Are you going to do this in another issue or not? https://gitlab.com/IncaOutsourcing/bounty/issues/33#note_195496593

3.

Do you need the people from a blog, or the blog itself? https://medium.com/bitstamp-blog

evgenydmitriev commented 3 years ago

In GitLab by @dima.sazhin on Aug 6, 2019, 13:35

assigned to @dima.sazhin

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Aug 6, 2019, 18:12

@dima.sazhin Thank you for claiming this bounty! Here are some answers to your questions:

  1. We're looking for a list of usernames.

  2. For publications, (this is related to the 3rd question) it would be great to know if a given user is an editor or writer for a publication. The comment you referred to is in relation to article posts and comments (that will require a separate module).

  3. We need both the people from the blog and the blog, specifically it would be good to have information on the editors and writers of a blog (or publication).

evgenydmitriev commented 3 years ago

In GitLab by @evgenydmitriev on Aug 6, 2019, 18:58

To clarify

We're looking for a list of usernames.

Unique IDs that cannot be changed are the priority in any data source; display usernames are useful but less important. I'm not sure how this works in Medium, though.

evgenydmitriev commented 3 years ago

In GitLab by @dima.sazhin on Aug 8, 2019, 14:20

I will collect ids if you don't mind.

I want to clarify one thing. Are you sure you want to store all relations to other objects in the same object?

evgenydmitriev commented 3 years ago

In GitLab by @ngans20 on Aug 8, 2019, 15:49

@zfinzi & @evgenydmitriev - you guys know better than me here

evgenydmitriev commented 3 years ago

In GitLab by @evgenydmitriev on Aug 8, 2019, 15:57

I will collect ids if you don't mind

Yes, IDs are the priority. We still need readable names as an extra field though, especially if the user can change them over time.

Are you sure you want to store all relations to other objects in the same object?

I don't see any other options. Those relationships describe the object and are unique to it. In terms of additional processing and potential splitting into other objects, this will be done by other modules down the CDC pipeline. Feel free to suggest other approaches though.

The main purpose of a source component is to periodically collect and normalize all external information so other modules don't need to talk to the outside world.

evgenydmitriev commented 3 years ago

In GitLab by @evgenydmitriev on Aug 12, 2019, 18:08

changed the description

evgenydmitriev commented 3 years ago

In GitLab by @evgenydmitriev on Aug 12, 2019, 18:10

I modified the requirements to make things easier:

Lists of accounts following (we don't need the followers)

Also, not a hard requirement, but I agree with @dima.sazhin that sending agent connections in separate messages is a more scalable way of doing things.
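A minimal sketch of what sending agent connections in separate messages could look like. The field names (`id`, `username`, `following`) and the message layout are assumptions for illustration, not the actual NTerminal schema.

```python
from typing import Any, Dict, Iterator, List


def split_agent(raw: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one agent message, then one relation message per followed account.

    `raw` is a hypothetical normalized Medium account record; every key
    used here is an illustrative assumption.
    """
    following: List[str] = raw.get("following", [])
    # The agent message carries everything except the relation list.
    agent = {key: value for key, value in raw.items() if key != "following"}
    yield {"sourcetype": "agent", "content": agent}
    # Each connection becomes its own small message.
    for target_id in following:
        yield {
            "sourcetype": "relation",
            "content": {"type": "following", "source": raw["id"], "target": target_id},
        }


messages = list(
    split_agent({"id": "u1", "username": "alice", "following": ["u2", "u3"]})
)
```

With this shape the agent message stays the same size no matter how many accounts are followed, and relation messages can be batched or retried independently.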

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Aug 26, 2019, 18:05

@ngans20 @zfinzi Here is a test run of @dima.sazhin's work. Please check:

https://app.nterminal.com/en-US/app/NterminalApp/search?q=search%20index%3Dagents&display.page.search.mode=smart&dispatch.sample_ratio=1&workload_pool=&earliest=1566831600&latest=1566835200&sid=1566834213.393960

evgenydmitriev commented 3 years ago

In GitLab by @ngans20 on Aug 26, 2019, 18:48

The sourcetype=agent events look great to me!

I will defer to @zfinzi for sourcetype=relation - but the structure looks fine to me. A couple questions associated with the content.type field:

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Aug 26, 2019, 22:27

So far the event structure looks good to me.

@dima.sazhin Are you also planning on generating events from blog (publication) profiles? It would be good to know which Medium agents are editors for relevant Medium blogs.

Zach - Are there any other types of relations besides following that could also be used in these events?

In terms of events related to two medium accounts, I can only think of the content.type as following. If we have data on blogs or publications then there could be an additional content.type such as editor or subscriber connecting a publication and a medium account.

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Aug 27, 2019, 04:33

I quickly threw together an event schema for this to standardize all events storing follower data. It can be found here.

Events of sourcetype=relation can be differentiated by the information_source object.

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Aug 27, 2019, 04:35

@dima.sazhin one quick clarification. In the sourcetype=relation event, the content.source.username is following the content.target.username. Is this correct?
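If that reading is right, a relation event could be built as in the sketch below. Only `content.type`, `content.source.username`, `content.target.username`, and `information_source` come from this thread; the `Relation` class, the `name` key, and the set of allowed types are assumptions.

```python
from dataclasses import dataclass

# "following" is the only type confirmed in this thread; "editor" and
# "subscriber" were only floated as possibilities for publications.
ALLOWED_TYPES = {"following", "editor", "subscriber"}


@dataclass(frozen=True)
class Relation:
    relation_type: str
    source_username: str  # the account that follows
    target_username: str  # the account being followed

    def to_event(self) -> dict:
        if self.relation_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown relation type: {self.relation_type}")
        return {
            "sourcetype": "relation",
            # Hypothetical layout of the information_source object.
            "information_source": {"name": "medium"},
            "content": {
                "type": self.relation_type,
                "source": {"username": self.source_username},
                "target": {"username": self.target_username},
            },
        }


event = Relation("following", "alice", "bob").to_event()
```

Here `alice` follows `bob`: the source is always the account performing the action.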

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Aug 27, 2019, 04:59

@dima.sazhin can you restructure your sourcetype=agent events to follow the agent standard listed in Stoplight? You will need to follow the ABM Agent schema.

You do not need to include fields that do not apply to your source.

For fields that do overlap with the schema, you will need to rename them according to the ABM Agent structure.

For all fields that are specific to Medium, I have created a specific object called ABM Medium Account, which will be passed into the agent_specific_attributes field.
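A rough sketch of that restructuring follows. The rename map and the list of Medium-only fields are hypothetical, since the real ABM Agent and ABM Medium Account definitions live in Stoplight and are not reproduced in this thread.

```python
from typing import Any, Dict

# Hypothetical mapping from source field names to shared agent field names.
RENAMES = {"name": "full_name", "bio": "description"}
# Hypothetical Medium-only fields, nested under agent_specific_attributes.
MEDIUM_ONLY = {"medium_id", "top_writer_in", "following_count"}


def to_abm_agent(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Reshape a raw Medium record into a common-plus-specific agent event."""
    event: Dict[str, Any] = {}
    specific: Dict[str, Any] = {}
    for key, value in raw.items():
        if value is None:
            continue  # fields that do not apply to the source are left out
        if key in MEDIUM_ONLY:
            specific[key] = value
        else:
            # Overlapping fields take the shared name when one is defined.
            event[RENAMES.get(key, key)] = value
    if specific:
        event["agent_specific_attributes"] = specific
    return event


agent_event = to_abm_agent(
    {"name": "Alice", "bio": None, "medium_id": "abc123", "tags": ["btc"]}
)
```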

Sorry for not providing this formatting sooner, let me know if you have any questions.

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Aug 30, 2019, 16:41

At the moment we pull all following relations. If you need changes, we can implement them in the future; just describe what you need somewhere, for example by creating a discussion in Yupana.

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Aug 30, 2019, 16:42

Correct.

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Aug 30, 2019, 16:59

@zfinzi

  1. Why are you creating an additional level of data nesting? I think the object agent_specific_attributes only complicates the work with the data.

  2. Why rename the id to medium_id?

https://gitlab.com/IncaSec/nterminal/cdc/sources/github-source/merge_requests/30#note_210165294

You can create id as you have done so in the past, the concatenation of information_source.name and full_name.

If you want the id field to be calculated this way, we will have a few problems.

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Aug 30, 2019, 17:13

Why are you creating an additional level of data nesting? I think the object agent_specific_attributes only complicates the work with the data.

To apply a standard for agent data. It makes it far easier for documenting and comparing events across sources.

Why rename the id to medium_id?

id and medium_id are different but could have the same names if medium_id is nested in agent_specific_attributes.

They are different values as well: id is internally generated, while medium_id is created by Medium. Both fields should be in the event.

If you want the id field to be calculated this way, we will have a few problems.

I agree that this is a bad way to do it. I referred to the previous documentation here to give Yulia clarity because I had not thought of an alternative yet.

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Aug 30, 2019, 17:47

You use data composition, and there is a second method: data inheritance. The Swagger documentation supports both methods. I don't fully know how you're going to use the data, but at this point I'd rather inherit data. Maybe I'm wrong.
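To make the two options concrete, here is an illustrative comparison using Python dataclasses. The class names and example values are hypothetical; only `id`, `full_name`, `medium_id`, and `agent_specific_attributes` appear in this thread.

```python
from dataclasses import asdict, dataclass


@dataclass
class Agent:
    """Fields common to every source."""
    id: str
    full_name: str


@dataclass
class MediumAgent(Agent):
    """Inheritance: Medium fields sit at the same level as the common ones."""
    medium_id: str


@dataclass
class MediumAccount:
    """Medium-only fields, kept in their own object."""
    medium_id: str


@dataclass
class ComposedAgent:
    """Composition: Medium fields are nested one level down."""
    id: str
    full_name: str
    agent_specific_attributes: MediumAccount


inherited = asdict(MediumAgent(id="a-1", full_name="Alice", medium_id="abc123"))
composed = asdict(
    ComposedAgent(
        id="a-1",
        full_name="Alice",
        agent_specific_attributes=MediumAccount(medium_id="abc123"),
    )
)
```

The inherited form yields a flat event, while the composed form keeps the source-specific fields under a single key; both carry the same information.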

I propose these changes:

evgenydmitriev commented 3 years ago

In GitLab by @zfinzi on Aug 30, 2019, 18:11

I agree with this breakdown, and this was the original idea. One agent event would only contain a single agent_specific_attribute, such as medium, twitter, github.

Common or inherited fields are all that exist outside of the agent_specific_attributes object; these include industry, tags, full_name, etc.

Then these can be brought together using the inherited values or aggregators such as Messari & ICOholder to create a unified agent-yupana event with all unique child fields.
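One possible way to bring per-source events together is sketched here. It assumes `full_name` is the shared inherited value used for matching and that each event names its source under `information_source.name`; the actual unification logic is not described in this thread.

```python
from typing import Any, Dict, List


def unify(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge per-source agent events that share the same full_name.

    Matching on full_name is an assumption made for illustration only.
    """
    merged: Dict[str, Dict[str, Any]] = {}
    for event in events:
        key = event["full_name"]
        unified = merged.setdefault(
            key, {"full_name": key, "agent_specific_attributes": {}}
        )
        source = event["information_source"]["name"]
        # Keep each source's child fields under its own key.
        unified["agent_specific_attributes"][source] = event[
            "agent_specific_attributes"
        ]
    return list(merged.values())


unified_agents = unify(
    [
        {
            "full_name": "Alice",
            "information_source": {"name": "medium"},
            "agent_specific_attributes": {"medium_id": "abc123"},
        },
        {
            "full_name": "Alice",
            "information_source": {"name": "github"},
            "agent_specific_attributes": {"login": "alice-dev"},
        },
    ]
)
```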

If you think there is a better way to format this on Stoplight, I will rework the event structures.

evgenydmitriev commented 3 years ago

In GitLab by @anshlykov on Jan 7, 2020, 16:10

closed