Develop Google Trends Source for Spring Cloud Dataflow streams - $200 #6

Closed evgenydmitriev closed 3 years ago

In GitLab by @zfinzi on Mar 17, 2019, 23:04

Develop a googletrendstream source component, which can be easily integrated into Spring Cloud Dataflow streams.

Toolset: JAVA, Spring-dataflow, docker

If you want to lock this issue to make sure no one else is working on it, please comment below and send us your resume at careers@incasec.com. After your resume review, we'll add "in progress" tag and assign the issue to you. Upon request, we can also create an escrow job on one of the freelancer websites (Upwork, fl.ru, etc). All of this is optional - you can skip this step if you just want to show us the result.
You need to create a separate personal git project and provide the issue creator with access to your repository for code review.
Upon completion of the project, please add "release" branch, create merge request from master to release, assign the issue creator to it, and leave a comment here.
After resolving all our comments associated with the merge request, we'll release the payment, and move the project into our repository.

Component should

be dockerized.
communicate with a stream via RabbitMQ-broker.
be configured through spring-dataflow-dashboard GUI to receive initial input * data: token, comma-separated list of channels.
yields new messages from those channels to self-output in SCHEMA format.
tests should be provided.
please, ask us about not-so-popular 3rd party libs, before using.

following commands should be provided:

a command for building docker-image with the component; a command for building application-metadata jar package companion.

Definition of done

all review discussions are closed.
docker-image with the component and the metadata-package are built successfully.

In GitLab by @zfinzi on Mar 17, 2019, 23:05

changed the description

In GitLab by @zfinzi on Mar 17, 2019, 23:11

changed title from Develop Google Trends Source for Spring Cloud Dataflow streams{- - $500-} to Develop Google Trends Source for Spring Cloud Dataflow streams

In GitLab by @zfinzi on Mar 17, 2019, 23:11

changed title from Develop Google Trends Source for Spring Cloud Dataflow streams to Develop Google Trends Source for Spring Cloud Dataflow streams{+ - Intern Project+}

In GitLab by @ghost1 on Mar 19, 2019, 04:07

Working On this project

In GitLab by @evgenydmitriev on May 6, 2019, 23:46

changed title from Develop Google Trends Source for Spring Cloud Dataflow streams - {-Intern Project-} to Develop Google Trends Source for Spring Cloud Dataflow streams - {+$200+}

In GitLab by @durm on Jun 13, 2019, 13:19

@zfinzi yields new messages from those channels to self-output in SCHEMA format..

where is the SCHEMA for output messages?

In GitLab by @zfinzi on Jun 13, 2019, 16:24

I'll work with @juanmigutierrez to build a schema for the data. My mistake for not having on to begin with.

In GitLab by @MarioIshac on Sep 18, 2019, 05:31

Working on this, starting from scratch like suggested here.

In GitLab by @MarioIshac on Sep 18, 2019, 05:32

assigned to @MarioIshac

In GitLab by @MarioIshac on Sep 23, 2019, 24:51

Hi Zach, did a schema ever get built for this source? If not, we can start work on building one (I've got the JSON data that Google Trends backs their graphs with per request, so it should be relatively easy given we know what they provide through their API).

In GitLab by @zfinzi on Sep 23, 2019, 04:25

We had devised a schema with a former individual tackling the bounty issue but they deleted their GitLab project so I can no longer see the thread. That is my mistake for not backing those files up.

Initially we decided that the Interest Over Time and Interesst by Subregion panels were of interest to us with modifications to the Region selection and Time Selector.

Prupose a suitable schema or set of schemas for this data and I will provide feedback.

Also make sure to tag people in your comments like so : @MarioIshac - this makes it easier for me to see your questions.

In GitLab by @MarioIshac on Sep 26, 2019, 08:29

@zfinzi

Here is the format of the data returned from Google Trends for Interest over Time:

{
  "default": {
    "timelineData": [
      {
        "time": "1411862400",
        "formattedTime": "Sep 28 – Oct 4, 2014",
        "formattedAxisTime": "Sep 28, 2014",
        "value": [
          64,
          1
        ],
        "hasData": [
          true,
          true
        ],
        "formattedValue": [
          "64",
          "1"
        ]
      },
      ...
    ],
    "averages": [
      78,
      2
    ]
  }
}

The above is for a comparison between two keywords, timelineData has a bunch of these percent comparisons for each time value on it. If we queried for one keyword (let's say we wanted to compare a keyword's popularity to itself along the timeline rather than against another keyword), value, hasData, formattedValue and averages would all be 1 element.

Here is the format for Interest over Region (with the region selection being "COUNTRY" as an example)

{
  "default": {
    "geoMapData": [
      {
        "geoCode": "US",
        "geoName": "United States",
        "value": [
          98,
          2
        ],
        "formattedValue": [
          "98%",
          "2%"
        ],
        "maxValueIndex": 0,
        "hasData": [
          true,
          true
        ]
      },
      ...
    ]
  }
}

I think a schema would depend on whether this source would be used to compare a keyword to itself along the timeline/regions (in which case the schema would have a one-dimensional array containing the data), or against another other keywords (in which case a two-dimensional array would store data, the axes being time/region and keyword index in some order). Average would be a value in the former case, or an array of values in the latter with length equal to the number of keywords being compared against each other.

We could also do a generic solution where if the data for one keyword is wanted (presumably to compare it against itself), the schema is the same as if the data for multiple keywords was wanted. In the one-keyword case, all the arrays would be of size 1.

Thoughts on which approach would work best?

In GitLab by @zfinzi on Sep 27, 2019, 11:47

@MarioIshac great work with the formatting, here are my thoughts:

I believe that it will be most useful to do comparisons of a keyword to itself therefore making the arrays of size 1.
Since we want a comparison of the keyword to itself, perhaps the fields should be of a field type string or int rather than arrays.
For the field formattedTime - this should be split into a startTime and endTime

I have some questions about some of the other fields:

How are you handling storage of multiple data points for the Google Trends for Interest over Time - will this be in one event or multiple events for each data point over a period of time?
Will formattedAxisTime be related to the value in the value field? Also you should change the name of this field to be more representative of the metric.
Why is there a value and formattedValue field? This seems redundant.

In GitLab by @MarioIshac on Sep 30, 2019, 06:38

@zfinzi

For question 1, I'll send all data points over a queried period of time as one event because each value of the datapoints for the period of time returned depends on the other datapoints. It's all relative: A value of 50 at a certain time would mean that a term had half its max value (popularity) over the period of time at that time. Sending as one event ensures the relevant context is there.

For questions 2 and 3, Google Trends sends all this data to construct the widget displayed in the browser, so I would remove the data used for the presentation layer (like formattedValue and formattedAxisTime).

I still might have to fine tune the schema, but so far for the time series data it will sent as an array of instances with:

startTime (date as str)
endTime (date as str)
value (int from 0 - 100)
hasData (boolean: if false, value will be 0)

As for geomap data, it will sent as an array of instances with:

geoCode (str)
geoName (str)
value
hasData

In GitLab by @MarioIshac on Sep 30, 2019, 07:17

Project is here: https://gitlab.com/MarioIshac/googletrendstream, anticipating completion by Tuesday.

In GitLab by @zfinzi on Sep 30, 2019, 19:29

Thank you for answering my questions, things are more clear now.

I still do not understand what you mean by:

the time series data it will sent as an array of instances with:

Otherwise I agree with the format of the data event. Fine tuning can come when you have example sets to test.

In GitLab by @MarioIshac on Oct 3, 2019, 08:08

@zfinzi

Just an update that I'm still trying to figure out how to structure the "outer" part of the schema (by outer I mean the JSON nodes wrapping the direct parents of each value). I'll be able to soon clarify what I mean by "the time series data will [be] sent as an array of instances" once I have a solid idea on how to do the outer part.

Alongside that, I've almost finished allowing configuration of the region and time selectors, but still need to make the time selector configuration more flexible (currently data can be collected down to every day in a specific time range, but I can make that down to every hour in that specific time range with a bit more work). I'll also have an update on this soon.

In GitLab by @ngans20 on Oct 3, 2019, 10:20

In my opinion, the granularity for this source doesn’t actually need to see the hourly. It might be nice to have, but mostly I imagine this source to be used for macro comparative analysis.

In GitLab by @MarioIshac on Oct 5, 2019, 23:58

@zfinzi @ngans20

Here's an example to clarify what I mean by how the time series data will be sent:

{
  "timeData": [
    {
      "startTime": "2019-01-01",
      "endTime": "2019-01-01",
      "value": 100,
      "hasData": true
    },
    {
      "startTime": "2019-01-02",
      "endTime": "2019-01-02",
      "value": 58,
      "hasData": true
    },
    {
      "startTime": "2019-01-03",
      "endTime": "2019-01-03",
      "value": 53,
      "hasData": true
    },
    {
      "startTime": "2019-01-04",
      "endTime": "2019-01-04",
      "value": 67,
      "hasData": true
    },
    {
      "startTime": "2019-01-05",
      "endTime": "2019-01-05",
      "value": 83,
      "hasData": true
    },
    {
      "startTime": "2019-01-06",
      "endTime": "2019-01-06",
      "value": 78,
      "hasData": true
    },
    {
      "startTime": "2019-01-07",
      "endTime": "2019-01-07",
      "value": 47,
      "hasData": true
    }
  ]
}

The above data would represent an event the source generates for data corresponding to a query from 2019-01-01 to 2014-01-07 for a certain keyword, with granularity being daily. If the granularity was configured to not be daily, then the startTime and endTime for each datapoint would differ.

Final thing to do, so far, before the project is done is structuring when and what the source queries,

The problem is that we can't just query for the current date because Google Trends returns values relative to the max of a time range queried. If the time range consists of one day (the current date), then the value for that day will always be 100. In order to merge two time ranges into one continuous time range, the time ranges need to have atleast one datapoint at the same time in common. (For example, it would be impossible to merge data returned for 9/21 -> 9/23 and 9/24 -> 9/26, but it would be possible for 9/21 -> 9/23 and 9/23 -> 9/26 because they share 9/23 in common).

This isn't a problem if we query past time ranges because we could query the whole time range at once. But for time ranges whose datapoints exist in the future, we would have to merge the parts of the time range we could get now with the parts we would get in the future, requiring a merge at the end in order to be able to compare datapoints of different days.

This means that for a certain day in the future, we couldn't just query data for that day. We would have to query for a time range that includes that day (to get the relevant datapoint) but also another datapoint that overlaps with previously queried datapoints.

Assuming the granularity is daily, an option I have in mind is on a certain day, the source would query for a time range consisting of the day before and that certain day. This would allow us to continuously merge new datapoints into the old series of datapoints accumulated.

Thoughts?

In GitLab by @ngans20 on Oct 6, 2019, 21:24

In order to merge two time ranges into one continuous time range, the time ranges need to have atleast one datapoint at the same time in common. (For example, it would be impossible to merge data returned for 9/21 -> 9/23 and 9/24 -> 9/26, but it would be possible for 9/21 -> 9/23 and 9/23 -> 9/26 because they share 9/23 in common).

Honestly I dont understand what you mean by merge the data. Because the values returned are only a relative percentile for the selected range I dont know how "merging" would work. Here's an example:

9/1	9/2	9/3	9/4	9/5
1	0	0	2	0

If a term had one search on 9/1, two on 9/4, and none on all other days, merging results from 9/1->9/3 & 9/3->9/5 would incorrectly value the one and two searches as equally weighted.

The search 9/1->9/3 would yield 100 | 0 | 0
The search 9/3->9/5 would yield 0 | 100 | 0
The merge of these two would yield 100 | 0 | 0 | 100 | 0
The actual search 9/1->9/5 would list ~ 50 | 0 | 0 | 100 | 0

In GitLab by @ngans20 on Oct 6, 2019, 21:29

Since all of the data is relative, we might just want regularly scheduled over like the past 7 days and past 30 days and then have a periodically updated inputlookup file for 2008/2009 to today? @zfinzi what do you think

In GitLab by @MarioIshac on Oct 7, 2019, 24:01

@ngans20

I think the common datapoint between the two time ranges to be merged have to have a non-zero value. Take these two sample searches:

9/1 -> 9/3: 100 | 0 | 50

9/3 -> 9/5: 25 | 100 | 0

Because the value of 9/3 in 9/1 -> 9/3 is twice the value of 9/3 in the 9/3 -> 9/5, you know the absolute max of 9/1 -> 9/3 is half the the absolute max of 9/3 -> 9/5. Knowing that, you can halve every value in 9/1 -> 9/3 in order to get the relative percentiles compared to the absolute max of 9/3 -> 9/5, yielding the following:

9/1 -> 9/5: 50 | 0 | 25 | 100 | 0

Another example:

9/1 -> 9/3: 100 | 95 | 10

9/3 -> 9/5: 40 | 88 | 100

In this case, because the value of 9/3 in 9/1 -> 9/3 is a fourth of the value of 9/3 in 9/3 -> 9/5, you know the absolute max of the 9/1 -> 9/3 is four times the absolute max of 9/3 -> 9/5, requiring each value in 9/3 -> 9/5 to be divided by four:

9/3 -> 9/5: 100 | 95 | 10 | 22 | 25

If 9/3 had a value of 0 here, when attempting to find the ratio between the absolute max of 9/1 -> 9/3 to 9/3 -> 9/5 you would get (0)(Absolute_Max_Of_First_Time_Range) = (0)(Absolute_Max_Of_Second_Time_Range), making the ratio impossible to find and the merge impossible.

I might be completely flunking the logic here, but this is the thought process I originally had when thinking about how to merge.

I'm not sure about what to do in the event that the common datapoint we request for does end up having a value of 0. Especially if we're getting small time ranges at a time, a non-partial datapoint with value 0 is pretty rare, but if it does happen: All we need is one non-zero datapoint in common in order to find that ratio between the absolute maxes, so we could send a request for a time range with another common datapoint that is non-zero (and do the merging from there).

For example, let's say we want to append the day 10/7 into our current timeline from some date up to 10/6. We can't request for 10/7 by itself, so we could do the following: Request from the latest date with a non-zero value up to 10/7, find the ratio above, and scale 10/7 as appropriate. If 10/6 had a value of zero, we could request from 10/5 -> 10/7, find the ratio between the values of 10/5 on the timerange we already have and the one we requested, then scale 10/7 as appropriate and append it to the existing timeline.

In GitLab by @zfinzi on Oct 7, 2019, 05:06

I'm wondering what the use is of merging this data when a specified time range in google trends provides us with all the contextual information we need. We are never going to get the raw data points from this source so our use is limited to a relative representation anyway.

It would be best not to try and derive anything from the data when we can get this raw information from other sources. @MarioIshac you have thought through this in quite depth but I don't think we need to over complicate things.

I'm wondering what the necessity is for historical google trends data, perhaps if we are looking for web traffic data we can better focus our time on other sources (my web traffic module I neglected to complete).

If the only goal for this source is to provide the relative traffic format for our front end clients, then I say we go with a html widgets embedded in Dashboards, such as in this example.

I apologize @MarioIshac if this derails your work on this source. Before we make any shifts, I am curious as to what you and @ngans20 think. Perhaps either of you have a reason for pulling and storing these data points. Let me know your thoughts.

In GitLab by @ngans20 on Oct 8, 2019, 21:08

Yeah I agree. We dont need to derive anything. Also this is something that could be implemented down the line, or even on the fly in splunk.

In GitLab by @zfinzi on Oct 9, 2019, 01:40

@MarioIshac what are your thoughts here?

If you agree - then I suggest we change pace and set do outs for the web traffic module. I apologize for the lane switch here, but I think your work brought up some important points in regard to this source.

I was previously working on the web traffic module so I can give you a overview of what I started with and we can go from there.

In GitLab by @MarioIshac on Oct 15, 2019, 24:51

mentioned in merge request MarioIshac/googletrendstream!1

1712n / challenge

Develop Google Trends Source for Spring Cloud Dataflow streams - $200 #6