VIDA-NYU / domain_discovery_tool_deprecated

Seed acquisition tool to bootstrap focused crawlers

WIP Canavandl/domain xfilter #76

Closed canavandl closed 8 years ago

canavandl commented 8 years ago


Opening PR for review w/ @brittainhard 
canavandl commented 8 years ago

The cross filter is accessible at:

http://localhost:8084/cross_filter?session=%session args

I usually open the Page Statistics link, then edit the URL by hand to point at cross_filter.

canavandl commented 8 years ago

Now available through the "Page Statistics" button and at /statistics?session=... (whatever the session args are)

canavandl commented 8 years ago

ping @yamsgithub for review

I've added server-side caching of the elasticsearch queries, so the interactivity is close to a reasonable speed.

Note: I added a dependency on functools32 (a backport of functools for Python 2) in order to use lru_cache. You'll have to update your env accordingly (it's in the environment.yml file).
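For readers unfamiliar with lru_cache, here is a minimal sketch of the caching idea described above. It is not the code from this PR: `run_query` and `cached_query` are hypothetical names, and the real Elasticsearch call is replaced by a stand-in that only records how often it runs.

```python
# On Python 2, functools32 provides the same decorator:
#     from functools32 import lru_cache
from functools import lru_cache

CALLS = []  # records every time the "expensive" query actually runs


def run_query(index_name, field):
    """Hypothetical stand-in for an expensive Elasticsearch query."""
    CALLS.append((index_name, field))
    return ("result", index_name, field)


@lru_cache(maxsize=128)
def cached_query(index_name, field):
    # Arguments must be hashable (strings, tuples), because lru_cache
    # uses them as the cache key.
    return run_query(index_name, field)


first = cached_query("my_index", "tag")
second = cached_query("my_index", "tag")  # served from the cache
```

One caveat of this design: cached results go stale when the index changes, so something like `cached_query.cache_clear()` would be needed after new pages are indexed.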

todo:

canavandl commented 8 years ago

current status:

screen shot 2016-04-15 at 1 09 26 pm
canavandl commented 8 years ago

current status:

screen shot 2016-04-19 at 12 21 46 pm

canavandl commented 8 years ago

ping @yamsgithub for review

current status:

screen shot 2016-04-27 at 3 56 32 pm
brittainhard commented 8 years ago

@yamsgithub please take a look at this again.

yamsgithub commented 8 years ago

I see that you are using a call to read the whole index. This will not scale. You should be using elasticsearch aggregations.

def get_plotting_data(index_name, es=None):
    if es is None:
        es = default_es

    res = es.search(index_name, size=100000, fields=["retrieved", "url", "tag", "query"])

    fields = []
    for item in res['hits']['hits']:
        # Drop an empty-string tag so untagged pages carry no 'tag' key.
        if item['fields'].get('tag') is not None:
            if item['fields']['tag'][0] == '':
                item['fields'].pop('tag')
        fields.append(item['fields'])

    return fields
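
To illustrate the suggestion above, here is a rough sketch of how per-tag and per-query counts could be requested with a terms aggregation instead of fetching every document. The helper names (`build_counts_query`, `parse_counts`, `get_plotting_counts`) are hypothetical, and the sketch assumes the `tag` and `query` fields are indexed as exact values (not analyzed text); it is not the change that was made in this PR.

```python
def build_counts_query(fields=("tag", "query"), max_buckets=1000):
    """Build a search body that asks only for per-value document counts."""
    return {
        "size": 0,  # return no documents, only aggregation buckets
        "aggs": {
            "by_" + field: {"terms": {"field": field, "size": max_buckets}}
            for field in fields
        },
    }


def parse_counts(response):
    """Turn the aggregation buckets into {field: {value: doc_count}}."""
    return {
        name[len("by_"):]: {b["key"]: b["doc_count"] for b in agg["buckets"]}
        for name, agg in response["aggregations"].items()
    }


def get_plotting_counts(es, index_name):
    # `es` is assumed to be an elasticsearch-py client (or anything
    # exposing a compatible search(index=..., body=...) method).
    response = es.search(index=index_name, body=build_counts_query())
    return parse_counts(response)
```

With this shape the server does the counting, so the response size depends on the number of distinct tags and queries rather than on the number of pages in the index.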
yamsgithub commented 8 years ago

I can now see the graphs. But there is still no zoom.

canavandl commented 8 years ago

@yamsgithub - the plots don't currently have any zoom tool activated because I didn't think it added much to the visualization. I can add one though - do you want the box zoom or wheel zoom? Any other interactions?

yamsgithub commented 8 years ago

It would help when the number of queries and tags is large.

The wheel zoom would be more appropriate here.

canavandl commented 8 years ago

Added wheel_zoom and reset button:

screen shot 2016-05-25 at 12 24 39 pm

The buttons are kind of ugly, but there's not a lot that can be done. Alternatively, you could remove reset and keep the wheel zoom button hidden but always on. Also, do you want the pan tool (click and drag on the plot to move the view window)?

yamsgithub commented 8 years ago

Yeah...pan tool would be useful. This would also be useful on the page clustering window.

The plot of pages downloaded over time does not have the actual date on the plot.

yamsgithub commented 8 years ago

So the other things we mentioned:

Make the text bounding box transparent. Add a 'Help' button that pops up a text box where we can add all the instructions and features.

canavandl commented 8 years ago

@yamsgithub

Updates:

I wasn't able to reproduce your reset button issues in my Ubuntu VM on Chrome. If you pull the recent commits onto your branch and still have the issue, let me know. Also, please check your console to see if any helpful error messages are being logged.

The last change I've got to make is to add a callback onto the datetime picker widgets so that it fires on change like the tables. It's taking me a minute to figure it out, but I'll figure it out.

yamsgithub commented 8 years ago

@canavandl

I just tested the changes. Most are fine. Here are a few comments:

yamsgithub commented 8 years ago

I am also seeing this strange issue. So I make a new web query and then click on the page stats tab. I get the following error: error_stats

But if I restart the ddt server then I no longer see this error and all is working fine!

yamsgithub commented 8 years ago

OK...so I now see the date/time but the time is not local time.

canavandl commented 8 years ago

I moved the help hint into the nav bar on the far right and fixed the timeseries/local timestamp issue.

screen shot 2016-05-26 at 6 04 58 pm
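
As a small illustration of the local-time issue discussed above, the sketch below converts a UTC timestamp to local time before plotting. It assumes the retrieval time is stored as epoch seconds in UTC, which this thread does not confirm; `to_local` is a hypothetical helper, not the fix that was committed.

```python
from datetime import datetime, timedelta, timezone


def to_local(epoch_seconds, local_tz=None):
    """Convert epoch seconds (UTC) to a timezone-aware local datetime.

    With local_tz=None, astimezone() falls back to the system's
    local timezone.
    """
    utc_dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return utc_dt.astimezone(local_tz)


# Example with a fixed UTC-4 offset so the result does not depend on
# the machine running it.
eastern_daylight = timezone(timedelta(hours=-4))
example = to_local(0, eastern_daylight)
```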

todo:

canavandl commented 8 years ago

ping @yamsgithub

I believe I have resolved all of the issues/comments. Please review when you have time.

yamsgithub commented 8 years ago

@canavandl

The following errors still exist:

  1. The connections are still incorrect. For example, in the attached queries plot (from the data I sent you), the query ratatouille should not be connected to any other node (screenshot: error_queries). And in the tags network, why are there connections between the tagged nodes and the untagged node (screenshot: error_tags)?
  2. The template error is gone, but new queries do not seem to show up on the network.
canavandl commented 8 years ago

@yamsgithub

I sent an email about this:

I think the issue is that we're checking if different query results have domains in common, not specific pages. So query_A which returns nytimes.com/news_article_A would be linked to query_B's result of nytimes.com/news_article_B.

Do you want the links to be based on specific pages rather than only domains? (This is likely due to me not understanding the web scraping domain very well.) If so, it's a quick two-line fix.
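
To make the difference concrete, here is a small sketch of the two linking rules, using the nytimes.com example above. The function names and the `http://` scheme on the URLs are assumptions for illustration; this is not the code in the PR.

```python
from itertools import combinations
from urllib.parse import urlparse


def domain_of(url):
    """Key a page by its domain, e.g. 'nytimes.com'."""
    return urlparse(url).netloc


def page_of(url):
    """Key a page by its full URL."""
    return url


def link_queries(results, key):
    """Return the set of query pairs whose results share at least one key.

    `results` maps each query to the list of URLs it returned.
    """
    keyed = {query: {key(url) for url in urls} for query, urls in results.items()}
    return {
        (a, b)
        for a, b in combinations(sorted(keyed), 2)
        if keyed[a] & keyed[b]
    }


results = {
    "query_A": ["http://nytimes.com/news_article_A"],
    "query_B": ["http://nytimes.com/news_article_B"],
}
by_domain = link_queries(results, domain_of)  # linked: same domain
by_page = link_queries(results, page_of)      # not linked: different pages
```

Switching between the two behaviors only changes the key function, which is consistent with the change being small.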

yamsgithub commented 8 years ago

@canavandl

With the latest changes I see no links between the nodes that definitely have pages in common. There are no links between any nodes!