VIDA-NYU / domain_discovery_tool_deprecated

Seed acquisition tool to bootstrap focused crawlers

Queries Plot #70

Closed brittainhard closed 8 years ago

brittainhard commented 8 years ago

@yamsgithub this is worth a look.

yamsgithub commented 8 years ago

I made a few changes to how pages are queried, so you can now get pages without doing any projections.

Some observations and suggestions:

  1. Zoom should be added to the queries plot.
  2. Can the thickness of each line be made proportional to the number of pages the two queries have in common? Then hopefully there will be a gradient in the connecting lines between the query nodes.
  3. The sitename stats seem to be off. The number of pages shown for each site seems much lower than expected.
  4. Similarly, the ending stats are also off.
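
Suggestion 2 could be sketched roughly as follows. This is only an illustration, not the tool's plotting code: `pages_by_query`, `shared_page_counts`, and `edge_widths` are hypothetical names, and the sample queries and page IDs are made up. Widths are proportional to the number of shared pages, normalized so the largest overlap gets `max_width`.

```python
from itertools import combinations


def shared_page_counts(pages_by_query):
    """Count the pages common to each pair of queries.

    pages_by_query (hypothetical structure): dict mapping a query string
    to an iterable of page identifiers returned for that query.
    """
    counts = {}
    for a, b in combinations(sorted(pages_by_query), 2):
        common = len(set(pages_by_query[a]) & set(pages_by_query[b]))
        if common:  # only connect queries that share at least one page
            counts[(a, b)] = common
    return counts


def edge_widths(counts, max_width=8.0):
    """Line width proportional to the shared-page count of each pair."""
    if not counts:
        return {}
    largest = max(counts.values())
    return {pair: max_width * n / largest for pair, n in counts.items()}


# Made-up sample data for illustration only.
pages_by_query = {
    "ebola": ["a", "b", "c"],
    "virus": ["b", "c", "d"],
    "outbreak": ["c", "e"],
}
counts = shared_page_counts(pages_by_query)
widths = edge_widths(counts)
```

Each width could then be passed to whatever line-drawing call the queries plot uses for the edge between the two query nodes.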

brittainhard commented 8 years ago

@yamsgithub about points 3 and 4, do you have some data to show this, or some output you are seeing but not expecting? Any info is appreciated.

yamsgithub commented 8 years ago

@brittainhard Well...I have the data in my Elasticsearch. I can give it to you. But I am guessing you can reproduce it with your data too? Are your site stats adding up to what you have in your Elasticsearch? How many documents are there in the index you are testing with?

brittainhard commented 8 years ago

@yamsgithub In my dataset I mostly only see .com. I think that might be the result of a few factors.

It could just be the fact that .com is a far more common domain name than .org or .net. Also, keep in mind that the statistics generated are dependent on the session info from the application itself. That is, if you only have 100 pages loaded from a particular query, you're only going to see info on those 100 pages (with the exception of the timeseries plot, which shows down-sampled data from all pages in the index). The sample size is small, so you're far more likely to see only .com endings.
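
The sampling effect can be illustrated with a small sketch. It is not the application's code: `ending_counts` is a hypothetical helper and the URLs are invented. It counts URL endings for a session subset and for the full set of pages.

```python
from collections import Counter
from urllib.parse import urlparse


def ending_counts(urls):
    """Count URL endings (".com", ".org", ...) for a list of URLs.

    Assumes every URL includes a scheme such as "http://"; without one,
    urlparse finds no hostname and the URL is skipped.
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host and "." in host:
            counts["." + host.rsplit(".", 1)[-1]] += 1
    return counts


# Invented URLs standing in for the pages in an index.
all_urls = [
    "http://a.com/1",
    "http://b.com/2",
    "http://c.com/3",
    "http://d.org/4",
    "http://e.net/5",
]
# Pretend the current session only loaded the first three pages.
session_urls = all_urls[:3]

session_stats = ending_counts(session_urls)
index_stats = ending_counts(all_urls)
```

With this sample, `session_stats` contains only .com, while `index_stats` also includes .org and .net.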

This can be changed such that the statistics dashboard shows information on all the pages, rather than just those grabbed using the current session info.

I don't think there are problems with my application code such that you would only see .com regardless of the URL passed to the plot, but I will double check.

brittainhard commented 8 years ago

@yamsgithub I think this is ready to merge.

yamsgithub commented 8 years ago

@brittainhard OK. So the fact that the site and endings table/graph show the stats for the pages selected on the visualization explains what I was seeing as a discrepancy. But the idea was that the stats page would let the user see all the data in that domain, not just the selected subset viewed on the visualization. Maybe by default we can show what is selected in the visualization window. But they should definitely be allowed to see all the data in the domain.

Also, it is confusing since the queries stats are for the whole domain and not just what is selected on the visualization page.

brittainhard commented 8 years ago

@yamsgithub if you would like me to expose the statistics for the entire domain, rather than the statistics based on the session data, I can do that. Just let me know.

That would be a separate PR though, I think.

yamsgithub commented 8 years ago

@brittainhard Yes...we should allow viewing the stats of the whole domain, as this would be required by the user to see what they have explored so far.

Another PR is fine.

brittainhard commented 8 years ago

@yamsgithub can we get this merged today?