duckduckgo / p5-app-duckpan

DuckDuckHack OpenSource Development Application
http://metacpan.org/module/App::DuckPAN
Other
54 stars 47 forks source link

Add Fathead support for DuckPAN Query & Server #358

Closed moollaza closed 8 years ago

moollaza commented 8 years ago

/cc @zachthompson @jbarrett @jdorweiler

These changes add support to duckpan query and duckpan server for Fathead features used by open source Fatheads.

You can query the exact names of articles in the Output.txt file to see the result that will display. If an article is a redirect, it will follow the redirect and show the correct result. Disambiguations and Abstract Images are also supported.

Category Pages and External Links (infobox) support has not been implemented yet. But I plan to implement the latter soon. Also Fathead Triggers are not yet supported but I will add that soon as well, using the IA Page triggers. However, internally I don't believe we consider the triggers defined on the IA page (though we should!)

This approach ignores the DDG infrastructure and request handlers, which Spice and Goodies use. That system is heavily dependant on loading Fathead Perl modules into Blocks, but we removed the Perl modules long ago as they were only there to contain Fathead Metadata.

One other missing feature is the internal processing of the output.txt -- Internally we process each line of the output.txt and generate several variations of the Article title, e.g. for "background-color" we also create "background color" and "backgroundcolor".

I noticed that some Fathead parsing scripts implement this functionality themselves. We should probably make it known to developers that we'll do this work for them to simplify the parsers a bit.

Any feedback is welcome. I've not used DBI before so I wasn't sure about the best way to get the output.txt into a DB for searching. I've used DBD::CSV which interprets the output.txt (tab delimited) as a CSV and pulls it in to memory for SQL querying. I'm not sure how well this will work for very large output.txt files. We can easily test this though.

How to Test

  1. git checkout this branch
  2. Clone the Fathead repository: git clone git@github.com:duckduckgo/zeroclickinfo-fathead.git
  3. Start the server: $ duckpan -I ../p5-app-duckpan/lib/ server <instant_answer_id> e.g. $ duckpan -I ../p5-app-duckpan/lib/ server mdn_css
  4. Enter one of the article titles from the associated Output.txt as a search query to see the content displayed.

To view the changes made while ignoring whitespace changes: https://github.com/duckduckgo/p5-app-duckpan/pull/358/files?w=0

zachthompson commented 8 years ago

I'm not sure I understand the rationale of this being in duckpan. You can pull out the keys, or just dump them in a debug statement in your fathead script, and you know everything it will trigger on exactly.

moollaza commented 8 years ago

I'm not sure I understand the rationale of this being in duckpan. You can pull out the keys, or just dump them in a debug statement in your fathead script, and you know everything it will trigger on exactly.

@zachthompson I'm more interested in getting it to work in DuckPAN Server so developers can actually see what the results will look like. I can look at the output.txt to get an idea of what content I'm showing, but being able to see the result is a big bonus and means you don't have to wait for us to deploy it to Beta to see it working. In fact, you don't even really need to write any code if you just manually create an output.txt and load it into DuckPAN for testing. If we add support for more Fathead templates this will become even more useful.

I agree that the duckPAN query integration doesn't offer as much benefit, but there's still something I enjoy about working with a REPL to see that the queries I'm thinking about trigger and return results. IMO looking at the data in aggregate can be a little overwhelming.

GuiltyDolphin commented 8 years ago

@zachthompson I agree with @moollaza on duckpan server - many developers may wish to observe the Instant Answer in an environment similar to what they would see on production; and Beta has a huge reliance on deploys. If we want to increase the number of developers working on FatHead Instant Answers I think this is a good move.

jbarrett commented 8 years ago

Is this ready to start experimenting with? How can we help you test it?

I think there are a bunch of proposed fatheads in the pipeline we can start using as test load :)

GuiltyDolphin commented 8 years ago

@moollaza Could we use some of the information from the IA metadata (and ID) to generate some of the paths?

For example, rather than having to specify duckpan server -o lib/fathead/perl6_doc/output.txt you would just say duckpan server Perl6Doc or duckpan server perl6_doc and it would figure the rest out. Admitedly FatHeads are a bit of an edge-case with a different directory structure to Goodies and Spice, but this should be easily do-able.

Edit You mention not binding FatHeads to metadata - but we do have metadata for FatHeads (DDG::Meta::Data->filter_ias({repo => 'fathead'})) regardless of whether or not they have a Perl module; why would this be an issue?

moollaza commented 8 years ago

Is this ready to start experimenting with?

@jbarrett Yes! I've been using for the past week or so. At a very basic level it's working, but without the connection to Metadata it's not as good as it could be.

How can we help you test it?

Try testing any open Fathead PR's with this to see what works and what doesn't. I don't love specifying the output.txt path and would like to improve this

Could we use some of the information from the IA metadata (and ID) to generate some of the paths?

@GuiltyDolphin that's what I'd like to do, and your suggestion is exactly what I had in mind. It just feels weird specifying the name of a Perl Module which doesn't actually exist. IMO we need a better way to connect Fathead files in the repo to their metadata, because we're relying on the name of an imaginary Perl module which doesn't exist in the repo or anywhere else.

We should probably change "Perl Module" on the IA Page to "Directory" or similar which maps to a dir in https://github.com/duckduckgo/zeroclickinfo-fathead/tree/master/lib/fathead -- that's something real that the developer has knowledge of

@zachthompson @jbarrett @jdorweiler @GuiltyDolphin what are your thoughts on that?

moollaza commented 8 years ago

I just realized we assume the /lib/fathead/ dirs are named according to the Fathead IDs. That should make things easier 👍

moollaza commented 8 years ago

I've now added support for Metadata and remove the -o option.

Now you just specify the Fathead ID as the last argument:

e.g. duckpan server mdn_css

We're operating on the assumption that a dir exists in /lib/fathead/ with the same name which means we can easily check for the existence of an output.txt file in the dir

GuiltyDolphin commented 8 years ago

@moollaza This appears to be triggering on everything now? If I try foo with the perl_doc FatHead it returns a result with all the meta information but no text (as there is no field for foo).

moollaza commented 8 years ago

This appears to be triggering on everything now?

Yup, there was a bug. Should be all fixed now 👍

moollaza commented 8 years ago

I think this is pretty good for now. It does not yet support Category Pages, but AFAICT no open source Instant Answers use Category pages.

I haven't added support for External Links (infobox) yet either, but will once the mdn_css Fathead starts using them. I don't think any other IAs currently use the external links.

moollaza commented 8 years ago

Added a few more commits and updated the PR description

GuiltyDolphin commented 8 years ago

@moollaza Getting a Can't locate object method "_get_image" via package "App::DuckPAN::FatHead" .../lib/App/DuckPAN/FatHead.pm line 173. in duckpan query.

moollaza commented 8 years ago

@GuiltyDolphin whoops, didn't mean to change that function's name yet. Fixed

jagtalon commented 8 years ago

Looks like the URL is not showing in the "More at ..." even thought it can read it in DuckPAN:

screen shot 2016-07-27 at 3 03 07 pm screen shot 2016-07-27 at 3 03 16 pm
jbarrett commented 8 years ago

For some reason I was working off the impression this had been released and opened an issue: #365 - Once you've gotten a match, that match will always display until the next match.

jagtalon commented 8 years ago

I thought it had been released too :)) It took me an embarrassingly long time to figure that out.

GuiltyDolphin commented 8 years ago

@moollaza No support for related topics? Any chance we could get support for this soonish?

moollaza commented 8 years ago

No support for related topics? Any chance we could get support for this soonish?

@GuiltyDolphin I don't believe any existing IAs use it so I didn't have anything to test with. I was going to add support once the in-progress MDN CSS Fathead started using them 😉

moollaza commented 8 years ago

I thought it had been released too :)) It took me an embarrassingly long time to figure that out.

If everyone's already using this, then it sounds like there aren't any major issues with it?

@zachthompson care to merge? Happy to keep refining this as necessary.

GuiltyDolphin commented 8 years ago

@moollaza Okay, well it's not a killer atm, but we're getting a fair few FatHeads now which may wish to use related topics, so it should definitely be one of (if not the) next features to be implemented.

Otherwise, this looks good to merge - I've been using it a fair bit and it seems to work pretty well.

moollaza commented 8 years ago

Merging with @zachthompson's approval 👍

jagtalon commented 8 years ago

Woo!