dansinker / tacofancy

community-driven taco repo. stars stars stars.
The Unlicense
1.29k stars, 448 forks

Any chance we can talk data standards? #61

Open evz opened 10 years ago

evz commented 10 years ago

I'm attempting a tacofancy API over here and quickly realized that all I can reliably grok from the rendered markdown of the recipe pages is the title of the recipe (since it ends up in an h1). I suppose this might be a great opportunity for me to hone my regex skills or I could just be satisfied with the name, URL and text of the recipe but wouldn't it be great to have a web service that you could query by ingredient, preparation time, etc?

Personally, all I want is a taco randomizer which probably doesn't need all that other stuff but if I'm going to put the effort into building this part, I may as well do it right. On the other hand, most of the beauty of this project lies in its simplicity so I'd hate to be the guy that ruins that (in its first week no less).

knowtheory commented 10 years ago

Hey @evz, this came up yesterday in a couple of issues.

I've hacked up a little coffeescript runner which scans the repo and generates an index of recipes by ingredient in my fork: https://github.com/knowtheory/tacofancy

The discussion around formatting came up here: https://github.com/sinker/tacofancy/issues/49

I'm planning on spending some time this weekend hacking together an entity extractor that'll pull apart the ingredient lines into a better structured format.

evz commented 10 years ago

@knowtheory Nice! Guess it pays to read the closed issues. Taking a look at #49, it looks like the type of quality that I'd need to accomplish what I'm after is well under way. If you'll pardon my ignorance for a second, does the inclusion of that Cakefile in the root of the repo mean that it'll get run with every push? Or would it need to be manually executed?

knowtheory commented 10 years ago

The Cakefile is just a little build/runner script. It is currently entirely opt-in, so that people who just want to add/modify recipes can just ignore it (which i think is also preferable).

I've been fooling around locally with regexp-based extraction as a first step, which is roughly speaking okay for extracting quantities; the hard part is actually identifying the ingredients. That may require a dictionary and some deeper parsing, all of which should be doable :)
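
For illustration, here is a rough Python sketch of what a regexp-first pass over an ingredient line could look like. It is not the code from the fork: the unit list, the pattern, and the `parse_ingredient_line` name are all assumptions made up for this example, and it only handles lines that start with a quantity.

```python
import re
from fractions import Fraction

# Hypothetical unit list; a real one would need to be much longer.
UNITS = r"(?:cups?|tablespoons?|tbsp|teaspoons?|tsp|pounds?|lbs?|ounces?|oz|cloves?|cans?)"

# Quantity ("2", "1/2", "1 1/2", "1.5"), optional unit, then the rest of the line.
LINE_RE = re.compile(
    r"^\s*(?:[-*]\s*)?"                                # optional markdown bullet
    r"(?P<qty>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)"    # mixed, fraction, or decimal
    r"\s*(?P<unit>" + UNITS + r")?\b\.?\s*"            # optional unit, optional "."
    r"(?:of\s+)?(?P<rest>.+?)\s*$",                    # whatever is left over
    re.IGNORECASE,
)


def parse_quantity(text):
    """Turn '1 1/2', '3/4' or '2.5' into an exact Fraction."""
    return sum(Fraction(part) for part in text.split())


def parse_ingredient_line(line):
    """Split one ingredient line into quantity, unit and leftover text.

    Returns None when the line has no leading quantity; those lines are
    the ones that would need a dictionary or deeper parsing.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    unit = match.group("unit")
    return {
        "quantity": parse_quantity(match.group("qty")),
        "unit": unit.lower() if unit else None,
        "ingredient": match.group("rest"),
    }
```

Note that the "ingredient" field is just the leftover text ("shredded cabbage", "garlic, minced"), which is exactly where the hard part starts.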

We should figure out a structure for people to collaborate over extracting the relevant info! You're dropping things into an sqlite db?

evz commented 10 years ago

Yeah, just built tables based upon the current structure of the repo. Which, in all honesty, can probably be simplified (since all of the tables have basically the same columns). The only one that's slightly different is the FullTaco table, which attempts to create relations to its constituent parts. The script I wrote to build the basic DB just grabs the markdown for the INDEX.md page, renders it as HTML and then uses basic HTML/XML tree parsing to get all the links. Before loading each link, it decides what kind of a thing it is (base_layer, condiment, etc) and then saves the whole shebang (name, full URL and text of the recipe). In the case of the full tacos, it also follows the links to the other ingredient pages and attempts to make a relation (I'm using the full URL to the raw files as primary keys). It's using Flask and SQLAlchemy (well, mainly SQLAlchemy currently since I haven't actually made the Flask routes yet). The DB actually looks pretty OK (you can clone it and take a look yourself) but, as I mentioned when I opened the issue, it would be great to have more recipe metadata that is more easily machine readable.
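
To make the shape of that loader concrete, here is a simplified Python sketch. It is not the actual script: it pulls links out of the raw markdown with a regex instead of rendering to HTML and walking the tree, uses the standard-library `sqlite3` module instead of SQLAlchemy, and folds everything into one table with a category column (the simplification mentioned above). The `RAW_BASE` placeholder, the table layout, and the assumption that the top-level directory names the category are all hypothetical, and the FullTaco relations are left out.

```python
import re
import sqlite3

# Matches markdown links that point at .md files: [Name](some/path.md)
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+\.md)\)")

# Placeholder base for raw file URLs (an assumption, not the real location).
RAW_BASE = "https://example.invalid/tacofancy/raw/"


def extract_links(index_markdown):
    """Return (name, relative_path) pairs for every link to a .md file."""
    return LINK_RE.findall(index_markdown)


def category_of(path):
    """Guess the kind of recipe from its top-level directory (an assumption)."""
    return path.split("/", 1)[0] if "/" in path else "unknown"


def build_db(index_markdown, fetch_text):
    """Load every linked recipe into a single in-memory SQLite table.

    fetch_text is supplied by the caller and maps a relative path to the
    recipe text, so this sketch stays independent of any network code.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE recipe ("
        "url TEXT PRIMARY KEY, category TEXT, name TEXT, body TEXT)"
    )
    for name, path in extract_links(index_markdown):
        # The full raw URL doubles as the primary key, as in the description.
        db.execute(
            "INSERT OR REPLACE INTO recipe VALUES (?, ?, ?, ?)",
            (RAW_BASE + path, category_of(path), name, fetch_text(path)),
        )
    db.commit()
    return db
```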

At any rate, with what I've got now, I should be able to have a JSON endpoint together this evening that allows searches by name and, I guess, just a full text search on the recipe.
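
As a rough idea of what that search could look like behind the endpoint, here is a small Python sketch using only `sqlite3` and `json`. The Flask route itself is left out, and the `recipe` table, its columns, and the demo rows are invented for the example rather than taken from the real DB.

```python
import json
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table with made-up rows for the example."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE recipe (url TEXT PRIMARY KEY, name TEXT, body TEXT)")
    db.executemany(
        "INSERT INTO recipe VALUES (?, ?, ?)",
        [
            ("u1", "Carnitas", "Slow-cook pork with orange."),
            ("u2", "Pico de Gallo", "Tomatoes, onion, cilantro, lime."),
        ],
    )
    return db


def search_recipes(db, term):
    """Return a JSON string of recipes whose name or text contains term.

    Uses a plain LIKE match (case-insensitive for ASCII in SQLite), which
    is a crude stand-in for real full text search. Any % or _ in term is
    treated as a LIKE wildcard.
    """
    pattern = "%" + term + "%"
    rows = db.execute(
        "SELECT name, url FROM recipe "
        "WHERE name LIKE ? OR body LIKE ? ORDER BY name",
        (pattern, pattern),
    ).fetchall()
    return json.dumps([{"name": name, "url": url} for name, url in rows])
```

Passing the term as a bound parameter rather than formatting it into the SQL string keeps the query safe from injection once it is wired to a public route.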

dansinker commented 10 years ago

This discussion is about 98% above my head, but just wanted to chime in that the balancing act on standards is ease of submission vs ease for machines. I err towards the former, yet understand that getting this into a format that allows for awesome things like taco randomizers etc, is great. So anyway, excited to see where this goes.

hunterowens commented 10 years ago

Another option: use the GitHub API to create a Tacofancy submission engine, which creates a nice balance between ease of submission [go to this site, enter in recipe] and ease of categorization. It would also autoindex.

dansinker commented 10 years ago

submission engine feels less fun to me though--the forking and pull requests are what make this a git repo vs just another site you submit shit into.

hunterowens commented 10 years ago

Maybe we could set up Travis CI to validate the data? Not really too experienced with CI over here. That way, when you submitted a pull request, you could get a friendly message like:

"Please ensure your build passes by modifying the formatting. Here is what you need to change."

It does seem like a potential pain in the ass, though, that would discourage contribution.

EDIT: Better idea - use Travis to help those who maintain the index/repo keep content in a standardized form, but don't expose it on every pull request.

knowtheory commented 10 years ago

Will Travis cut it? It'll need read and write access to the file system as well as push and pull access to GitHub.

GitHub has a generic webhook API, so setting up a very small web service that did the same thing would be possible.

knowtheory commented 10 years ago

Oh sorry, i misunderstood the question. It should be possible to set up Travis to run a linter on the repository to check pull requests for proper formatting.
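
A linter like that could be very small. Here is a Python sketch under made-up rules: the thread only establishes that the recipe title ends up in an h1, so the title check follows from that, while the "at least one bulleted line" rule and the `lint_recipe` and `lint_files` names are assumptions for illustration. No Travis configuration is shown.

```python
from pathlib import Path


def lint_recipe(text):
    """Return a list of human-readable problems; an empty list means it passes."""
    lines = [line.rstrip() for line in text.splitlines()]
    content = [i for i, line in enumerate(lines) if line.strip()]
    if not content:
        return ["file is empty"]

    problems = []
    first = content[0]
    # Accept both "# Title" and the underlined "Title / =====" h1 styles.
    atx_title = lines[first].startswith("# ") and len(lines[first]) > 2
    setext_title = first + 1 < len(lines) and set(lines[first + 1]) == {"="}
    if not (atx_title or setext_title):
        problems.append("first non-blank line should be an h1 title")

    # Hypothetical rule: ingredients are expected as a bulleted list.
    if not any(line.lstrip().startswith(("* ", "- ")) for line in lines):
        problems.append("no bulleted ingredient lines found")
    return problems


def lint_files(paths):
    """Lint each file, print the problems, and return a process exit status."""
    failed = False
    for path in paths:
        for problem in lint_recipe(Path(path).read_text(encoding="utf-8")):
            print(f"{path}: {problem}")
            failed = True
    return 1 if failed else 0
```

A CI job would call `lint_files` with the changed recipe paths and exit with its return value, so a non-zero status fails the build and the printed lines become the "here is what you need to change" message.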

dansinker commented 10 years ago

Hey @knowtheory, with your Cakefile running smoothly now, does this discussion reemerge, or can we close?

cmcavoy commented 10 years ago

If you want to increase complexity by about 115%, you could ask that recipes be submitted in schema.org's Recipe schema. The advantage there is that Google indexes schema.org JSON-LD markup and adds it to their faceted search results. The disadvantage is that it's difficult and would make it harder for people to submit.
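
To give a sense of the overhead, here is a minimal Python sketch that wraps a recipe in schema.org Recipe JSON-LD. `name`, `recipeIngredient`, `recipeInstructions` and `HowToStep` are schema.org terms; the `recipe_to_jsonld` helper and its arguments are made up for the example, and a real submission would likely carry more fields.

```python
import json


def recipe_to_jsonld(name, ingredients, steps):
    """Serialize a recipe as a schema.org Recipe JSON-LD string."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Recipe",
        "name": name,
        # One plain-text string per ingredient line.
        "recipeIngredient": list(ingredients),
        # Each instruction becomes its own HowToStep object.
        "recipeInstructions": [
            {"@type": "HowToStep", "text": step} for step in steps
        ],
    }
    return json.dumps(doc, indent=2)
```

Even in this stripped-down form, a contributor would have to hand-write nested JSON instead of a few lines of markdown, which is the trade-off in question.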

If it makes you feel any better, we're having this exact debate on the Open Badges project - is the added complexity of using something like JSON-LD worth the added indexability? We're leaning towards not-at-all. We are describing the JSON BadgeAssertion structure in JSON Schema. It's going to let us build basic validators a little bit more easily. If you went in this direction, you'd still be working in pull requests, and could write hooks that would validate the incoming PRs.

Anywho, WELCOME TO THE NERDERY OF THE SEMANTIC WEB. POPULATION: NERDS.

knowtheory commented 10 years ago

Yeah! Microformats are one of those things that i really love and want to support. But I think for ease of contribution, requiring schema.org microformats is an imposition. It's one of those things that definitely should be done in the case of publication to a more formal website, though, imo.

knowtheory commented 10 years ago

@sinker i think since it's been several months... until we get some time at a hackathon or something i'm gonna call this defunct on my end for now.

Sorry :(

dansinker commented 10 years ago

That's OK sir.
