beerendlauwers / HaskAnything

Hask Anything! is a website aimed at collecting and organizing the collective knowledge of the Haskell community.
http://haskanything.com/
BSD 3-Clause "New" or "Revised" License

Import the DoHaskell content #33

Open beerendlauwers opened 8 years ago

beerendlauwers commented 8 years ago

@mitchellwrosen has kindly provided me with a database dump of his dohaskell.com website. He has also generated a bunch of YAML metadata for the content, available here: https://github.com/mitchellwrosen/dohaskell/blob/master/resources-dump.yaml

Scraper source is here: https://gist.github.com/beerendlauwers/102c833c7a98babede60fe05e4dc789b (for my own reference).

There are still a few categories that have to be added to accommodate most of the DoHaskell content, notably #10, #4, #2 and an "Article" category for some of the more generic ones (blog posts, Medium articles, etc.).

beerendlauwers commented 7 years ago

WIP here: https://github.com/mitchellwrosen/dohaskell/blob/666eaa575a4e63ef2a9e41c66a20ad3f5950d989/resources-dump-wip.yaml

duplode commented 7 years ago

I wrote a rough proof of concept converter that reads the YAML dump, generates Hakyll files in (an approximation of) the HaskAnything format for some of the entry types in it, and spits out the entries it didn't handle to a new YAML file. I don't know exactly how useful that will turn out to be -- things like cleaning up the 334 tags in the YAML dump and adding summaries to the entries will require manual work anyway -- but with some more polish a script like this might save a fair bit of time. By the way, the linked Gist also contains a modified version of the YAML dump with the necessary fixes so that Data.Yaml is able to parse it.
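As a rough illustration of the "convert what you can, pass the rest through" approach described above, here is a minimal sketch using only `base`. It is not the converter from the Gist: the `Entry` record, the `Category` type, and the type names matched in `classify` are all hypothetical stand-ins, and the real script parses the dump with Data.Yaml rather than working on hand-built values.

```haskell
import Data.Either (partitionEithers)

-- Hypothetical, simplified stand-in for one record of the YAML dump.
data Entry = Entry
  { entryTitle :: String
  , entryType  :: String  -- DoHaskell content type (names below are illustrative)
  } deriving (Eq, Show)

-- Hypothetical target categories on the HaskAnything side.
data Category = Article | Paper | Presentation
  deriving (Eq, Show)

-- Map a DoHaskell type to a category, if this sketch knows how to handle it.
classify :: String -> Maybe Category
classify "blog post" = Just Article
classify "paper"     = Just Paper
classify "video"     = Just Presentation
classify _           = Nothing

-- Split the entries into the ones that could not be handled (first component)
-- and the handled ones paired with their category (second component).
convertAll :: [Entry] -> ([Entry], [(Category, Entry)])
convertAll = partitionEithers . map step
  where
    step e = maybe (Left e) (\c -> Right (c, e)) (classify (entryType e))
```

The first component of the result is what would be written back out to a new YAML file for later manual handling; the second is what would be rendered as Hakyll files.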

beerendlauwers commented 7 years ago

Awesome, thanks! This already brings us a lot closer :)

duplode commented 7 years ago

@beerendlauwers You're welcome. I have improved the converter a bit -- now it can deduplicate tags, keep the output fields sorted, and recognise more content types and library tags -- and tweaked a few titles in the YAML dump to avoid file name collisions. I have put that in a proper repository, along with sample output and the full lists of DoHaskell tags and types. If you find the output is looking good, I can begin shaping it into a pull request for HaskAnything.
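For readers who want a picture of the two clean-up steps mentioned above (tag deduplication and sorted output fields), the following is a small sketch under stated assumptions. The function names are hypothetical, tags are compared by exact string equality, and metadata is modelled as plain key/value pairs; the actual converter may do any of this differently.

```haskell
import Data.List (nub, sortOn)

-- Drop repeated tags, keeping the first occurrence of each.
-- Assumption: exact string comparison, so "lens" and "Lens" stay distinct.
dedupTags :: [String] -> [String]
dedupTags = nub

-- Render metadata fields as "key: value" lines, sorted by key, so that
-- regenerated files stay stable from one run to the next.
renderFields :: [(String, String)] -> [String]
renderFields = map (\(k, v) -> k ++ ": " ++ v) . sortOn fst
```

`nub` is quadratic, which should be harmless for a few hundred tags; a set-based version would be the natural replacement if the list grew much larger.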

beerendlauwers commented 7 years ago

Thanks a bunch!

To answer the questions you pose in the README.md:

Are functional pearls best filled under "articles" or "papers"?

Good question. I think we'll have to look at this on a case-by-case basis. We can tag them with "Functional Pearl" in any case.

How many new content types are necessary to cover the still unprocessed DoHaskell content types?

I'll have to go over the remaining stuff in dhTypes.txt. Will get back to you on this.

Do we need separate tags for reflection (the concept) and reflection (the Edward Kmett library)? What about "lenses" versus "lens"?

No, we have a metadata field for tags and libraries, already. The concept can be added as a tag, the library as a library. The reason for splitting them up is that when we start with linking up all the Hackage libraries, we can identify which content refers to them.

How to handle capitalisation, especially relative to the existing HaskAnything content?

Capitalisation of?

duplode commented 7 years ago

Good question. I think we'll have to look at this on a case-by-case basis. We can tag them with "Functional Pearl" in any case.

Yup. That's in line with why I slipped in a dohaskell-type field -- with that, even if they are added under a single category at first we can still identify them and review the classification with relative ease.

No, we have a metadata field for tags and libraries, already. The concept can be added as a tag, the library as a library.

In that case, I will figure out a sensible way to add "lens" and "reflection" library labels to the entries that need them. In hindsight, I didn't phrase that very clearly. When I said "tags", I really meant "values in either the library field or the tags one", as in my mental model the library labels are just tags that go to a separate field because they immediately refer to a library. (My mental model might be wrong, though, so please correct me if need be!)

Capitalisation of?

Oops, I dropped a part of the sentence :) I meant capitalisation of tags. Unlike HaskAnything, DoHaskell had mostly lowercase tags, and so we have e.g. "cryptography" in the YAML dump versus "Cryptography" in the HaskAnything site. In any case, that's a very minor issue, as it would be simple to solve any such inconsistencies even after the import is done.

beerendlauwers commented 7 years ago

When I said "tags", I really meant "values in either the library field or the tags one", as in my mental model the library labels are just tags that go to a separate field because they immediately refer to a library. (My mental model might be wrong, though, so please correct me if need be!)

They're processed with the Tags datatype in Hakyll, but apart from that they're kept different: they are different facets (see http://haskanything.com/filter.html) and, of course, libraries will be linked to the actual Hackage packages.

capitalisation of tags

Yeah, we can do some naïve capitalisation ("if it's one word, capitalize the first letter"), but we'll have to go through those manually, probably.
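The naïve rule quoted above ("if it's one word, capitalize the first letter") could look roughly like this. The function name is hypothetical and the sketch uses only `base`; it is meant to show the rule, not how HaskAnything actually processes tags.

```haskell
import Data.Char (toUpper)

-- Capitalise the first letter of a tag, but only when the tag is a single
-- word. Multi-word and empty tags are returned unchanged and left for
-- manual review.
capitaliseTag :: String -> String
capitaliseTag tag = case words tag of
  [c : cs] -> toUpper c : cs
  _        -> tag
```

With this rule "cryptography" becomes "Cryptography", while something like "type classes" is left alone. Hyphenated tags count as one word and only get their first letter capitalised, which is one reason a manual pass would still be needed.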