Comparison with haskell-indexer

Hey, this looks really slick, great job!

There is a similar tool out there currently, haskell-indexer. Are you aware of it? If yes, could you shortly compare it to your approach? If no, could you summarise how your haskell-code-indexer works? Is it also a wrapper around ghc, or a plugin, does it use a custom parser, or something else?

We are currently building a tool that takes as input call graph information and lets you do queries on it (e.g. "give me all code paths from function f to function g): https://github.com/qrilka/haskell-navigation

Right now we are based on haskell-indexer; perhaps it would make sense for us to be able to work on the output of your tool as well?

Thanks!

I'm aware of haskell-indexer, although I haven't yet looked at the implementation details. What is clear is that both projects use GHC API http://hackage.haskell.org/package/ghc to extract information from the Haskell AST (so, no custom parsers or type checkers). Conceptually, both tools are quite similar.

The main difference is the output data. The output of haskell-indexer can be used by existing tools that support Kythe schema (e.g., Kythe's Web UI https://kythe.io/docs/exploring-kythe-sample-web-ui.html). The output of the indexer from haskell-code-explorer: https://haskell-code-explorer.mfix.io/package/haskell-code-explorer-0.1.0.0/show/src/HaskellCodeExplorer/Types.hs#L61 is supposed to be used by haskell-code-server (of course, it is possible to use haskell-code-explorer as a library to analyze and transform the output in any way you need).

Now, I'm getting familiar with Kythe schema and haskell-indexer project to see if it is possible to transform the output of the indexer from haskell-code-explorer to Kythe format.

As for creating the call graph for the haskell-navigation, you definitely should use haskell-indexer, because at the moment it is not possible to generate the call graph of a function using the output of the indexer from haskell-code-explorer.

Thanks for your explanations!

As for creating the call graph for the haskell-navigation, you definitely should use haskell-indexer, because at the moment it is not possible to generate the call graph of a function using the output of the indexer from haskell-code-explorer.

But why would that be? Your UI supports a Find references feature -- don't you need call graph information in order to do that?

Another question: It seems like currently one has to index every cabal package manually. haskell-indexer supports running against an entire stack build, indexing all packages on the way. Do you foresee that you can have similar automation for haskell-code-explorer? (How) Did automate building the large database of packages included in https://haskell-code-explorer.mfix.io?

References in Haskell code explorer are represented as a simple HashMap:

references :: HashMap ExternalId (Set IdentifierSrcSpan)

newtype ExternalId = ExternalId
  { getExternalId :: T.Text
  } deriving (Show, Eq, Ord, Generic, Data, Hashable, A.ToJSONKey)

data IdentifierSrcSpan = IdentifierSrcSpan
  { modulePath :: HaskellModulePath
  , line :: Int
  , startColumn :: Int
  , endColumn :: Int
  } deriving (Show, Eq, Ord, Generic, Data)

ExternalId looks like this: "base-4.10.1.0|GHC.Show|Val|show".

The function that fills the HashMap is a foldl' over all occurrences of identifiers in a source code.

To be able to create a call graph we need to keep track of which function calls which one.

haskell-indexer creates an XRef data type for each indexed Haskell module:

(from https://github.com/google/haskell-indexer/blob/12d6a46de5c19be708aff5bd0ff4bf63a488eada/haskell-indexer-translate/src/Language/Haskell/Indexer/Translate.hs)

data XRef = XRef
    { xrefFile      :: !AnalysedFile
    , xrefModule    :: !ModuleTick
    , xrefDecls     :: [Decl]
    , xrefCrossRefs :: [TickReference]
    , xrefRelations :: [Relation]
    , xrefImports :: [ModuleTick]
    }
deriving (Eq, Show)

-- | A Tick(et) is a globally unique identifier for some entity in the AST.
-- Not mandatory, but is ideally stable across multiple compilations.
--
-- Not related to GHC's SCC annotations (called ticks internally).
data Tick = Tick
    { tickSourcePath :: !SourcePath
    , tickPkgModule  :: !PkgModule
    , tickThing      :: !Text
      -- ^ The unqualified name of the entity.
    , tickSpan :: !(Maybe Span)
      -- ^ Can be broader or just loosely connected to the physical location
      --   of the entity in source. Should only be used for generating a unique
      --   tick string. Use other spans in Decl for source linking.
    , tickUniqueInModule :: !Bool
      -- ^ If true, the generated unique name can omit the span.
      --   This usually signals top levelness too.
      --   TODO(robinpalotai): make the distinction clear? Rename?
    , tickTermLevel :: !Bool
      -- ^ Needed to disambiguate same name occuring in term and type level.
    }
deriving (Eq, Ord, Show)

data TickReference = TickReference
    { refTargetTick       :: !Tick
    , refSourceSpan       :: !Span
      -- ^ The precise location of the reference. Frontends probably want to
      --   make this a hyperlink on the UI.
    , refHighLevelContext :: !(Maybe Tick)
      -- ^ The context from which the reference originates. Theoretically a
      --   frontend could infer the context (enclosing scope) from the reference
      --   source span, but 1) it is not obvious how large context to choose,
      --   and 2) since the compiler already has the scoping info, it is easier
      --   for the indexer to emit it.
      --
      --   Here we pragmatically set the context to the current top-level
      --   function, if any. On the UI, this might show up as the next element
      --   in the call chain - see 'ReferenceKind'.
    , refKind             :: !ReferenceKind
} deriving (Eq, Show)

data ReferenceKind
    = Ref      -- ^ Reference
    | Call     -- ^ Function call
    | TypeDecl -- ^ Usage of identifier in type declaration, left to "::"
deriving (Eq, Ord, Show)

Tick here is a Haskell identifier/function, TickReference represents the fact that one function calls another function. [TickReference] can be thought of as a list of edges in a graph.

Here XRef is converted to Kythe schema: https://github.com/google/haskell-indexer/blob/12d6a46de5c19be708aff5bd0ff4bf63a488eada/haskell-indexer-frontend-kythe/src/Language/Haskell/Indexer/Frontend/Kythe.hs#L61

About the automation:

I have a small (quick-and-dirty) Haskell program that builds and indexes (using haskell-code-indexer) Cabal packages on my local machine (btw, indexing 480 packages takes approximately 2 hours). I haven't released that program yet because it contains hard-coded paths. I will clean it up and publish it.

Author of haskell-indexer here. Thanks for the comparison so far! I'm really happy about multiple tools popping up in the same space, as it provides chance for knowledge exchange and healthy competition :)

@nh2: about the callgraph, indeed the refHighLevelContext in TickReference pasted by @alexwl makes that possible.

Now both haskell-code-explorer and haskell-indexer dives manually into the GHC AST 1, which as I learned not something you want to do (well, it's nice to get to know the GHC AST). The representation is too irregular, and version updates introduce smaller or larger changes.

For a while I had plan to rebase the GHC backend on haskell-tools-ast, which promised a higher-level and less volatile API. But with HIEFiles coming maybe in GHC 8.8, it likely makes sense for both projects to migrate info extraction to HIE. Well, the proof of the pudding is trying to compile it, so who can tell, but...

As for frontend, haskell-indexer only has kythe.io's debug frontend. You can see it work at stuff.codereview.me, but it is quite quirky. I should have a look if using haskell-code-explorer's UI as a generic Kythe frontend is feasible.

As for why would one choose to emit Kythe format entries? My main motivation (apart from making Haskell xrefs available in Google CodeSearch) was that Kythe has indexers for other languages (so far C++, Java, Go, Protobuf are stable, some others like Python, JS in the experimental phase), and has a concept of cross-language references. So if executed well (which haskell-indexer does not do yet), one could get xrefs to FFI functions, protobufs, or establish a cross-language callgraph.

@robinp Thank you for the explanations!

Also, thank you for the Kythe-LSP/LSIF comparison (https://gist.github.com/robinp/76f9d3d91387da5162f773895d4e1d15). It really helps to understand how Kythe and LSP/LSIF relate to each other.

I should have a look if using haskell-code-explorer's UI as a generic Kythe frontend is feasible.

It is possible, but it would require to abstract out all Haskell-specific logic from both JS Web UI and HTTP API server.

A bit of background info:

My goals with haskell-code-explorer are to create a code-browsing tool that deeply understands Haskell and to experiment with code intelligence features that are Haskell-specific and are not considered 'standard' (not supported by LSP yet).

As a result of this, Haskell-specific (or GHC-specific, since GHC is the de-facto standard Haskell compiler) logic is everywhere in the code of the Web UI.

For example, the JS code responsible for syntax highlighting looks like this:

if(idOcc.sort.tag === 'TypeId') {
  color = colorTheme.typeColor;
} else if(idOcc.description === "HsLit" ||
          idOcc.description === "HsOverLit"||
          idOcc.description === "LitPat" ||
          idOcc.description === "NPat" ||
          idOcc.description === "NPlusKPat" ||
          idOcc.description === "OverLit") {
  color = colorTheme.literalColor;
} else {
  /* *** */
}

alexwl / haskell-code-explorer

Comparison with haskell-indexer #2