
Emacs Speaks Statistics: ESS
https://ess.r-project.org/

Eliminating unnecessary diversity in Emacs noweb modes #93

Closed: nrnrnr closed this issue 6 years ago

nrnrnr commented 10 years ago

I am the author of noweb. For some time I have been aware that noweb support for Emacs users has been in a parlous state. I was very pleased to learn of your efforts to create an improved noweb-mode as part of ESS.

At present I am aware of at least three other competitors:

I have been using ESS noweb mode on an experimental basis for a few weeks, probably amounting to no more than 10 hours in total. I am experiencing intermittent failures, not just in noweb-related functionality, but in basic Emacs commands such as kill-line and query-replace. It's possible that these issues occur because I have both ess-noweb-mode and noweb-mode installed on the same system. It's possible that they occur because I'm using Emacs 23.4. But setting these issues aside, I am excited by your work, and I find my preliminary experience very promising. If the issues were resolved, I'd be able to endorse your work to those of my users who use Emacs.

I am wondering if you would like to join forces? The time I have available for noweb is very limited---I am rewarded very heavily for new work and not at all for improvements to noweb---but the chance to partner with somebody who has real Emacs skills is too promising to ignore. Would you be interested in working together to refactor the code so it can stand independent of ESS, and perhaps to incorporate some of Dave Love's good ideas, and squeeze out some bugs? I would love to be able to replace the old, crappy code I am distributing with something based on your work.

vspinu commented 10 years ago

Hi,

Thanks for the proposal. I am the author of polymode, which pushes Dave Love's ideas many levels further. It is highly versatile and fully extensible. Creating a new mode is usually a couple of lines of declarations (see poly-R.el for examples, where a dozen modes are defined in less than 200 lines of code).

It is also designed specifically for literate programming. There are specialized classes for weavers and exporters, and they respect the polymode inheritance.

So, the answer to your question is definitely "yes"! I would be so happy to join forces. Long overdue, IMO. I am myself overcommitted on so many levels that it makes me cry. Polymode started a year ago, and it is only now that I am barely making the first release :(

I understand that I am turning the argument the other way around by asking you to join polymode project. But, given that polymode is way more general in its intent and is easily extensible through a class system, my bias aside, I think it is the way to go.

I am in the process of documenting the development API. A very early draft is already there.

Thanks for reaching out to us.

nrnrnr commented 10 years ago

Taking this to email.

nrnrnr commented 10 years ago

Thanks for the proposal. I am the author of polymode, which pushes Dave Love's ideas many levels further. It is highly versatile and fully extensible. Creating a new mode is usually a couple of lines of declarations (see poly-R.el for examples, where a dozen modes are defined in less than 200 lines of code).

It is also designed specifically for literate programming. There are specialized classes for weavers and exporters, and they respect the polymode inheritance.

I had a look at the code base. There's a lot going on there. I sympathize with the idea of building on powerful abstractions (which doesn't happen often enough in the emacs world), but for somebody brand new to come in and contribute, the documentation is a little thin. I am certainly willing to try to help push things forward, but in order to make a contribution I think I will need some guidance.

In order even to make a start, I will need to know how to set the major mode to be used in code chunks.

I am in the process of documenting the development API. A very early draft is already there.

I had a look. I've done a little prototype-based OO programming in Lua, and I'm relatively comfortable with the concepts, but I've never used CLOS or eieio, so I'm going to be fairly slow and useless, at least at the start. It looks like modes/poly-noweb.el is still pretty sparse, and in particular I don't see how to set the major mode for either code chunks or documentation chunks. Looking at the documentation and the code I see that I need to have pm-basemode and pm-submode objects, but I'm not sure how to create them or where to splice them in.

I guess that's where I need to begin.

Norman

vspinu commented 10 years ago

Hi Norman,

I was a bit slow with the API documentation because the naming conventions were not settled. Some more refactoring and tidying is on the way. Then I will proceed with detailed docs and examples. Hopefully already this week.

It looks overwhelming without documentation, but the idea is pretty simple. There is a pm-config object that represents each polymode. Each time a polymode is initialized (just like any other mode in Emacs), the root object (pm-config/noweb) is cloned and the new object is stored locally in the buffer under the name pm/config. This is how prototype inheritance works: through cloning. pm/config is shared across all indirect buffers (one indirect buffer per submode). pm/config stores all the necessary data in internal slots whose names start with "-" (like -basemode and -chunkmodes). All communication between indirect buffers happens through this object. Since pm/config is the same in all buffers, there is no need to move or copy stuff around.

Submodes are also represented by objects (for example pm-submode/noweb). Each base and indirect buffer stores a local submode object in pm/submode. There are two types of submodes: basemodes and chunkmodes. The chunkmodes are discovered dynamically (currently by jit-lock) and placed into the -basemode and -chunkmodes slots of pm/config.
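
A minimal sketch of that cloning scheme in eieio terms may help; the class and function names below are illustrative stand-ins, not polymode's actual definitions:

(require 'eieio)

;; Toy stand-in for polymode's config prototype.
(defclass pm-config ()
  ((-basemode   :initform nil)
   (-chunkmodes :initform nil)))

;; Root prototype that users customize.
(defvar pm-config/noweb (make-instance 'pm-config))

;; Buffer-local variable holding this buffer's clone of the root.
(defvar-local pm/config nil)

;; Initializing the polymode clones the root prototype, so
;; buffer-local changes never leak back into it.
(defun my-init-polymode (root)
  (setq pm/config (clone root)))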

Now to your question on selecting the submode for nw files. I guess you mean an interactive command, right? There is no such command, but it would be easy to add. I didn't think of this user pattern.

The user pattern that I had in mind is that for every possible chunk mode, you would need to create a new polymode. Example from poly-R.el:

(require 'poly-noweb)

;; inherit new config object representing noweb+R mode from root
;; pm-config/noweb
(defcustom pm-config/noweb+R
  (clone pm-config/noweb :chunkmode 'pm-chunk/noweb+R)
  "Noweb for R configuration"
  :group 'polymode-configs
  :type 'object)

;; Make new chunk submode from root pm-chunk/noweb. Note that
;; :chunkmode of pm-config/noweb+R is pointing to this object.
(defcustom pm-chunk/noweb+R
  (clone pm-chunk/noweb :mode 'R-mode)
  "Noweb for R"
  :group 'polymode-chunkmodes
  :type 'object)

;; define polymode
(define-polymode poly-noweb+r-mode pm-config/noweb+R)

Now, poly-noweb+r-mode can be used as a standard Emacs mode. So you can activate it with M-x poly-noweb+r-mode, with a "mode:" declaration at the top of your file, or with an explicit file association:

(add-to-list 'auto-mode-alist '("\\.Rnw\\'" . poly-noweb+r-mode))
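
For instance, the "mode:" route is just a standard first-line file variable (Emacs appends "-mode" when resolving the name):

% -*- mode: poly-noweb+r -*-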

The good thing about this design is that users can customize the root object pm-config/noweb as well as the child objects. All noweb children will inherit the customization from the root object. Another good thing is that even low-level things like the chunk header regexp can be modified in children.

In order to set a new chunkmode in .nw files, one would need to set the :chunkmode slot of the pm/config object to point to a pm-chunk/XXX object. I think that might be enough and the mode will be re-initialized automatically, but I am not sure.
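
To make that concrete, a hypothetical command could look like the sketch below (pm/config and the chunkmode slot follow my description above; the function name and pm-chunk/c are invented):

;; Point the buffer's config object at a different chunkmode object,
;; e.g. (my-poly-set-chunkmode pm-chunk/c).  Whether the buffer then
;; re-initializes automatically is the open question above.
(defun my-poly-set-chunkmode (chunk)
  (oset pm/config chunkmode chunk))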

The question is, do we really want such a command? How useful is it? Do you often change the mode of the chunk?

Vitalie


nrnrnr commented 10 years ago

I was a bit slow with the API documentation because the naming conventions were not settled. Some more refactoring and tidying is on the way. Then I will proceed with detailed docs and examples. Hopefully already this week.

Great!

It looks overwhelming without documentation, but the idea is pretty simple. There is a pm-config object that represents each polymode.

OK, basic questions: in the world of ideas, what is a polymode? What is a submode?

Now to your question on selecting the submode for nw files. I guess you mean an interactive command, right?

It will probably be necessary at times (when a file contains multiple modes in code chunks), but my first priority is to be able to set the modes using buffer-local variables. I need not only to set the modes but also to set variables relevant to those modes.

Here's an example that works with the old noweb-mode (and with Dave Love's version):

% -*- mode: Noweb; noweb-code-mode: fundamental-mode; tab-width: 4; c-indent-level: 4; c-basic-offset: 4 ; tex-main-file: book.nw -*-

Here's a similar example that sort of works with ess-noweb-mode:

% -*- mode: ess-noweb; ess-noweb-default-code-mode: c-mode; noweb-code-mode: c-mode; tab-width: 4; c-indent-level: 4; c-basic-offset: 4 ; tex-main-file: book.nw -*-

It only "sort of" works because the value of c-indent-level is not actually set the way it should be.
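
What I would want from polymode is something analogous. Inventing a variable name purely for illustration (no such variable exists in polymode today), that might look like:

% -*- mode: poly-noweb; poly-noweb-code-mode: c-mode; tab-width: 4 -*-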

The user pattern that I had in mind is that for every possible chunk mode, you would need to create a new polymode.

Once I understand what a polymode is, that seems like a reasonable requirement. But I can reuse that polymode with different values of buffer-local variables, right?

Example from poly-R.el:

(require 'poly-noweb)

;; inherit new config object representing noweb+R mode from root
;; pm-config/noweb
(defcustom pm-config/noweb+R
  (clone pm-config/noweb :chunkmode 'pm-chunk/noweb+R)
  "Noweb for R configuration"
  :group 'polymode-configs
  :type 'object)

;; Make new chunk submode from root pm-chunk/noweb. Note that
;; :chunkmode of pm-config/noweb+R is pointing to this object.
(defcustom pm-chunk/noweb+R
  (clone pm-chunk/noweb :mode 'R-mode)
  "Noweb for R"
  :group 'polymode-chunkmodes
  :type 'object)

;; define polymode
(define-polymode poly-noweb+r-mode pm-config/noweb+R)

I'm afraid that I see some details but I do not grasp the big picture. All I'm getting is that a polymode is built from a thing called pm-config/noweb+R, and that there is another thing (pm-chunk/noweb+R) which points to the first thing? What is the purpose of having two things? What is the name of the kind of thing? That is, what kind of thing is pm-chunk/noweb? What about pm-config/noweb?

I went and looked at the code, and they aren't defined by cloning... And I'm having trouble connecting them with the doco at

https://github.com/vitoshka/polymode/tree/master/modes

(FYI, installing Emacs 24 is going to disrupt my system pretty significantly, so I'm not ready to do it until I know I have a few hours to get problems sorted out, and I can actually try out polymode. But what it means for now is that I can't use any of the emacs documentation tools like C-h f.)

Now, poly-noweb+r-mode can be used as a standard Emacs mode. So you can activate it with M-x poly-noweb+r-mode, with a "mode:" declaration at the top of your file, or with an explicit file association...

OK, this part I get.

(add-to-list 'auto-mode-alist '("\\.Rnw\\'" . poly-noweb+r-mode))

The good thing about this design is that users can customize the root object pm-config/noweb as well as the child objects.

I don't really understand the object model, so it's not yet clear to me how to benefit from customization. But I'll take it on faith that it's good.

In order to set a new chunkmode in .nw files, one would need to set the :chunkmode slot of the pm/config object to point to a pm-chunk/XXX object. I think that might be enough and the mode will be re-initialized automatically, but I am not sure.

What about having a buffer-local variable so that I have multiple files using the same poly-noweb-mode, but each individual file has its own mode for code chunks?

The question is, do we really want such a command? How useful is it? Do you often change the mode of the chunk?

I have some files that contain a mix of SML code chunks and Scheme code chunks, or a mix of C code chunks and Scheme code chunks. When I edit such a file it is essential that I be able to set the mode correctly for the chunk I am editing.

(In the glorious future I would love to be able to specify the correct mode for each root chunk and to have that mode propagate to other chunks using noweb's def/use chains, but that's a problem for another time.)

Norman

vspinu commented 10 years ago

Norman Ramsey on Fri, 09 May 2014 11:35:52 -0700 wrote:

[...]

OK, basic questions: in the world of ideas, what is a polymode? What is a submode?

These questions are actually addressed in the dev doc (https://github.com/vitoshka/polymode/tree/master/modes#polymodes-and-configs)

I agree that the docs are not crystal clear at this stage.

Here's an example that works with the old noweb-mode (and with Dave Love's version):

% -*- mode: Noweb; noweb-code-mode: fundamental-mode; tab-width: 4; c-indent-level: 4; c-basic-offset: 4 ; tex-main-file: book.nw -*-

This should eventually work as expected. It wasn't the priority so far.

I'm afraid that I see some details but I do not grasp the big picture. All I'm getting is that a polymode is built from a thing called pm-config/noweb+R, and that there is another thing (pm-chunk/noweb+R) which points to the first thing? What is the purpose of having two things? What is the name of the kind of thing? That is, what kind of thing is pm-chunk/noweb? What about pm-config/noweb?

No, a polymode is not "built from pm-config/noweb+R"; it is represented by an object cloned from pm-config/noweb+R and stored in the pm/config local variable. Most of the methods in polymode-methods.el are then dispatched on this config object. The rest of the methods are dispatched on submode objects that represent the innermodes of the buffer.

The other "things" represent submodes: the base mode (latex) and the chunkmode (R in this case). Some methods dispatch on these submode objects. The config object must know which basemode and which submodes to instantiate when it meets a chunk. This is why they are linked.

BTW, I am thinking of changing "chunkmode" to "innermode", but I am not sure. The idea is that there is always an outermode, which I call the basemode, and within the basemode are chunks of code in another language. This is why I call them chunkmodes. Any ideas on this?

I plan to write a glossary of all the terms used, but don't want to do that till all the names are settled.

I went and looked at the code, and they aren't defined by cloning...

All objects (pm-config, pm-basemode, pm-chunkmode) that are defined at run-time are instantiated through cloning.

(FYI, installing Emacs 24 is going to disrupt my system pretty significantly, so I'm not ready to do it until I know I have a few hours to get problems sorted out, and I can actually try out polymode. But what it means for now is that I can't use any of the emacs documentation tools like C-h f.)

[...]

Emacs 24 brings a lot of new stuff, like eieio and the package manager. The earlier you switch, the better, IMO.

What about having a buffer-local variable so that I have multiple files using the same poly-noweb-mode, but each individual file has its own mode for code chunks?

The question is, do we really want such a command? How useful is it? Do you often change the mode of the chunk?

I have some files that contain a mix of SML code chunks and Scheme code chunks, or a mix of C code chunks and Scheme code chunks. When I edit such a file it is essential that I be able to set the mode correctly for the chunk I am editing.

What is the use of it? How does the weaver recognize the modes? As far as I know noweb is not designed for this user pattern. Thus polymode by design discourages such use by enforcing the pm-config-one class. If you need different modes in noweb chunks then you should extend noweb syntax to specify the mode per chunk like <<name, mode = "sml">>=. Then use the pm-config-multi-auto class instead of pm-config-one to define 'poly-noweb-auto-mode'. This will make noweb similar to how markdown and org-mode work, by detecting the mode of the chunk automatically.
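
For illustration, a regexp along these lines could pick the language out of such an extended header (a sketch only; the header syntax is my proposal above, and the variable name is invented):

;; Sketch: match the proposed header <<name, mode = "sml">>= and
;; capture the language name in group 1.
(defconst my-noweb-auto-head-regexp
  "^<<[^>]*,[ \t]*mode[ \t]*=[ \t]*\"\\([^\"]+\\)\"[ \t]*>>=")

;; Example:
(with-temp-buffer
  (insert "<<parser, mode = \"sml\">>=\n")
  (goto-char (point-min))
  (and (re-search-forward my-noweb-auto-head-regexp nil t)
       (match-string 1)))   ;; => "sml"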

If you think this pattern is common I can easily add such a polymode.

Vitalie

nrnrnr commented 10 years ago

OK, basic questions: in the world of ideas, what is a polymode? What is a submode?

These questions are actually addressed in the dev doc... I agree that the docs are not crystal clear at this stage.

Once I think I understand, I will see if I can help with that.

Here's an example that works with the old noweb-mode (and with Dave Love's version):

% -*- mode: Noweb; noweb-code-mode: fundamental-mode; tab-width: 4; c-indent-level: 4; c-basic-offset: 4 ; tex-main-file: book.nw -*-

This should eventually work as expected. It wasn't the priority so far.

All right. If you want me to do anything, I have to have this support. I'm willing to try to build it, but I will need a sketch to start with.

I'm afraid that I see some details but I do not grasp the big picture. All I'm getting is that a polymode is built from a thing called pm-config/noweb+R, and that there is another thing (pm-chunk/noweb+R) which points to the first thing? What is the purpose of having two things? What is the name of the kind of thing? That is, what kind of thing is pm-chunk/noweb? What about pm-config/noweb?

No, a polymode is not "built from pm-config/noweb+R"; it is represented by an object cloned from pm-config/noweb+R and stored in the pm/config local variable. Most of the methods in polymode-methods.el are then dispatched on this config object.

As a user, I have no idea about these methods. And this isn't the documentation I'm looking for.

I've placed a first cut at draft documentation in the readme.md at

https://github.com/nrnrnr/polymode#high-level-view

which is a fork of yours. Please have a look and tell me what you think.

Also, what should be the name for "the major mode that a polymode mimics?" The documentation needs to talk about this a lot, so it needs a name.

The rest of the methods are dispatched on submode objects that represent the innermodes of the buffer.

I don't understand why there are 'submodes' and why they are distinct from 'polymodes'. Who needs to know about this distinction? Users? All developers? Some developers?

Please note that in the current doco, the introduction of submodes is tautological:

Submodes (basemodes and chunkmodes) are objects that encapsulate functionality of the polymode's submodes.

Maybe 'submode' is a term of art in the emacs world? I'm not finding it in the manual except for a few special cases.

The other "things" represent submodes: the base mode (latex) and the chunkmode (R in this case).

Why should base mode and chunkmode have different status?

BTW, I am thinking of changing "chunkmode" to "innermode", but I am not sure. The idea is that there is always an outermode, which I call the basemode, and within the basemode are chunks of code in another language. This is why I call them chunkmodes. Any ideas on this?

Yes, but I can speak only based on my experience with noweb:

Emacs 24 brings a lot of new stuff, like eieio and the package manager. The earlier you switch, the better, IMO.

Many years of painful experience have taught me that the benefits of maintaining a consistent Debian installation outweigh the benefits of upgrading any one package. I am willing to make an exception if I get real Emacs support for noweb out of the deal. I'm not willing to make an exception on speculation or just because Emacs 24 is better.

I have some files that contain a mix of SML code chunks and Scheme code chunks, or a mix of C code chunks and Scheme code chunks. When I edit such a file it is essential that I be able to set the mode correctly for the chunk I am editing.

What is the use of it?

Programs written in multiple languages.

How does the weaver recognize the modes?

For my applications, it is rarely useful for the weaver to recognize the modes. In the rare cases where it matters, noweb captures some metadata that gives the source locations of various parts; this can be used, for example, to avoid indexing C identifiers in Scheme code.

As far as I know noweb is not designed for this user pattern.

As the author and designer of noweb, I can say definitively that noweb is designed for exactly this user pattern.

Thus polymode by design discourages such use by enforcing the pm-config-one class.

I have no idea what this means.

At any given moment I am certainly willing to pretend that all noweb code chunks should be associated with the same emacs mode (polymode). But I do need to be able to change that mode dynamically.

If you need different modes in noweb chunks then you should extend noweb syntax to specify the mode per chunk like <<name, mode = "sml">>=.

Absolutely not. Whatever mechanism may be used for this purpose, placing the burden on the user (and potentially polluting the chunk names in the output) is not it.

Alternative mechanism: give a regular expression that characterizes the names of all noweb root chunks that are in a given mode. For example, "^[^ \t]*\.[ch]$" might characterize root chunks that should be in c-mode. Mode information can propagate to other chunks by use-def chains. And of course there would need to be a default.
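
A sketch of that mechanism as an alist, with invented names (nothing named my-noweb-root-mode-alist exists anywhere):

;; Illustrative only: map root-chunk names to major modes by regexp;
;; the first matching entry wins, and a catch-all default sits last.
(defvar my-noweb-root-mode-alist
  '(("^[^ \t]*\\.[ch]$" . c-mode)
    ("\\.scm$"          . scheme-mode)
    (""                 . fundamental-mode)))

(defun my-noweb-mode-for-root (name)
  ;; assoc-default tries string-match on each car in turn.
  (assoc-default name my-noweb-root-mode-alist #'string-match))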

Then use the pm-config-multi-auto class instead of pm-config-one to define 'poly-noweb-auto-mode'. This will make noweb similar to how markdown and org-mode work, by detecting the mode of the chunk automatically.

If you think this pattern is common I can easily add such a polymode.

It is not my most urgent need. My most urgent needs remain:

Norman

vspinu commented 10 years ago

Norman Ramsey on Mon, 12 May 2014 10:59:12 -0700 wrote:

[...]

I've placed a first cut at draft documentation in the readme.md at

https://github.com/nrnrnr/polymode#high-level-view

Ok, I see now. You are missing the picture. My bad. There are at least three meanings of "polymode" and "submode": the Emacs function that initializes a mode, Emacs's abstract notion of a mode, and an eieio object that represents that mode. In the current docs these meanings are used interchangeably, and this is the reason for the tautology that you have noticed.

From the user's perspective, Emacs modes and polymodes are functionally the same. This is why submodes, objects, etc. should not be mentioned on the user page.

I will add precise definitions of all the terms to the dev doc once I am settled on their names. I will get back to you when that's done.

Noweb-mode confounds doc chunks and language chunks, both conceptually and at the code level. I think this is wrong. There is always a host (base) language that "contains" other language spans. I intend to use "code span" for what noweb calls chunks and reserve "chunks" for rigorously delimited spans of code in a language that is not the host language.

What is the use of it?

Programs written in multiple languages.

How does the weaver recognize the modes?

For my applications, it is rarely useful for the weaver to recognize the modes. In the rare cases where it matters, noweb captures some metadata that gives the source locations of various parts; this can be used, for example, to avoid indexing C identifiers in Scheme code.

Can you please provide an example of a complex noweb file with multiple languages? What do you mean by root chunk exactly?

I am lost. What applications do you mean concretely?

If you need different modes in noweb chunks then you should extend noweb syntax to specify the mode per chunk like <<name, mode = "sml">>=.

Absolutely not. Whatever mechanism may be used for this purpose, placing the burden on the user (and potentially polluting the chunk names in the output) is not it.

Alternative mechanism: give a regular expression that characterizes the names of all noweb root chunks that are in a given mode. For example, "^[^ \t]*\.[ch]$" might characterize root chunks that should be in c-mode. Mode information can propagate to other chunks by use-def chains. And of course there would need to be a default.

Confounding chunk names with language indicators doesn't look like a good design to me. But I agree, it is indeed parsimonious. Also, how is naming chunks with mode-indicating names and then specifying a regexp not a "burden"?

Chunks in markdown, org-mode, and web-related files always have specific indicators that uniquely identify the language of the chunk. People are used to this clean idea.

That being said, if the noweb specification is extended so that chunk names can identify the language in a standardized way, I will add that to the polymode specs immediately.

Vitalie

vspinu commented 10 years ago

Hi Norman.

The dev doc is ready. I went through several stages of refactoring and settled on parsimonious naming conventions. It also helped clear my own mind. And I acknowledge that the previous mode/polymode/chunkmode/submode/basemode wording was quite a mess.

Thanks for all the input. It was very helpful in setting things straight.

nrnrnr commented 10 years ago

I've placed a first cut at draft documentation in the readme.md at

https://github.com/nrnrnr/polymode#high-level-view

Ok, I see now. You are missing the picture. My bad. There are at least three meanings of "polymode" and "submode": the Emacs function that initializes a mode, Emacs's abstract notion of a mode, and an eieio object that represents that mode. In the current docs these meanings are used interchangeably, and this is the reason for the tautology that you have noticed...

From the user's perspective, Emacs modes and polymodes are functionally the same. This is why submodes, objects, etc. should not be mentioned on the user page.

I will add precise definitions of all the terms to the dev doc once I am settled on their names. I will get back to you when that's done.

For whatever they may be worth, here are a few suggestions:

I'm glad you're thinking hard about names---given that emacs lisp is such a dynamic language, naming becomes extra important.

Noweb-mode confounds doc chunks and language chunks, both conceptually and at the code level.

I'm not sure what you mean by 'confound' here. I think the sense you mean is 'fail to discern the differences between', but I don't think that characterization applies to noweb, which knows the difference between a documentation chunk and a code chunk. (About noweb-mode more specifically I am ignorant.)

I think this is wrong. There is always a host (base) language that "contains" other language spans. I intend to use "code span" for what noweb calls chunks and reserve "chunks" for rigorously delimited spans of code in a language that is not the host language.

On this subject I think I can speak with some authority. In the design of noweb I was extremely careful to craft an abstraction that says "a file is a sequence of chunks that appear in any order." Moreover, there are two kinds of chunks: documentation chunks and code chunks. There is no 'containment' structure and no alternation. For example, multiple documentation chunks can follow one another without any intervening code chunks, and (mutatis mutandis) the same for code chunks. This design decision was one of my best decisions, as it made all the tools (and the documentation) simpler.

I recognize that most users have a mental model of 'containment' in the way they write and sequence their documentation chunks. But none of the noweb tools are aware of or benefit from this mental model.

I hope that for noweb-mode at least, you will reconsider your intention to change the model and the terminology.

(The best paper to read about noweb and its design is at http://www.cs.tufts.edu/~nr/pubs/lpsimp.pdf. The writing makes me cringe today, but it's the best record of what I had in mind.)

For my applications, it is rarely useful for the weaver to recognize the modes. In the rare cases where it matters, noweb captures some metadata that gives the source locations of various parts; this can be used, for example, to avoid indexing C identifiers in Scheme code.

Can you please provide an example of a complex noweb file with multiple languages?

Yes, I have attached one. It is the main source file for the Debian package 'nbibtex'. It is a simple one, with only two languages. The more complex ones I am working on are for a book on programming languages, and I cannot distribute the source files.

What do you mean by root chunk exactly?

I mean the name of a chunk that is intended to be passed to notangle using the -R option. I see that this term does not appear in the man page, but you will find it in the article mentioned above.

I am lost. What applications do you mean concretely?

By 'application' I mean anything that produces code or documentation from a set of noweb files. In addition to the basic tangle and weave applications, I have a bunch of stuff for indexing and cross-reference. None of this stuff works on any kind of mode recognition---instead, I use noweb to relate locations in the noweb file to locations in the derived (output) files.

It may be worth saying that in 25 years, I have never used any of noweb-mode's tools for weaving, tangling, navigating to chunks, and so on. When I need to weave or tangle I do it with C-c C-c make (aka M-x compile make). The one special-purpose command I have used from time to time is narrow-to-chunk.

If you need different modes in noweb chunks then you should extend noweb syntax to specify the mode per chunk like <<name, mode = "sml">>=.

Absolutely not. Whatever mechanism may be used for this purpose, placing the burden on the user (and potentially polluting the chunk names in the output) is not it.

Alternative mechanism: give a regular expression that characterizes the names of all noweb root chunks that are in a given mode. For example, "^[^ \t]*\.[ch]$" might characterize root chunks that should be in c-mode. Mode information can propagate to other chunks by use-def chains. And of course there would need to be a default.

Confounding chunk names with language indicators doesn't look like a good design to me. But I agree, it is indeed parsimonious. Also, how is naming chunks with mode-indicating names and then specifying a regexp not a "burden"?

Two ways:

Chunks in markdown, org-mode, and web-related files always have specific indicators that uniquely identify the language of the chunk. People are used to this clean idea.

I have no objection to this idea. Chunks in noweb are related through chains of definition and use, which link together to form a web. Information such as the mode of chunks or the language in use propagates effortlessly through the web. Literate programmers are used to this clean idea.

That being said, if the noweb specification is extended so that chunk names can identify the language in a standardized way, I will add that to the polymode specs immediately.

I have no intention of specifying a standard for noweb---each author should retain the power to choose conventional or unconventional names in whatever way makes sense for his or her document.

May I propose that for the time being, we table the issue of multiple code modes active simultaneously? As long as I can change the code mode currently in effect, and can initialize it using a buffer-local variable, I can start working with polymode, and then it will be possible to develop a specification and tools incrementally.

Norman

[Attachment: the main noweb source file for the Debian package nbibtex, a literate program whose documentation chunks are LaTeX and whose code chunks are C and Lua. It begins with the local-variable line "% -*- mode: noweb; noweb-code-mode: lua-mode -*-", gives an overview of the nbib tools (nbibtex, nbibfind, nbibmake) and their compatibility with classic BibTeX, and then develops the C code that parses .bib files.]

define match(S) (!strncmp(S, (char *)p, n) && (S)[n] == '\0')

switch(_p) { case 'c' : if (match("comment")) return do_comment; else break; case 'p' : if (match("preamble")) return do_preamble; else break; case 's' : if (match("string")) return do_string; else break; } return (Command)0; } @ %% \webindexsort{database-file commands}{\quad \texttt{comment}} The \texttt{comment} command is implemented for SCRIBE compatibility. It's not really needed because \BibTeX\ treats (flushes) everything not within an entry as a comment anyway. <>= static bool do_comment(Bibreader rdr) { return 1; } @ %% \webindexsort{database-file commands}{\quad \texttt{preamble}} The \texttt{preamble} command lets a user have \TeX\ stuff inserted (by the standard styles, at least) directly into the \texttt{.bbl} file. It is intended primarily for allowing \TeX\ macro definitions used within the bibliography entries (for better sorting, for example). One \texttt{preamble} command per \texttt{.bib} file should suffice.

A \texttt{preamble} command has either braces or parentheses as outer delimiters. Inside is the preamble string, which has the same syntax as a field value: a nonempty list of field tokens separated by [[concat_char]]s. There are three types of field tokens---nonnegative numbers, macro names, and delimited strings.

This module does all the scanning (that's not subcontracted), but the \texttt{.bib}-specific scanning function [[scan_and_push_the_field_value_and_eat_white]] actually stores the value. <>= static bool do_preamble(Bibreader rdr) { ready_tok(rdr); <<scan past opening delimiter and set [[rdr->entry_close]]>> ready_tok(rdr); lua_rawgeti(rdr->L, LUA_REGISTRYINDEX, rdr->preamble); lua_pushnumber(rdr->L, lua_objlen(rdr->L, -1) + 1); if (!scan_and_push_the_field_value(rdr, 0)) return 0; ready_tok(rdr); if (_rdr->cur != rdr->entry_close) LERRFB("Missing '%c' in preamble command", rdr->entry_close); rdr->cur++; lua_settable(rdr->L, -3); luapop(rdr->L, 1); / remove preamble */ return 1; } @ %% \webindexsort{database-file commands}{\quad \texttt{string}} The \texttt{string} command is implemented both for SCRIBE compatibility and for allowing a user: to override a \texttt{.bst}-file \texttt{macro} command, to define one that the \texttt{.bst} file doesn't, or to engage in good, wholesome, typing laziness.

The \texttt{string} command does mostly the same thing as the \texttt{.bst}-file's \texttt{macro} command (but the syntax is different and the \texttt{string} command compresses white space). In fact, later in this program, the term macro'' refers to either a \texttt{.bst}macro'' or a \texttt{.bib} ``string'' (when it's clear from the context that it's not a \texttt{WEB} macro).

A \texttt{string} command has either braces or parentheses as outer delimiters. Inside is the string's name (it must be a legal identifier, and case differences are ignored---all upper-case letters are converted to lower case), then an equals sign, and the string's definition, which has the same syntax as a field value: a nonempty list of field tokens separated by [[concat_char]]s. There are three types of field tokens---nonnegative numbers, macro names, and delimited strings. <>= static bool do_string(Bibreader rdr) { unsigned char _id; int keyindex; ready_tok(rdr); <<scan past opening delimiter and set [[rdr->entry_close]]>> ready_tok(rdr); id = rdr->cur; if (!scan_identifier(rdr, '=', '=', '=')) LERRB("Expected a string name followed by '='"); lower_case(id, rdr->cur); lua_pushlstring(rdr->L, (char )id, rdr->cur - id); keyindex = lua_gettop(rdr->L); ready_tok(rdr); if (_rdr->cur != '=') LERRB("Expected a string name followed by '='"); rdr->cur++; ready_tok(rdr); if (!scan_and_push_the_field_value(rdr, keyindex)) return 0; ready_tok(rdr); if (rdr->cur != rdr->entry_close) LERRFB("Missing '%c' in macro definition", rdr->entry_close); rdr->cur++; lua_getref(rdr->L, rdr->macros); lua_insert(rdr->L, -3); lua_settable(rdr->L, -3); lua_pop(rdr->L, 1); return 1; } @ \subsection{Interface to Lua}

First, we define Lua access to a reader. <>=
static Bibreader checkreader(lua_State L, int index) { return luaL_checkudata(L, index, "bibtex.reader"); } @ The reader's [[__index]] metamethod provides access to the [[entry_line]] and [[preamble]] values as if they were fields of the Lua table.
It also provides access to the [[next]] and [[close]] methods of the reader object. <>= static int reader_meta_index(lua_State
L) { Bibreader rdr = checkreader(L, 1); const char key; if (!lua_isstring(L, 2)) return 0; key = lua_tostring(L, 2); if (!strcmp(key, "next")) lua_pushcfunction(L, next_entry); else if (!strcmp(key, "entry_line")) lua_pushnumber(L, rdr->entry_line); else if (!strcmp(key, "preamble")) lua_rawgeti(L, LUA_REGISTRYINDEX, rdr->preamble); else if (!strcmp(key, "close")) lua_pushcfunction(L, closereader); else lua_pushnil(L); return 1; } @ Here are the functions exported in the [[bibtex]] module: <>= static int openreader(lua_State L); static int next_entry(lua_State L); static int closereader(lua_State L); <>= static const struct luaL_reg bibtexlib [] = { {"open", openreader}, {"close", closereader}, {"next", next_entry}, {NULL, NULL} }; @ \newcommand\nt[1]{\rmfamily{\emph{#1}}} \newcommand\optional[1]{\rmfamily{[}#1\rmfamily{]}}

To create a reader, we call \begin{quote} \texttt{openreader(\nt{filename}, \optional{\nt{macro-table}, \optional{\nt{warn-function}}})} \end{quote}

The warning function will be called in one of the following ways: \begin{itemize} \item warn([["extra field"]], \emph{file}, \emph{line}, \emph{citation-key}, \emph{field-name}, \emph{field-value})

Duplicate definition of a field in a single entry. \item warn([["undefined macro"]], \emph{file}, \emph{line}, \emph{citation-key}, \emph{macro-name})

Use of an undefined macro. \end{itemize} <>=

define INBUF 128 /* initial size of input buffer _/

/_ filename * macro table * warning function -> reader / static int openreader(lua_State L) { const char filename = luaL_checkstring(L, 1); FILE f = fopen(filename, "r"); Bibreader rdr; if (!f) { lua_pushnil(L); lua_pushfstring(L, "Could not open file '%s'", filename); return 2; }

<<set items 2 and 3 on stack to hold macro table and optional warning function>>

rdr = lua_newuserdata(L, sizeof(*rdr)); luaL_getmetatable(L, "bibtex.reader"); lua_setmetatable(L, -2);

rdr->line_num = 0; rdr->buf = rdr->cur = rdr->lim = malloc(INBUF); rdr->bufsize = INBUF; rdr->file = f; rdr->filename = malloc(lua_strlen(L, 1)+1); assert(rdr->filename); strncpy((char *)rdr->filename, filename, lua_strlen(L, 1)+1); rdr->L = L; lua_newtable(L); rdr->preamble = luaL_ref(L, LUA_REGISTRYINDEX); lua_pushvalue(L, 2); rdr->macros = luaL_ref(L, LUA_REGISTRYINDEX); lua_pushvalue(L, 3); rdr->warning = luaL_ref(L, LUA_REGISTRYINDEX); return 1; } @ <<set items 2 and 3 on stack to hold macro table and optional warning function>>= if (lua_type(L, 2) == LUA_TNONE) lua_newtable(L);

if (lua_type(L, 3) == LUA_TNONE) lua_pushnil(L); else if (!lua_isfunction(L, 3)) luaL_error(L, "Warning value to bibtex.open is not a function"); @
Reader method [[next_entry]] takes no parameters. On success it returns a triple (\emph{type}, \emph{key}, \emph{field-table}). On error it returns (\texttt{false}, \emph{message}). On end of file it returns nothing. <>= static int next_entry(lua_State _L) { Bibreader rdr = checkreader(L, 1); if (!rdr->file) luaL_error(L, "Tried to read from closed bibtex.reader"); return get_bib_command_or_entry_and_process(rdr); }
@ Closing a reader recovers its resources; the [[file]] field of a closed reader is [[NULL]]. <>= static int closereader(lua_State L) { Bibreader rdr = checkreader(L, 1); if (!rdr->file) luaLerror(L, "Tried to close closed bibtex.reader"); fclose(rdr->file); rdr->file = NULL; free(rdr->buf); rdr->buf = rdr->cur = rdr->lim = NULL; rdr->bufsize = 0; free((void)rdr->filename); rdr->filename = NULL; rdr->L = NULL; luaL_unref(L, LUA_REGISTRYINDEX, rdr->preamble); rdr->preamble = 0; luaL_unref(L, LUA_REGISTRYINDEX, rdr->warning); rdr->warning = 0; luaL_unref(L, LUA_REGISTRYINDEX, rdr->macros); rdr->macros = 0; return 0; }
@ To help implement the call to the warning function, we have [[warnv]]. If there is no warning function, we return the nubmer of nils specified by [[nres]]. <>= static void warnv(Bibreader rdr, int nres, const char
fmt, ...) { const char *p; va_list vl;

lua_rawgeti(rdr->L, LUA_REGISTRYINDEX, rdr->warning); if (lua_isnil(rdr->L, -1)) { lua_pop(rdr->L, 1); while (nres-- > 0) lua_pushnil(rdr->L); } else { va_start(vl, fmt); for (p = fmt; _p; p++) switch (_p) { case 'f': lua_pushnumber(rdr->L, va_arg(vl, double)); break; case 'd': lua_pushnumber(rdr->L, va_arg(vl, int)); break; case 's': { const char _s = va_arg(vl, char ); if (s == NULL) lua_pushnil(rdr->L); else lua_pushstring(rdr->L, s); break; } default: luaL_error(rdr->L, "invalid parameter type %c", p); } lua_call(rdr->L, p - fmt, nres); va_end(vl); } } @ Here's where the library is initialized. This is the only exported function in the whole file. <>= int luaopen_bibtex (lua_State L) { luaL_newmetatable(L, "bibtex.reader"); lua_pushstring(L, "__index"); lua_pushcfunction(L, reader_metaindex); / pushes the index method _/ luasettable(L, -3); / metatable.__index = metatable /

luaL_register(L, "bibtex", bibtexlib); <<initialize the [[is_id_char]] table>> return 1; } @ In an identifier, we can accept any printing character except the ones listed in the [[nonids]] string. <<initialize the [[is_id_char]] table>>= { unsigned c; static unsigned char nonids = (unsigned char )"\"#%'(),={} \t\n\f"; unsigned char *p;

for (c = 0; c <= 0377; c++) is_id_char[c] = 1; for (c = 0; c <= 037; c++) is_id_char[c] = 0; for (p = nonids; _p; p++) is_id_char[_p] = 0; } @ \subsection{Main function for the nbib commands}

This code will is the standalone main function for all the nbib commands. \nextchunklabel{c-main} <>=

include

include

include

include

include

extern int luaopen_bibtex(lua_State L); extern int luaopen_boyer_moore (lua_State L);

int main (int argc, char _argv[]) { int i, rc; lua_State *L = luaLnewstate(); static const char files[] = { SHARE "/bibtex.lua", SHARE "/natbib.nbs" };

define OPEN(N) luapushcfunction(L, luaopen ## N); lua_call(L, 0, 0)

OPEN(base); OPEN(table); OPEN(io); OPEN(package); OPEN(string); OPEN(bibtex); OPEN(boyer_moore);

for (i = 0; i < sizeof(files)/sizeof(files[0]); i++) { if (luaL_dofile(L, files[i])) { fprintf(stderr, "%s: error loading configuration file %s\n", argv[0], files[i]); exit(2); } } lua_pushstring(L, "bibtex"); lua_gettable(L, LUA_GLOBALSINDEX); lua_pushstring(L, "main"); lua_gettable(L, -2); lua_newtable(L); for (i = 0; i < argc; i++) { lua_pushnumber(L, i); lua_pushstring(L, argv[i]); lua_settable(L, -3); } rc = lua_pcall(L, 1, 0, 0); if (rc) { fprintf(stderr, "Call failed: %s\n", lua_tostring(L, -1)); lua_pop(L, 1); } lua_close(L); return rc; } @ \section{Implementation of \texttt{nbibtex}}

From here out, everything is written in Lua (\url{http://www.lua.org}). The main module is [[bibtex]], and style-file support is in the submodule [[bibtex.bst]]. Each has a [[doc]] submodule, which is intended as machine-readable documentation. <>= <<if not already present, load the C code for the [[bibtex]] module>>

local config = config or { } --- may be defined by config process

local workaround = { badbibs = true, --- don't look at bad .bib files that come with teTeX } local bst = { } bibtex.bst = bst

bibtex.doc = { } bibtex.bst.doc = { }

bibtex.doc.bst = '# table of functions used to write style files' @ Not much code is executed during startup, so the main issue is to manage declaration before use. I~have a few forward declarations in [[<>]]; otherwise, count only on utility'' functions being declared beforeexported'' ones. <>= local find = string.find <> <> <> <>

return bibtex @ The Lua code relies on the C~code. How we get the C~code depends on how \texttt{bibtex.lua} is used; there are two alternatives: \begin{itemize} \item In the distribution, \texttt{bibtex.lua} is loaded by the C~code in chunk~\subpageref{c-main}, which defines the [[bibtex]] module. \item For standalone testing purposes, \texttt{bibtex.lua} can be loaded directly into an interactive Lua interpreter, in which case it loads the [[bibtex]] module as a shared library. \end{itemize} <<if not already present, load the C code for the [[bibtex]] module>>= if not bibtex then local nbib = require 'nbib-bibtex' bibtex = nbib end @ \subsection{Error handling, warning messages, and logging} <>= local function printf (...) return io.stdout:write(string.format(...)) end local function eprintf(...) return io.stderr:write(string.format(...)) end @ I have to figure out what to do about errors --- the current code is bogus. Among other things, I should be setting error levels. <>= local function bibwarnf (...) eprintf(...); eprintf('\n') end local function biberrorf(...) eprintf(...); eprintf('\n') end local function bibfatalf(...) eprintf(...); eprintf('\n'); os.exit(2) end @ Logging? What logging? <>= local function logf() end @ \subsubsection{Support for delayed warnings}

Like classic \bibtex, \nbibtex\ typically warns only about entries that are actually used. This functionality is implemented by function [[hold_warning]], which keeps warnings on ice until they are either returned by [[held_warnings]] or thrown away by [[drop_warning]]. The function [[emit_warning]] emits a warning message eagerly when called; it is used to issue warnings about entries we actually use, or if the [[-strict]] option is given, to issue every warning. <>= local hold_warning -- function suitable to pass to bibtex.open; holds local emit_warning -- function suitable to pass to bibtex.open; prints local held_warnings -- returns nil or list of warnings since last call local drop_warnings -- drops warnings

local extra_ok = { reffrom = true } -- set of fields about which we should not warn of duplicates

do local warnfuns = { } warnfuns["extra field"] = function(file, line, cite, field, newvalue) if not extra_ok[field] then bibwarnf("Warning--I'm ignoring %s's extra \"%s\" field\n--line %d of file %s\n", cite, field, line, file) end end

warnfuns["undefined macro"] = function(file, line, cite, macro) bibwarnf("Warning--string name \"%s\" is undefined\n--line %d of file %s\n", macro, line, file) end

function emit_warning(tag, ...) return assert(warnfuns[tag])(...) end

local held function hold_warning(...) held = held or { } table.insert(held, { ... }) end function held_warnings() local h = held held = nil return h end function drop_warnings() held = nil end end @ \subsection{Miscellany} All this stuff is dubious. <>= function table.copy(t) local u = { } for k, v in pairs(t) do u[k] = v end return u end @ <>= local function open(f, m, what) local f, msg = io.open(f, m) if f then return f else (what or bibfatalf)('Could not open file %s: %s', f, msg) end end @ <>= local function entries(rdr, empty) assert(not empty) return function() return rdr:next() end end

bibtex.entries = entries bibtex.doc.entries = 'reader -> iterator # generate entries' @ \subsection{Internal documentation}

We attempt to document everything! <>= function bibtex:showdoc(title) local out = bst.writer(io.stdout, 5) local function outf(...) return out:write(string.format(...)) end local allkeys, dkeys = { }, { } for k, in pairs(self) do table.insert(allkeys, k) end for k, _ in pairs(self.doc) do table.insert(dkeys, k) end table.sort(allkeys) table.sort(dkeys) for i = 1, table.getn(dkeys) do outf("%s.%-12s : %s\n", title, dkeys[i], self.doc[dkeys[i]]) end local header for i = 1, table.getn(allkeys) do local k = allkeys[i] if k ~= "doc" and k ~= "show_doc" and not self.doc[k] then if not header then outf('Undocumented keys in table %s:', title) header = true end outf(' %s', k) end end if header then outf('\n') end end bibtex.bst.show_doc = bibtex.show_doc @
Here is the documentation for what's defined in C~code: <>= bibtex.doc.open = 'filename -> reader # open a reader for a .bib file' bibtex.doc.close = 'reader -> unit # close open reader' bibtex.doc.next = 'reader -> type * key * field table # read an entry' @ \subsection{Main function for \texttt{nbibtex}}

Actually, the same main function does for both \texttt{nbibtex} and \texttt{nbibfind}; depending on how the program is called, it delegates to [[bibtex.bibtex]] or [[bibtex.run_find]]. <>= bibtex.doc.main = 'string list -> unit # main program that dispatches on argv[0]' function bibtex.main(argv) if argv[1] == '-doc' then -- undocumented internal doco bibtex:show_doc('bibtex') bibtex.bst:show_doc('bst') elseif find(argv[0], 'bibfind$') then return bibtex.run_find(argv) elseif find(argv[0], 'bibtex$') then return bibtex.bibtex(argv) else error("Call me something ending in 'bibtex' or 'bibfind'; when called\n ".. argv[0]..", I don't know what to do") end end @ <>= local permissive = false -- nbibtex extension (ignore missing .bib files, etc.) local strict = false -- complain eagerly about errors in .bib files local min_crossrefs = 2 -- how many crossref's required to add an entry? local output_name = nil -- output file if not default local bib_out = false -- output .bib format

bibtex.doc.bibtex = 'string list -> unit # main program for nbibtex' function bibtex.bibtex(argv) <<set bibtex options from [[argv]]>> if table.getn(argv) < 1 then bibfatalf('Usage: %s [-permissive|-strict|...] filename[.aux] [bibfile...]', argv[0]) end local auxname = table.remove(argv, 1) local basename = string.gsub(string.gsub(auxname, '%.aux$', ''), '%.$', '') auxname = basename .. '.aux' local bblname = output_name or (basename .. '.bbl') local blgname = basename .. (output_name and '.nlg' or '.blg') local blg = open(blgname, 'w')

-- Here's what we accumulate by reading .aux files: local bibstyle -- the bibliography style local bibfiles = { } -- list of files named in order of file local citekeys = { } -- list of citation keys from .aux -- (in order seen, mixed case, no duplicates) local citedstar = false -- .tex contains \cite{} or \nocite{_}

<<using file [[auxname]], set [[bibstyle]], [[citekeys]], and [[bibfiles]]>>

if table.getn(argv) > 0 then -- override the bibfiles listed in the .aux file bibfiles = argv end <<validate contents of [[bibstyle]], [[citekeys]], and [[bibfiles]]>> <<from [[bibstyle]], [[citekeys]], and [[bibfiles]], compute and emit the list of entries>> blg:close() end @ Options are straightforward. <<set bibtex options from [[argv]]>>= while table.getn(argv) > 0 and find(argv[1], '^%-') do if argv[1] == '-terse' then -- do nothing elseif argv[1] == '-permissive' then permissive = true elseif argv[1] == '-strict' then strict = true elseif argv[1] == '-min-crossrefs' and find(argv[2], '^%d+$') then mincrossrefs = assert(tonumber(argv[2])) table.remove(argv, 1) elseif string.find(argv[1], '^%-min%-crossrefs=(%d+)$') then local , _, n = string.find(argv[1], '^%-min%-crossrefs=(%d+)$') min_crossrefs = assert(tonumber(n)) elseif string.find(argv[1], '^%-min%-crossrefs') then biberrorf("Ill-formed option %s", argv[1]) elseif argv[1] == '-o' then output_name = assert(argv[2]) table.remove(argv, 1) elseif argv[1] == '-bib' then bib_out = true elseif argv[1] == '-help' then help() elseif argv[1] == '-version' then printf("nbibtex version \n") os.exit(0) else biberrorf('Unknown option %s', argv[1]) help(2) end table.remove(argv, 1) end @ <>= local function help(code) printf([[ Usage: nbibtex [OPTION]... AUXFILE[.aux] [BIBFILE...] Write bibliography for entries in AUXFILE to AUXFILE.bbl.

Options: -bib write output as BibTeX source -help display this help and exit -o FILE write output to FILE (- for stdout) -min-crossrefs=NUMBER include item after NUMBER cross-refs; default 2 -permissive allow missing bibfiles and (some) duplicate entries -strict complain about any ill-formed entry we see -version output version information and exit

Home page at http://www.eecs.harvard.edu/~nr/nbibtex. Email bug reports to nr@eecs.harvard.edu. ]]) os.exit(code or 0) end @ \subsection{Reading all the aux files and validating the inputs}

We pay attention to four commands: [[\@input]], [[\bibdata]], [[\bibstyle]], and [[\citation]]. <<using file [[auxname]], set [[bibstyle]], [[citekeys]], and [[bibfiles]]>>= do local commands = { } -- table of commands we recognize in .aux files local function do_nothing() end -- default for unrecognized commands setmetatable(commands, { __index = function() return do_nothing end }) <<functions for commands found in .aux files>> commands'@input' -- reads all the variables end @ <<functions for commands found in .aux files>>= do local auxopened = { } --- map filename to true/false

commands['@input'] = function (auxname) if not find(auxname, '%.aux$') then bibwarnf('Name of auxfile "%s" does not end in .aux\n', auxname) end <<mark [[auxname]] as opened (but fail if opened already)>> local aux = open(auxname, 'r') logf('Top-level aux file: %s\n', auxname) for line in aux:lines() do local , , cmd, arg = find(line, '^([%a%@]+)%s{([^%}]+)}%s$') if cmd then commandscmd end end aux:close() end end <<mark [[auxname]] as opened (but fail if opened already)>>= if auxopened[auxname] then error("File " .. auxname .. " cyclically \@input's itself") else auxopened[auxname] = true end @ \bibtex\ expects \texttt{.bib} files to be separated by commas. They are forced to lower case, should have no spaces in them, and the [[\bibdata]] command should appear exactly once. <<functions for commands found in .aux files>>= do local bibdata_seen = false

function commands.bibdata(arg) assert(not bibdata_seen, [[LaTeX provides multiple \bibdata commands]]) bibdata_seen = true for bib in string.gmatch(arg, '[^,]+') do assert(not find(bib, '%s'), 'bibname from LaTeX contains whitespace') table.insert(bibfiles, string.lower(bib)) end end end @ The style should be unique, and it should be known to us. <<functions for commands found in .aux files>>= function commands.bibstyle(stylename) if bibstyle then biberrorf('Illegal, another \bibstyle command') else bibstyle = bibtex.style(string.lower(stylename)) if not bibstyle then bibfatalf('There is no nbibtex style called "%s"') end end end @
We accumulated cited keys in [[citekeys]]. Keys may be duplicated, but the input should not contain two keys that differ only in case. <<functions for commands found in .aux files>>= do local keys_seen, lower_seen = { }, { } -- which keys have been seen already

function commands.citation(arg) for key in string.gmatch(arg, '[^,]+') do assert(not find(key, '%s'), 'Citation key {' .. key .. '} from LaTeX contains whitespace') if key == '*' then cited_star = true elseif not keys_seen[key] then --- duplicates are OK keys_seen[key] = true local low = string.lower(key) <<if another key with same lowercase, complain bitterly>> if not cited_star then -- no more insertions after the star table.insert(citekeys, key) -- must be key, not low, -- so that keys in .bbl match .aux end end end end end @ <<if another key with same lowercase, complain bitterly>>= if lower_seen[low] then biberrorf("Citation key '%s' inconsistent with earlier key '%s'", key, lower_seen[low]) else lower_seen[low] = key end @ After reading the variables, we do a little validation. I~can't seem to make up my mind what should be done incrementally while things are being read. <<validate contents of [[bibstyle]], [[citekeys]], and [[bibfiles]]>>= if not bibstyle then bibfatalf('No \bibliographystyle in original LaTeX') end

if table.getn(bibfiles) == 0 then bibfatalf('No .bib files specified --- no \bibliography in original LaTeX?') end

if table.getn(citekeys) == 0 and not cited_star then biberrorf('No citations in document --- empty bibliography') end

do --- check for duplicate bib entries local i = 1 local seen = { } while i <= table.getn(bibfiles) do local bib = bibfiles[i] if seen[bib] then bibwarnf('Multiple references to bibfile "%s"', bib) table.remove(bibfiles, i) else i = i + 1 end end end
@ \subsection{Reading the entries from all the \bibtex\ files}

These are diagnostics that might be written to a log. <<from [[bibstyle]], [[citekeys]], an

nrnrnr commented 10 years ago

Yuck. I see that GitHub didn't quite know what to do with the attachment. I've pushed it to https://github.com/nrnrnr/polymode/blob/master/multiple-mode-samples/nbib.nw.

Stats on that file:

Also, when I talked about regexps, I was a bit confused. What I really want is something like the auto-mode-alist variable, only on a buffer-local basis. So, for example, in the sample file I could have something like this (a sketch of such a lookup follows the examples):

(("\.c$" . c-mode)
 ("\.lua$" . lua-mode))

Other documents might use quite different conventions; for example,

(("^transcript$" . uscheme-transcript-mode)
 ("additions to the initial basis of .uscheme" . scheme-mode))

and so on.
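
Here is a minimal sketch of the buffer-local lookup I mean. Everything prefixed my- is an invented name for illustration; assoc-default and defvar-local are standard Emacs Lisp, and the lookup mirrors the way auto-mode-alist itself is consulted.

;; Sketch only: a buffer-local analogue of auto-mode-alist, keyed on
;; chunk names rather than file names.  The `my-' names are invented.
(defvar-local my-noweb-chunk-mode-alist nil
  "Buffer-local alist of (REGEXP . MODE) tried against chunk names.")

(defun my-noweb-chunk-mode (chunk-name)
  "Return the major mode whose REGEXP matches CHUNK-NAME, or nil.
Consults the buffer-local `my-noweb-chunk-mode-alist', the same way
`auto-mode-alist' is consulted for file names."
  (assoc-default chunk-name my-noweb-chunk-mode-alist #'string-match))

For the sample file above, the alist would be set — perhaps via file-local variables — to the two pairs shown in the first example.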

nrnrnr commented 10 years ago

Hi Norman.

The dev doc is ready. I have gone through several stages of refactoring and settled on parsimonious naming conventions. It also helped clear my own mind. And I acknowledge that the previous mode/polymode/chunkmode/submode/basemode etc. wording was quite a mess.

I had a quick look. How would you feel about my making an editing pass over these docs? For example, even for the devs, I think it would be helpful to begin with a short statement of the problem that polymode is intended to solve. Then, for example, each of the terms in the glossary could be related to that problem.

Norman

vspinu commented 10 years ago

I am not exceptionally good with words, so I would appreciate any improvements.

What "problem" do you have in mind except the obvious one of having multiple emacs modes in the same buffer?

I would prefer to keep the docs short and to the point. The docs are already longer than I would like them to be. Interested people should go to the code and examples to figure out the rest.

Vitalie

Norman Ramsey on Fri, 30 May 2014 11:19:09 -0700 wrote:

Hi Norman.

The dev doc is ready. I have gone through several stages of refactoring and settled on parsimonious naming conventions. It also helped clear my own mind. And I acknowledge that the previous mode/polymode/chunkmode/submode/basemode etc. wording was quite a mess.

I had a quick look. How would you feel about my making an editing pass over these docs? For example, even for the devs, I think it would be helpful to begin with a short statement of the problem that polymode is intended to solve. Then, for example, each of the terms in the glossary could be related to that problem.

Norman


nrnrnr commented 10 years ago

I am not exceptionally good with words, so I would appreciate any improvements.

What "problem" do you have in mind except the obvious one of having multiple emacs modes in the same buffer?

That's the one, with support for syntax highlighting &c.

I would prefer to keep the docs short and to the point. The docs are already longer than I would like them to be. Interested people should go to the code and examples to figure out the rest.

I'm happy with that plan.

Norman

vspinu commented 6 years ago

Hi Norman,

Your example works well in my tests. Automatic chunk-mode detection is there; you can now do it in a number of different ways (see poly-noweb). The buffer-local variable for the default mode is also there. The general docs have been improved, and the technical docs will follow once the dust of the rewrite has settled.
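
Concretely, a minimal sketch of that setup: poly-noweb-mode is provided by the poly-noweb package, while the variable name polymode-default-inner-mode is an assumption here and should be checked against the current polymode docs.

;; In the Emacs init file: enable polymode's noweb support for .nw files.
(require 'poly-noweb)
(add-to-list 'auto-mode-alist '("\\.nw\\'" . poly-noweb-mode))

Then, at the bottom of a given .nw file, a file-local default for chunks whose mode is not otherwise detected (again, the variable name is assumed, not verified):

% Local Variables:
% polymode-default-inner-mode: lua-mode
% End: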

I am closing this one as we will be removing ess-noweb from ESS immediately after the next release later this month.