Hi David,
Thanks for the feedback. Having a build system like that would indeed be great for Pelican and would drastically reduce the time needed to build the blog. Having a cache of article → metadata + HTML (the piece of work currently done by the readers) seems to make it possible to avoid redoing that work each time.
As for the other targets you are proposing, I'm not sure I see how they could be used here. Do you have a more in-depth idea of how to handle that? For instance, what are you referring to with the pelican-template target?
Also, about the approach taken, I'm not sure that building this to allow interaction with external build systems is entirely a good idea, because of the complexity it could add to the code. I don't know much about the existing integrated build mechanisms, but an integrated approach seems better suited to me, because we would not need to split Pelican's logic as much. Of course, if it's not already split enough, we could change that to make the build process faster, but I want to be sure that's actually needed first, and exposing these steps to the world seems… maybe a bit too much (but I'm open to being convinced otherwise).
Thanks for opening this, Alex
Hello, thank you for your feedback. I did not dive very deep into the Pelican code, which is why my suggestion for splitting the pelican executable may lack consistency. Maybe I will inspect the code more thoroughly and send you a patch. It's just a general direction, without a specific goal.
The main thing would be caching the already-written articles. In particular, everything that is in the <aside> markup needs to be regenerated each time. That is what I would call metadata, and apparently it is the same for every article in the output/ folder.
So one could split the execution of pelican into three steps:
- gathering the metadata (currently, everything that is in the <aside> markup, but it could be somewhere else: the things that are shared between articles)
- …
- writing everything to the output/ directory

Okay, metadata is something different. See the readers.py file: we are parsing the content files (Markdown or reStructuredText) to HTML and extracting some metadata at the same time.
@ametaireau
Have you looked at doit? I haven't myself, but a project similar to Pelican uses it to avoid the same issue of rebuilding everything.
@tshepang any link?
What do you think of using SQLite or a flat-text database to store the articles' hashes?
Well, do we really need anything other than Python's pickle module? Adding a db as a dependency sounds… not like something Pelican's users would like ;)
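Just to illustrate, a minimal sketch of that idea, assuming we keep a pickled dict of {source path: content hash} on disk (the cache file name and helper names below are made up for illustration, this is not Pelican code):

```python
import hashlib
import os
import pickle

CACHE_FILE = "article_hashes.pickle"  # hypothetical cache location

def load_hashes():
    """Load the previously stored {path: hash} dict, if any."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)
    return {}

def changed_articles(paths):
    """Return the source files whose content hash differs from the stored one."""
    old = load_hashes()
    new, changed = {}, []
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        new[path] = digest
        if old.get(path) != digest:
            changed.append(path)
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(new, f)
    return changed
```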
@ametaireau agreed
OK, I didn't know that a Python db module like that existed - so basically: agreed.
That's not a db, it's a way to serialize/deserialize objects to/from disk.
So, do you plan to work on that? :-)
Yes, I'd like to work on that. I will start a new branch with a smart build system.
any news?
My lack of understanding of git is holding me back. And you? You just reminded me that I need to work on it somehow.
You don't need to understand much about git: just get the code with "git clone" and then hack on it. If you prefer, you can then send me the output of "git diff" so I can review your changes.
Hi ametaireau, thanks for your feedback.
I am currently hacking ... I'll keep you posted on this. Thanks for your encouragement.
awesome, thanks!
So you suggest we build some kind of 'dependency graph'? Well, we could easily check whether we need to generate a file based on the file in the output folder: if it doesn't exist or is older than the source, we should generate it.
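A tiny sketch of that check, with hypothetical paths:

```python
import os

def needs_rebuild(source_path, output_path):
    """Regenerate if the output file is missing or older than its source."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(output_path)
```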
We can also store the template the page is using as a dependency, along with the templates those templates include, recursively; that's very simple with Jinja2. So if the template is updated, we know we need to rebuild the file.
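For what it's worth, a rough sketch of collecting those template dependencies with Jinja2's meta module (the theme path and template name are just examples, not Pelican's actual layout):

```python
from jinja2 import Environment, FileSystemLoader, meta

def template_dependencies(env, name, seen=None):
    """Collect the templates that `name` extends or includes, recursively."""
    if seen is None:
        seen = set()
    source = env.loader.get_source(env, name)[0]
    ast = env.parse(source)
    for ref in meta.find_referenced_templates(ast):
        if ref is not None and ref not in seen:  # ref is None for dynamic includes
            seen.add(ref)
            template_dependencies(env, ref, seen)
    return seen

env = Environment(loader=FileSystemLoader("theme/templates"))
print(template_dependencies(env, "article.html"))
```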
Oh, and we also need to generate the whole site if the settings changed. We could store the graph in a pickle or something.
But it still won't solve the issue of generating a lot of pages when something common to all of them is updated. Let's say I want a list on the right with the 5 most recent posts, or a tag cloud. Even if we have all the metadata cached, we're still going to regenerate the whole site. Or if we want to work on a new theme, every change will require us to rebuild all the posts.
So a dependency graph could be a small improvement, but it won't solve the problem when you change something that appears in most, if not all, of the pages; it won't be a big time-saver on large sites.
Other static site generators suffer from the same problem. We basically went from generating each page dynamically on request on the server to generating THE WHOLE SITE in advance. For small sites that is better, but if you search around you'll see a lot of people moving away from static blogs because of the long build times once they have a lot of posts.
TL;DR: we can implement a dependency graph rather easily, but it still won't solve the issue of regenerating the whole site on variable or template changes.
As said, I think we need to:
If we want to use a dependency graph to only regenerate the files that changed, then we could do that by listing all the parameters that can change on a page. The thing is that those are not clearly listed anywhere; for instance, templates can use anything that's in the settings or generated dynamically (such as the last generation date).
I don't really think we need any kind of dependency graph, but rather a way to avoid reading all the files each time. Then it's not really a problem if we regenerate all of them: the reading/parsing is what takes time, not writing the output.
If I remember correctly, that's what Sphinx does, for instance.
We could maybe pickle each page as we generate it. Maybe even store only the metadata, not the compiled result. Then, instead of reading the source files, we'd check whether the file is newer than its pickle; if so, we need to parse and read it, otherwise we read the pickle/metadata file and use that.
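A minimal sketch of that idea, assuming one pickle per source file holding its parsed metadata (parse_metadata() below is a hypothetical stand-in for whatever the readers produce):

```python
import os
import pickle

def get_metadata(src_path, cache_dir=".cache"):
    """Return cached metadata for src_path, re-parsing only if the source is newer."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, os.path.basename(src_path) + ".pickle")
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(src_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)        # source unchanged: reuse the cached metadata
    metadata = parse_metadata(src_path)  # hypothetical: run the reader on the file
    with open(cache_path, "wb") as f:
        pickle.dump(metadata, f)
    return metadata
```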
But even if we cut down on read time, whenever we need to compile a page we still have to run it through Jinja2 and write the file. That's why I am suggesting we store the dependencies of each file, so we won't have to regenerate them all.
Then we have the case where a user adds a new page, for example, and we need to rebuild the whole site. The only way to discover which files would need to change based on a variable would be to store the variables used in the template (recursively) and then check whether those changed (i.e. a new post was added since, or a menu link was added in the settings). But then we would need a way for plugins and custom tags to report that they possibly change dynamically, and this is too much of a headache.
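For the template-variables part at least, Jinja2 can already report what a single template references; a short sketch (the template name and path are just examples), keeping in mind that extends/includes would still have to be walked recursively:

```python
from jinja2 import Environment, FileSystemLoader, meta

env = Environment(loader=FileSystemLoader("theme/templates"))
source = env.loader.get_source(env, "index.html")[0]
ast = env.parse(source)
# Set of variable names this one template expects from the context,
# e.g. {'articles', 'SITENAME', ...}; included templates are not covered.
print(meta.find_undeclared_variables(ast))
```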
I would just make incremental building the default, add an argument to regenerate everything, and document this use case (e.g. the 5-recent-posts list).
Well, I just wanted to say that I won't be available for a few months. Sorry about that. I see that people are interested in this, which sounds good.
Closing for now, since we haven't had any news. Still, having a build system is a good idea.
I notice this issue is 6 years old. I'm just curious whether we have implemented any smart build system by now, so that I don't have to build every post every time. If so, can anyone point me to any configuration I need to do? Right now, it takes about 1 minute to build everything for me. Thanks!
@xxks-kkk: Assuming you are only making changes to a single page/post, you can generate just that page/post. Instructions for doing so are in the documentation.
@justinmayer Thanks for the info. I tried the following command
pelican --write-selected /Users/zeyuan/Documents/projects/linuxjedi.co.uk/output/posts/2018/Apr/05/on-writereadspace-amplification/index.html
and it still triggers the whole build process. Am I looking at the wrong command?
Thanks!
That should be the one. If you need further help, you might consider reaching out via the Pelican IRC channel, as noted in the How to Get Help section of the documentation.
Hello,
I ran into Pelican and I wanted to improve the build system. The way it is done now is to generate everything from the whole set of source files. I think this needs to be reworked/upgraded.
Why? Because the build system is not modular at all. The first problem is the need to parse every source file again (md, rst, or whatever) when only one has been added or removed.
The other problem I see is the 'tag' cloud. To build the 'tag' cloud and the 'category' cloud, we need to run across all the existing files. From there you end up building every article in your blog engine, then the feeds, etc.
If I take the example of a standard Makefile for a C project, every single source file produces a .o file, and then the '.o' objects are gathered into the executable. This is slightly different in Pelican, since here the '.o' objects (the '.html' files) are all interdependent: they rely on the list of tags and categories (what we call metadata) to be built properly, which does not make the build process very modular.
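To make the analogy concrete, a rough sketch (hypothetical, not Pelican code) of a "compile then link" split: phase one turns each source file into a small per-article 'object' (metadata + HTML), and phase two gathers the shared metadata, such as the tag list, that every page depends on. parse_source() is a made-up stand-in for the readers.

```python
import json
import os

def compile_article(src_path, obj_dir):
    """Phase 1: turn one source file into a per-article 'object file' (like a .o)."""
    obj_path = os.path.join(obj_dir, os.path.basename(src_path) + ".json")
    # Like make, only recompile when the source is newer than the object file.
    if (not os.path.exists(obj_path)
            or os.path.getmtime(src_path) > os.path.getmtime(obj_path)):
        article = parse_source(src_path)  # hypothetical reader: {'title', 'tags', 'html', ...}
        with open(obj_path, "w") as f:
            json.dump(article, f)
    with open(obj_path) as f:
        return json.load(f)

def link_site(articles):
    """Phase 2: gather the global metadata (tags) shared by every page."""
    tags = {}
    for article in articles:
        for tag in article.get("tags", []):
            tags.setdefault(tag, []).append(article["title"])
    return tags
```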
I don't see exactly how to bring more modularity into the build process, but I think that some fairly drastic changes in the Pelican code could allow the use of either an internal or an external build system, and would make the Pelican app more amenable to speed optimization.
One could think of a way to split the build process into several targets:
In some ways, this would avoid code duplication and it would be cleaner for the end user. Then one could more easily use an efficient build system, with e.g. make or an internal script. Not to mention the advantages for code quality and readability.
This could answer issues #310 and #224, and maybe others too, I guess, through the improvement of code quality. What do you think of it? I am ready to take on some of the development myself, but I would be glad to hear your opinion on the matter.
David