getpelican / pelican

Static site generator that supports Markdown and reST syntax. Powered by Python.
https://getpelican.com
GNU Affero General Public License v3.0

Pelican : build system #324

Closed drake-mer closed 12 years ago

drake-mer commented 12 years ago

Hello,

I ran into Pelican and wanted to improve the build system. The way it works now, everything is generated from the whole set of source files. I think this needs to be reworked.

Why? Because the build system is not modular at all. The first problem is the need to re-parse every source file (md, rst, or whatever) when only one has been added or removed.

The problem I see is with the 'tag' cloud. To build the 'tag' cloud and the 'category' cloud, we need to run across all the existing files. From there you go on to build every article in your blog engine, then the feeds, etc.

If I take the example of a standard Makefile for a C project: every source file compiles to a '.o' file, and the '.o' objects are then linked into the executable. Pelican is slightly different, since here the '.o' objects (the '.html' files) are all interdependent: they rely on the list of tags and categories (let's call them metadata) to be built properly, which does not make the build process very modular.

I don't see exactly how to bring more modularity into the build process, but I think that some significant changes to the Pelican code could allow using either an internal or an external build system, and would make the Pelican app more amenable to speed optimization.

One could think of a way to split the build process into several targets:

In some way, this would avoid code duplication and be cleaner for the end user. One could then more easily use an efficient build system, e.g. make or an internal script, not to mention the advantages for code quality and readability.

This could answer issues #310 and #224, and perhaps others too, through improvement of the code quality. What do you think? I am ready to take on some of the development myself, but I would be glad to hear your opinion on the matter.

David

almet commented 12 years ago

Hi David,

Thanks for the feedback. Having a build system like that would indeed be great for Pelican and would drastically reduce the time needed to build the blog. Having a cache of article → metadata + HTML (the piece of work currently done by the readers) seems to make it possible to avoid redoing this each time.

As for the other targets you are proposing, I'm not sure I get how they would be used here. Do you have a more in-depth idea of how to handle that? For instance, what are you referring to with the pelican-template target?

Also, about the approach taken: I'm not sure that building this to allow interaction with external build systems is a good idea, because of the complexity it could add to the code. I don't know much about existing integrated build mechanisms, but an internal one seems better adapted to me, because we would not need to split Pelican's logic up too much. Of course, if it's not already split enough, we could change that to make the build process faster, but I want to be sure that's something we actually need first, and exposing these steps to the world seems… maybe a bit too much (but I'm open to being convinced of the contrary).

Thanks for opening this, Alex

drake-mer commented 12 years ago

Hello, thank you for your feedback. I have not dived very far into the Pelican code, which is why my suggestion of splitting the pelican executable may lack consistency. I am going to inspect the code more thoroughly and maybe send you a patch. This was just a general direction, without a specific goal.

The main piece would be caching the already-written articles. In particular, everything inside the <aside> markup needs to be regenerated each time. That is what I would call metadata, and apparently it is the same for every article in the output/ folder.

So one could split the execution of pelican into three steps:

  1. Cache articles. Leave unchanged articles as they are. Delete cached articles whose source has been removed.
  2. Regenerate the so-called metadata (inside the <aside> markup, though anything shared between articles could live elsewhere).
  3. Something like a pelican-publish command that gathers the metadata (maybe feeds have to be included in the metadata?) and the content into the output/ directory.

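
The three steps above could be sketched roughly like this; everything here is hypothetical (the cache layout, the `tags.txt` output), just to illustrate the split:

```python
from pathlib import Path

def build(source_dir, output_dir, cache):
    """Hypothetical sketch of the three-step split described above."""
    sources = {p.name for p in Path(source_dir).glob("*.md")}
    # Step 1: prune cache entries whose source file has been removed.
    for name in list(cache):
        if name not in sources:
            del cache[name]
    # Step 2: regenerate the shared metadata (here, just the tag list)
    # from the cached per-article entries.
    tags = sorted({t for entry in cache.values() for t in entry.get("tags", [])})
    # Step 3: "publish" -- write the shared metadata into the output directory.
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "tags.txt").write_text("\n".join(tags))
    return tags
```

The real step 1 would also parse new or changed sources into the cache; this sketch only shows how the shared metadata can be rebuilt from cached entries without touching unchanged sources.
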
almet commented 12 years ago

Okay, metadata is something different. See the readers.py file: we parse the content files (Markdown or reStructuredText) to HTML, extracting some metadata at the same time.

  1. +1 for caching content + metadata. We can use an md5 hash for this.
  2. Only regenerate the content + metadata if the md5 sum of the content is different.
  3. I'm not following you here.

#224 does have some ideas on how to do that. I think it's a matter of adding a layer between the readers and the generator (the reader could read information from a separate place on disk, for instance).
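
A rough sketch of that caching layer, with `parse` standing in for the readers.py logic and `cache` being any dict-like store (which could later be written to disk); all names here are hypothetical:

```python
import hashlib

def read_with_cache(path, raw, cache, parse):
    """Return (html, metadata), re-parsing only when the content's md5 changed."""
    digest = hashlib.md5(raw.encode("utf-8")).hexdigest()
    entry = cache.get(path)
    if entry is not None and entry["md5"] == digest:
        return entry["html"], entry["metadata"]   # cache hit: skip parsing
    html, metadata = parse(raw)                   # cache miss: parse for real
    cache[path] = {"md5": digest, "html": html, "metadata": metadata}
    return html, metadata
```

Hashing the content (rather than trusting mtimes) means a `touch`ed but unchanged file still hits the cache.
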

tshepang commented 12 years ago

@ametaireau

Have you looked at doit? I haven't myself, but a project similar to Pelican uses it to avoid this same issue of rebuilding everything.

almet commented 12 years ago

@tshepang any link?

tshepang commented 12 years ago

http://nikola.ralsina.com.ar/

drake-mer commented 12 years ago

What do you think of using SQLite or a flat-text database to store the article hashes?

almet commented 12 years ago

Well, do we really need anything other than Python's pickle module? Adding a db as a dependency sounds… not like something Pelican's users would like ;)
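
For the record, storing the hash map this way is a few lines with the standard library alone (the dict contents here are made up):

```python
import pickle
import tempfile

# Hypothetical cache mapping source path -> content hash.
hashes = {"content/first-post.md": "9e107d9d372bb6826bd81d3542a419d6"}

# Serialize the dict to disk...
with tempfile.NamedTemporaryFile(suffix=".pickle", delete=False) as f:
    pickle.dump(hashes, f)

# ...and read it back on the next run.
with open(f.name, "rb") as cache_file:
    restored = pickle.load(cache_file)

assert restored == hashes  # round-trips with no database involved
```
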

neoascetic commented 12 years ago

@ametaireau agreed

joedicastro commented 12 years ago

@ametaireau agreed

drake-mer commented 12 years ago

OK, I didn't know a Python module for that exists, so basically: agreed.

almet commented 12 years ago

That's not a db; it's a way to serialize/deserialize objects to/from disk.

So, do you plan to work on that? :-)

drake-mer commented 12 years ago

Yes, I'd like to work on that. I will start a new branch with a smart build system.

almet commented 12 years ago

any news?

drake-mer commented 12 years ago

My lack of familiarity with git is wearing me down. And you? You just reminded me that I need to work on it somehow.

almet commented 12 years ago

You don't need to understand much about git: just get the code with "git clone" and then hack on it. If you prefer, you can then send me the result of "git diff" so I can review your changes.

drake-mer commented 12 years ago

Hi @ametaireau, thanks for your feedback.

I am currently hacking on it... I'll keep you posted. Thanks for the encouragement.

almet commented 12 years ago

awesome, thanks!

gal-leib commented 12 years ago

So you suggest we build some kind of 'dependency graph'? Well, we could easily check whether we need to generate a file based on the file in the output folder: if it doesn't exist or is older than the source, we should generate it.

We can also store the template each page uses as a dependency, along with the templates those templates include, recursively; it's very simple with Jinja2. So if a template is updated, we'll need to rebuild the file.
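
Collecting those template dependencies recursively really is short with Jinja2's meta module; a sketch with made-up template names:

```python
from jinja2 import Environment, DictLoader, meta

# Hypothetical templates, just to illustrate the recursion.
TEMPLATES = {
    "article.html": '{% extends "base.html" %}',
    "base.html": '{% include "sidebar.html" %}<main></main>',
    "sidebar.html": "<aside>tag cloud</aside>",
}
env = Environment(loader=DictLoader(TEMPLATES))

def template_dependencies(name, seen=None):
    """Collect every template `name` extends or includes, recursively."""
    seen = set() if seen is None else seen
    ast = env.parse(TEMPLATES[name])
    for ref in meta.find_referenced_templates(ast):
        if ref is not None and ref not in seen:  # ref is None for dynamic names
            seen.add(ref)
            template_dependencies(ref, seen)
    return seen
```

If any template in `template_dependencies(page_template)` is newer than the output file, the page needs a rebuild.
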

Oh, and we also need to regenerate the whole site if the settings changed. We could store the graph in a pickle or something.

But it still won't solve the issue of regenerating a lot of the pages when something common to all of them is updated. Let's say I want a list on the right with the 5 most recent posts, or a tag cloud: even if we have all the metadata cached, we're still going to regenerate the whole site. Or if we want to work on a new theme, every change will require us to rebuild all the posts.

So a dependency graph could make a small improvement, but if you are changing something that appears in most, if not all, of the pages, it won't be a big time-saver on large sites.

Other static site generators suffer from the same problem. We basically went from generating a page per request on the server, dynamically, to generating THE WHOLE SITE in advance. For small sites that is better, but if you search around you'll see a lot of people moving away from static blogs because of the long build times once they have a lot of posts.

TL;DR: we can implement a dependency graph rather easily, but it still won't solve the issue of regenerating the whole site on variable or template changes.

almet commented 12 years ago

As said, I think we need to:

If we want to use a dependency graph to regenerate only the files that changed, we could do that by listing all the parameters that can change on a page. The thing is that those are not clearly listed anywhere; for instance, templates can use anything from the settings, as well as dynamically generated values (such as the last generation date).

I don't really think we need any kind of dependency graph, but rather a way to avoid reading all the files each time. Then it's not really a problem if we regenerate all of them: the reading/parsing is what takes time, not writing the output.

If I remember correctly, that's what sphinx does, for instance.

gal-leib commented 12 years ago

We could maybe pickle each page as we generate it, maybe even storing only the metadata and not the compiled result. Then, instead of reading the files, we check whether the source file is newer than the pickle file; if so, we need to read and parse it. Otherwise, we read the pickle/metadata file and use that.
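
The freshness check itself is just a comparison of mtimes; a minimal sketch (path names hypothetical):

```python
import os

def needs_rebuild(source_path, pickle_path):
    """True if no cached pickle exists yet, or the source is newer than it."""
    if not os.path.exists(pickle_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(pickle_path)
```
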

But even if we have cut down on read time, whenever we need to compile a page we still have to run it through Jinja2 and write the file. That's why I am suggesting we store the dependencies of each file, so we won't have to regenerate them all.

Then we have the case where a user makes a new page, for example, and we need to rebuild the whole site. The only way to discover which files would need to change based on a variable would be to store the variables used in the template (recursively) and then check whether those changed (i.e., a new post was added, or a menu link was added in the config). But then we would need a way for plugins and custom tags to report that they may change dynamically, and that is too much of a headache.

I would just make incremental building the default, add an argument to regenerate everything, and document this use case (e.g. the 5-recent-posts list).

drake-mer commented 12 years ago

Well, I just wanted to say that I won't be available for a few months. Sorry about that. I see that people are interested in this, which sounds good.

almet commented 12 years ago

Closing for now, since we haven't had any news. Still, having a build system is a good idea.

xxks-kkk commented 6 years ago

I notice this issue is 6 years old. I'm just curious whether a smart build system has been implemented by now, so that I don't have to rebuild every post every time. If so, can anyone point me to the configuration I need? Right now it takes about 1 minute to build everything for me. Thanks!

justinmayer commented 6 years ago

@xxks-kkk: Assuming you are only making changes to a single page/post, you can generate just that page/post. Instructions for doing so are in the documentation.

xxks-kkk commented 6 years ago

@justinmayer Thanks for the info. I tried the following command

pelican --write-selected /Users/zeyuan/Documents/projects/linuxjedi.co.uk/output/posts/2018/Apr/05/on-writereadspace-amplification/index.html

and it still triggers the whole build process. Am I looking at the wrong command?

Thanks!

justinmayer commented 6 years ago

That should be the one. If you need further help, you might consider reaching out via the Pelican IRC channel, as noted in the How to Get Help section of the documentation.