Parsing C2C custom tags in documents text attributes

asaunier commented 8 years ago

In v5 a log of formatting is possible and based upon several syntaxes, mainly BBcode and Markdown. For the latter syntaxes a PR is available at #127

Many custom tags are also available:

new tags such as wikilinks [[summits/123456|link title]], [toc 2 right] or L# (climbing routes pitches)
existing tags but with special attributes, not supported out of the box by the parsers used in #127. For instance [img=188119 right]Lever de soleil sur le Mont Blanc[/img]

See the documentations about those syntaxe:

asaunier commented 8 years ago

wikilinks => see #156

asaunier commented 8 years ago

Images

First thing is to figure out weither to an extension for the bbcode parser or for the markdown parser.

Extending the bbcode parser looks fairly easy, see the doc/example at http://bbcode.readthedocs.org/en/latest/formatters.html

On the other hand the markdown parser seems a bit more flexible and powerful (and in the future we might want to get rid of the bbocode!?) => then it's probably smarter to only write markdown extensions?

There's a quite annoying issue with images: the img tags contains the document_id, not the filename itself. For instance [img=188119 right]Lever de soleil[/img]. => we will have to make a query to the database (ie. to the API) to get the actual filename.

Maybe we could convert the bbcode tag to an angular directive that would request the API to get the filename to use?
Maybe we could change the filename to use the document_id instead? A problem is that if a new file has been uploaded to a given image document, the document_id is the same but the filename is obviously different... Example of image document with several file versions: http://www.camptocamp.org/images/diff/194079/fr/3/2

stef74 commented 8 years ago

Liste of non supported syntax

[ ] L#
[x] [Image]
[x] [toc]
[ ] [q]
[ ] [c]
[x] [important],[warning]
[ ] column / table

fjacon commented 8 years ago

Highest priority : images

asaunier commented 8 years ago

The tricky thing with images is that the tag contains the image id but not the real filename (eg. [img=188119 right]Lever de soleil[/img]).

Perhaps a solution could be to convert the tag to some angular directive that makes an AJAX request to the API to get the image filename based upon the document id => embedded images would then be loaded asynchronously.

Or we could have a kind of proxy (webservice) that would convert an image-id-based image URL to the real image URL.

tsauerwein commented 8 years ago

Perhaps a solution could be to convert the tag to some angular directive that makes an AJAX request to the API to get the image filename based upon the document id => embedded images would then be loaded asynchronously.

Aren't these tags parsed on the UI server-side?

asaunier commented 8 years ago

Aren't these tags parsed on the UI server-side?

Not yet but they could. I'm just worrying that too many API requests on the UI server-side might delay the page rendering too much!?

asaunier commented 8 years ago

Liste of non supported syntax

This list should be considered in several parts.

[q], [c] seems to be shortcuts for [quote] and [code]. The latter tags are supported out of the box by our bbcode parser: http://bbcode.readthedocs.io/en/latest/tags.html => Not sure having shortcuts is a good thing, it adds complexity to the parsing. I would recommended to convert those shortcuts to the standard tags with the migration script. Perhaps we could even convert them to the Mardown syntax since getting rid from bbcode and relying on a single syntax as far as possible would be relevant.
Inline formatting such as important, warning that could probably be processed the same way than https://github.com/c2corg/v6_ui/blob/master/c2corg_ui/format/wikilinks.py
tags for which we could find parsers/extensions/inspiration on the web, if not too customized/specific. For instance col/table and toc (needs investigations). Our Markdown parser (python-markdown) has interesting extensions that could be used: https://pythonhosted.org/Markdown/extensions/index.html
- TOC: see https://pythonhosted.org/Markdown/extensions/toc.html
- newline (see https://github.com/c2corg/v6_ui/issues/420): see https://pythonhosted.org/Markdown/extensions/nl2br.html
- col/tables: see https://pythonhosted.org/Markdown/extensions/tables.html

Concerning tables, please note that http://www.camptocamp.org/articles/151910/fr/aide-mise-en-forme-du-texte#column describes a bbcode-like [col] which is obviously not Markdown-supported. On the other hand, the L# formatting seems to rely on the markdown table system. It's a bit a pity to use several redundant syntax system. Perhaps we should drop the [col] tag support?

complex tags such as [image] and L# that are heavily customized with a great amount of options. They will require a lot of work (perhaps simplifications or partial support would be OK?) and precise specs.

L# is really complicated. See http://www.camptocamp.org/articles/305462/fr/aide-topoguide-l-balise-de-description-des-longueurs with advanced numbering options. As a user I think it is way too complicated and am pretty sure most of the options are not used. I would recommend skipping some of them. Perhaps the association could precise what options should actually be supported. Please note that this syntax seems to use the Markdown syntax for tables. Perhaps we could split its processing in 2 parts:
the cols/lines rendering using the markdown syntax (not sure the :: separators as described at http://www.camptocamp.org/articles/305462/fr/aide-topoguide-l-balise-de-description-des-longueurs#syntaxe-de-base are actually supported)
the automatic numbering of the pitches (the # of L#).

Concerning images, the tricky thing is too convert the image id into an actual file URL (the filename changes if a new version of the image is uploaded whereas the image id remains the same). The filename is not available directly in the tag itself. See https://github.com/c2corg/v6_ui/issues/131#issuecomment-229337659 Several approaches are possible for the conversion: on the UI server side or on the UI client side with an asynchronous request. For both cases we probably need to make a request to the API to get the filename according to an image id (and a version id?). Additional info such as the image title + the width/size would be needed as well.

asaunier commented 8 years ago

Concerning tables, please note that http://www.camptocamp.org/articles/151910/fr/aide-mise-en-forme-du-texte#column describes a bbcode-like [col] which is obviously not Markdown-supported. On the other hand, the L# formatting seems to rely on the markdown table system. It's a bit a pity to use several redundant syntax system. Perhaps we should drop the [col] tag support?

After having a closer look at this tag, it seems it is not really a table syntax. Those tags are actually replaced by <div> and <p> tags with predefined CSS classes. So the processing could probably be done using a filter such as https://github.com/c2corg/v6_ui/blob/master/c2corg_ui/format/wikilinks.py

tsauerwein commented 8 years ago

Several approaches are possible for the conversion: on the UI server side or on the UI client side with an asynchronous request.

Making the request on the UI client side is probably better. The request to get the page from the UI server side returns faster, and the user could already see the page, while the image info is requested in the background.

asaunier commented 8 years ago

+1 The fact that making client side conversion would help loading the page faster is important.

Please note however that:

we will not benefit of the page caching on the UI server side.
if 10 images are embedded in the document, we will probably have to make 10 API requests every time the page is loaded.

asaunier commented 8 years ago

Concerning images, the tricky thing is too convert the image id into an actual file URL (the filename changes if a new version of the image is uploaded whereas the image id remains the same). The filename is not available directly in the tag itself. See #131 (comment) Several approaches are possible for the conversion: on the UI server side or on the UI client side with an asynchronous request. For both cases we probably need to make a request to the API to get the filename according to an image id (and a version id?). Additional info such as the image title + the width/size would be needed as well.

A possibility would be to make a simple conversion from

[img=37130 right]Gramusset[/img]

to

<img src="//api.camptocamp.org/images/proxy/37130" class="right-image" alt="Gramusset">

api.camptocamp.org/images/proxy/37130 would be a proxy service in the API that would make a request to the database to get the image filename and would redirect to the actual URL of the image file.

This way:

no need to make complicated AJAX requests to the API from the UI side (either client side or server side)
the image HTML code in the UI page is cached as the rest of the page
if a new version of the image is uploaded, the HTML code is still valid
the API service may be cached as well (with a not too big TTL so that a new version of the image is taken into account after some time)
the [img] tag can be processed with an extension of the markdown parser, similarly to wikilinks https://github.com/c2corg/v6_ui/blob/master/c2corg_ui/format/wikilinks.py
of course we only get the image file URL, not additional data such as width or height (but neither do we in v5 conversions)

@tsauerwein do you think it could work?

tsauerwein commented 8 years ago

@tsauerwein do you think it could work?

That would work.

I just had another idea: We could also replace the image tags directly in the API when returning the document. For example when requesting /wayponts/1, the API would replace [img=37130 right]Gramusset[/img] with [img filename=/.../xyz.jpg right]Gramusset[/img]. This way we would avoid a couple of round-trips.

Requesting a document for editing would still return [img=37130 right]Gramusset[/img]. This might be a problem for the preview. For this we could still use the proxy redirect technique that you proposed. But the most common case (showing a document) would be optimized.

asaunier commented 8 years ago

That sounds interesting.

A drawback of this method is that some of the tag parsing (at least the id>filename conversion) is deported to the API instead of having all the parsing done at a single place (in the UI). In addition there's this consistency issue between the read/preview/write.

fbunoz commented 8 years ago

A drawback of this method is that some of the tag parsing (at least the id>filename conversion) is deported to the API instead of having all the parsing done at a single place (in the UI).

Another solution for the edition, is to add a field in the data returned by the API, which contains an array with IDs and file names of all images associated to the current doc :

{
    ID1 : file_name1
    ID2 : file_name2
    ID3 : file_name3
}

For the view page, these info are already downloaded to generate the image list.

tsauerwein commented 8 years ago

... of all images associated to the current doc

Can you only insert an image tag for images that are associated to the document? I think we don't want to enforce that constraint in v6?

But I like your idea of including the filenames on the returned document. The API would have to parse the text for image tags, but that is ok.

For the preview we still might need an API web-service which returns filenames for a list of image ids.

asaunier commented 7 years ago

I have tested enabling some standard Python-Markdown extensions with PR https://github.com/c2corg/v6_ui/pull/695 See my comments in it.

To summarize:

line breaks: OK
TOC: more or less OK but sophisticated syntaxes such as [toc 2 right] are not supported out of the box
tables (useful for L# ?): KO

stef74 commented 7 years ago

@asaunier I propose to close ... for toc2 we can create dedicated issue with association label

asaunier commented 7 years ago

I prefer to leave the issue open because it contains many info and analysis about the tag parsing.

asaunier commented 7 years ago

Tutorial about extending python-markdown: https://github.com/waylan/Python-Markdown/wiki/Tutorial:-Writing-Extensions-for-Python-Markdown

asaunier commented 7 years ago

I have the feeling that using both bbcode and markdown formatting is a nonsense:

it complicates the understanding by users because 2 systems are used whereas only one should be enough
it will probably complicate the addition of a set of buttons used to help people add such formatting
it may cause conflicts. For instance the bbcode parser unexpectedly replaces the image URLs (added by the [img] tag parser at https://github.com/c2corg/v6_ui/pull/775) by HTML links unless it is run before the markdown parser (as for now it was run last, not sure the switch won't break something).

What about reyling only on the markdown parser?

It would mean we would have to convert the bbcode tags by markdown tags (at least the common ones) when migrating the data. But some converting is needed anyway: https://github.com/c2corg/v6_api/issues/417

asaunier commented 7 years ago

An other argument for getting rid of the bbcode: it will make things even more complicated when we will add a text formatting wizard/editor! See https://github.com/c2corg/v6_ui/issues/305

asaunier commented 7 years ago

I set this issue to milestone golive because it requires changes in the migration script, at least for some tag converting https://github.com/c2corg/v6_api/issues/417

fjacon commented 7 years ago

@arnaud-morvan or @tsauerwein : have you taken Alex last comment into account in migration script ?

tsauerwein commented 7 years ago

have you taken Alex last comment into account in migration script ?

The issues related to the migration are tracked here: https://github.com/c2corg/v6_api/issues/417 One point is left: https://github.com/c2corg/v6_api/issues/417#issuecomment-256397038

asaunier commented 7 years ago

One point is left: c2corg/v6_api#417 converting common bbcode tags to markdown??

This is an open question, I don't have the answer. Getting rid of bbcode tags would imply to make more conversions in the migration script (as done for some shortcuts and wikilinks in https://github.com/c2corg/v6_api/pull/478) but I have the feeling it might simplify things in the future to have only one formatting syntax instead of 2.

@stef74 Please note that I have not made the [toc] syntax simplification (removing unsupported options) (striked out in https://github.com/c2corg/v6_api/issues/417#issue-181414714) because @fbunoz was opposed to it (see https://github.com/c2corg/v6_ui/pull/695#issuecomment-253619271). As a result there are probably many pages whose TOC is broken, I am not sure it is really relevant.

stef74 commented 7 years ago

@asaunier this ticket can be close ? (dedicated ticket already exist) and no action on SA side ?

tsauerwein commented 7 years ago

I created a separate issue for the remaining problem (bbcode -> Markdown): https://github.com/c2corg/v6_api/issues/516

tsauerwein commented 7 years ago

Sorry that was the wrong issue.

c2corg / v6_ui

Parsing C2C custom tags in documents text attributes #131

Images