Closed duzyn closed 7 years ago
All spreadsheet programs can also export tab separated values (TSV) and those are very easy to convert into Markdown pipe tables using a decent text editor which supports regular expressions and preferably likewise substitutions. The following Perl program "tsv2pipetable.pl" takes a TSV file on stdin, converts it into a pipe table assuming that the first line is a header row and spits out the pipe table on stdout. You can use it as a filter in pp with
!exec(perl tsv2pipetable.pl <filename.tsv)
#!/usr/bin/env perl
# tsv2pipetable.pl - convert a file with tab separated values to a Pandoc pipe table
#
# Usage:
#
# perl tsv2pipetable.pl <filename.tsv >pipetable.md
use strict;
use warnings;
use 5.008005
use open qw[ :utf8 :std ];
my $colfill = '-' x 5; # for the column spec row
# slurp all lines into an array
my @rows = <>;
# construct the column spec row
## make a copy of the first line
unshift @rows, $rows[0];
## replace column contents in the colspec row with five dashes each
$rows[1] =~ s/[^\t\n]+/$colfill/g;
# process each row
for ( @rows ) {
## escape existing pipes if any
s/\|/\\|/g;
## replace tabs with pipes
s/\t/|/g;
## insert a pipe at the start of each row
s/^/|/;
}
# join the lines and print the pipetable to stdout
print STDOUT join "", @rows;
__END__
I hope this helps. Just ask if something isn't clear!
Good idea.
To avoid external dependencies I can add a macro \csv
that takes the name of a CSV file.
For example: \csv(file.csv)
Which table format in the output file would be the better? May be grid table to allow multiline cells.
The format could be deduced from the nature of columns (number are right-aligned, text is left aligned). If the csv file has no header, it can be given as a parameter.
\csv(file.csv)(field 1 | field 2 | ...) ==> add a header line
\csv(file.csv)() ==> headless table containing all the lines of the csv file
\csv(file.csv) ==> the first line of the csv file is used as the header line
The format of the input file could be inferred from the most frequent character amount some popular separators (',' ';' tab ...).
I also found some others' project like these ones:
Just make sure that they support quoted values and escaped comma/quote characters! The good thing with TSV is that tabs are simply illegal in values. Thus you are sure that all tabs are field separators and no quoting or escaping is needed.
I have added the csv
macro. See https://github.com/CDSoft/pp#csv-tables
It's based on a Haskell CSV parser which supports quoted values. The separator is the most frequent separator in the file. I guess it should work most of the time. If not I'll add an optional parameter to define it.
Thanks @duzyn and @CDSoft ! This new macro is really useful.
pp convert the csv file to grid_tables which require lining up columns. Now there is a problem when converting CJK csv files, because pp can't line up CJK characters.
This is the original Chinese csv:
中文(Chinese),英文(English),价格(Price)
香蕉,Banana,3.9
苹果,Apple,4.5
And this is the converted table:
+-------------+-------------+-----------+
| 中文(Chinese) | 英文(English) | 价格(Price) |
+:============+:============+==========:+
| 香蕉 | Banana | 3.9 |
+-------------+-------------+-----------+
| 苹果 | Apple | 4.5 |
+-------------+-------------+-----------+
And this is the converted HTML by Pandoc:
<table style="width:58%;">
<colgroup>
<col style="width: 19%" />
<col style="width: 19%" />
<col style="width: 19%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">中文(Chinese</th>
<th style="text-align: left;">) | 英文(Eng</th>
<th style="text-align: right;">lish) | 价格(Price)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">香蕉</td>
<td style="text-align: left;"><div class="line-block">Banana</div></td>
<td style="text-align: right;"><div class="line-block"> 3.9</div></td>
</tr>
<tr class="even">
<td style="text-align: left;">苹果</td>
<td style="text-align: left;"><div class="line-block">Apple</div></td>
<td style="text-align: right;"><div class="line-block"> 4.5</div></td>
</tr>
</tbody>
</table>
That's hardly pp's fault. The problem is that most text editors still believe that fixed-width font means "single-width-font" while most CJK characters, unlike Western and most other characters notably including space characters, take up two width units in fixed width fonts. There are essentially two workarounds:
Use a font where Western characters and space characters are the same width as CJK characters. I don't know if such a fixed-width font exists. The Western characters would probably look strange not to say ugly, or be separated by unnormal amounts of whitespace which also might be called ugly, in such a font.
A tab (de)expansion program which takes the double-width of CJK characters into acount. I don't know if such a program exists, but it should not be forbiddingly hard to write if you use a language or library with enough Unicode savvy to tell CJK and non-CJK characters apart -- in fact I almost want to try to do so, except that I don't have the time.[1] That said it is probably not reasonable to expect that pp should include such functionality unless there already is a Haskell or C library which can be integrated. Something like Tabular.vim might contain hints how to write such a program. What you certainly can do with Tabular.vim is realigning pipe tables; I do it all the time! It is also pretty easy to define shell aliases for invoking pandoc with the table extensions of your choice disabled or enabled for reformatting tables between different formats (assuming that your editor can filter part of a file through an external program; as a Vim user I take that for granted I'm afraid!)
[1] A simple algorithm which may work:
This of course requires that 4 or 5 is possible within the scope of one table cell in the first place, and as far as Pandoc currently is concerned it requires that you use pipe tables which isn't always convenient.
Please note that I sympathize with you a lot. In my linguistic work I need to use different scripts, not including CJK though, and lots of combining marks. Most text editors suck at handling those. Vim at least doesn't break because of them! Unfortunately there is no easy fix in sight since most programming tools are built on the assumption or requirement of single-width fonts without any combining marks, if not ASCII.
Den 2018-11-24 kl. 11:14, skrev David Peng:
pp convert the csv file to grid_tables which require lining up columns. Now there is a problem when converting CJK csv files, because pp can't line up CJK characters.
@duzyn:
Now there is a problem when converting CJK csv files, because pp can't line up CJK characters.
Probably you should consider using a pandoc filter to process PP's output file and cleanup the table cells alignment:
You can write filters in almost any scripting language you're comfortable with, Haskell and Lua being supported natively by pandoc and not requiring third party tools/languages to be installed.
I'd probably go for a Lua filter, and just iterate through every column twice: once to calculate the widest cell, the second pass to add trailing spaces to every cell that is shorter than that max value. Keeping track of Chinese chars which should count as two-spaces shouldn't be hard if you know the Unicode ranges you're looking for.
I just discovered that both Vim's :retab
command and Tabular.vim
take due care of double-width chars, though I don't know how.
So if you are using Vim just install Tabular.vim, put this mapping
into your .vimrc (all just on one line in case the mailer breaks
the line!)
nnor <Leader>g2p vip:v/\|/d<cr>{jyyp:s/[^\|]/-/g<cr>gv:Tabu
/\|<cr>j:s/[^\|]/-/g<cr>
then in normal mode place the cursor anywhere inside the grid
table produced by pp an type \g2p
and you should have a nicely
lined-up pipe table!
Den 2018-11-24 kl. 19:27, skrev Benct Philip Jonsson:
That's hardly pp's fault. The problem is that most text editors still believe that fixed-width font means "single-width-font" while most CJK characters, unlike Western and most other characters notably including space characters, take up two width units in fixed width fonts. There are essentially two workarounds:
Use a font where Western characters and space characters are the same width as CJK characters. I don't know if such a fixed-width font exists. The Western characters would probably look strange not to say ugly, or be separated by unnormal amounts of whitespace which also might be called ugly, in such a font.
A tab (de)expansion program which takes the double-width of CJK characters into acount. I don't know if such a program exists, but it should not be forbiddingly hard to write if you use a language or library with enough Unicode savvy to tell CJK and non-CJK characters apart -- in fact I almost want to try to do so, except that I don't have the time.[1] That said it is probably not reasonable to expect that pp should include such functionality unless there already is a Haskell or C library which can be integrated. Something like [Tabular.vim] might contain hints how to write such a program. What you certainly can do with Tabular.vim is realigning pipe tables; I do it all the time! It is also pretty easy to define shell aliases for invoking pandoc with the table extensions of your choice disabled or enabled for reformatting tables between different formats (assuming that your editor can filter part of a file through an external program; as a Vim user I take that for granted I'm afraid!)
[1] A simple algorithm which may work:
- Use expand(1) or its equivalent normally.
- Split each line into chunks of characters one "tabstop" long.
- Count the number of CJK characters in each chunk.
- Remove that number of space characters at the end of the chunk if possible.
- If 4. isn't possible keep counting CJK characters until you reach a chunk where the required number of space characters can be removed.
This of course requires that 4 or 5 is possible within the scope of one table cell in the first place, and as far as Pandoc currently is concerned it requires that you use pipe tables which isn't always convenient.
[Tabular.vim]: https://www.google.com/search?q=vim+tabular
Please note that I sympathize with you a lot. In my linguistic work I need to use different scripts, not including CJK though, and lots of combining marks. Most text editors suck at handling those. Vim at least doesn't break because of them! Unfortunately there is no easy fix in sight since most programming tools are built on the assumption or requirement of single-width fonts without any combining marks, if not ASCII.
Den 2018-11-24 kl. 11:14, skrev David Peng:
pp convert the csv file to [grid_tables] which require lining up columns. Now there is a problem when converting CJK csv files, because pp can't line up CJK characters.
[grid\_tables]: http://pandoc.org/MANUAL.html#tables
Den 2018-11-24 kl. 20:06, skrev Tristano Ajmone:
I'd probably go for a Lua filter, and just iterate through every column twice: once to calculate the widest cell, the second pass to add trailing spaces to every cell that i shorter than that max value. Keeping track of Chines chars which should count as two-spaces shouldn't be hard if you know the Unicode ranges you're looking for.
That would actually be hell itself to do in Lua, because Lua knows nothing about Unicode. You would have to check for the right byte sequences --- good luck with that!
But before doing that you would need to have Pandoc parse the original table correctly which it demonstrably does not, which is a bigger problem than what the table looks like on screen. At least I guess that the OP could live with the table looking weird on screen if only Pandoc parsed it correctly. So you are back at square one --- a Pandoc filter is of no help!
However I have figured out how the OP can get Pandoc to parse the example table without very much manual work:
After all the only safe way is to give Pandoc a pipe table, because those are parsed correctly by Pandoc even if they don't line up visually/on screen, and thinking it through some more I think that pp actually could have an option to output pipe tables.
In the meantime the best the OP can do is to convert the grid table output by pp to a pipe table manually, which luckily isn't all that hard: just remove all lines which don't contain any pipes, insert a header separator row and voilà! It won't look nice on screen, but it will parse correctly, giving correct HTML output hopefully be rendered correctly by web browsers.
Step 1: remove all row separators except the top one:
+-------------+-------------+-----------+ <-- keep
| 中文(Chinese) | 英文(English) | 价格(Price) |
+:============+:============+==========:+ <-- remove
| 香蕉 | Banana | 3.9 |
+-------------+-------------+-----------+ <-- remove
| 苹果 | Apple | 4.5 |
+-------------+-------------+-----------+ <-- remove
+-------------+-------------+-----------+ <-- kept
| 中文(Chinese) | 英文(English) | 价格(Price) |
| 香蕉 | Banana | 3.9 |
| 苹果 | Apple | 4.5 |
Step 2: swap the two top rows:
| 中文(Chinese) | 英文(English) | 价格(Price) |
+-------------+-------------+-----------+ <-- moved
| 香蕉 | Banana | 3.9 |
| 苹果 | Apple | 4.5 |
Step 3: replace the pluses in the separator row with pipes and insert an alignment marker for the price column
| 中文(Chinese) | 英文(English) | 价格(Price) |
|-------------|-------------|-----------:| <-- changed
| 香蕉 | Banana | 3.9 |
| 苹果 | Apple | 4.5 |
Step 4: convert with pandoc:
<table>
<thead>
<tr class="header">
<th>中文(Chinese)</th>
<th>英文(English)</th>
<th style="text-align: right;">价格(Price)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>香蕉</td>
<td>Banana</td>
<td style="text-align: right;">3.9</td>
</tr>
<tr class="even">
<td>苹果</td>
<td>Apple</td>
<td style="text-align: right;">4.5</td>
</tr>
</tbody>
</table>
In Vim invoking the following mapping in normal mode with the cursor somewhere in the grid table does the trick without any need for any plugin (although the mapping in my previous post using Tabular.vim will also give a good-looking pipe table!
nnor \g2p vip<esc>yygv:v/|/d<cr>{jp:s/+/|/g<cr>
I wrote:
nnor \g2p vip
yygv:v/|/d {jp:s/+/|/g
forgetting to escape the pipes! It should read:
nnor \g2p vip<esc>yygv:v/\|/d<cr>{jp:s/+/\|/g<cr>
pp could also generate pipe tables directly (at least in markdown).
pp 2.7.1 generates pipe tables (in markdown only).
Thanks Christophe!
It would be nice if pp can include CSV file as markdown table like iA write did.