CDSoft / pp

PP - Generic preprocessor (with pandoc in mind) - macros, literate programming, diagrams, scripts...
http://cdelord.fr/pp
GNU General Public License v3.0
252 stars 21 forks source link

Include CSV file as markdown table #23

Closed duzyn closed 7 years ago

duzyn commented 7 years ago

It would be nice if pp can include CSV file as markdown table like iA write did.

Markdown tables are painful to write manually. On the other hand, spreadsheet editors like Numbers or Excel make it easy to create tables with complex formulas. A long-time standard for exporting these tables is the comma separated value format (csv). You can compose a table in Excel, and then have all the calculations exported in a plain text csv. And now you can reference a csv file in iA Writer the same as way as images:

Using the first prototype simulating said behavior, we knew this was the way forward. It’s hard to describe how transclusion feels other than it just feels right. Beginning with a simple, straightforward syntax kept reaping dividends.

bpj commented 7 years ago

All spreadsheet programs can also export tab separated values (TSV) and those are very easy to convert into Markdown pipe tables using a decent text editor which supports regular expressions and preferably likewise substitutions. The following Perl program "tsv2pipetable.pl" takes a TSV file on stdin, converts it into a pipe table assuming that the first line is a header row and spits out the pipe table on stdout. You can use it as a filter in pp with

!exec(perl tsv2pipetable.pl <filename.tsv)
#!/usr/bin/env perl

# tsv2pipetable.pl - convert a file with tab separated values to a Pandoc pipe table
#
# Usage:
#
#     perl tsv2pipetable.pl <filename.tsv >pipetable.md

use strict;
use warnings;
use 5.008005
use open qw[ :utf8 :std ];

my $colfill = '-' x 5; # for the column spec row

# slurp all lines into an array
my @rows = <>;

# construct the column spec row

## make a copy of the first line
unshift @rows, $rows[0];

## replace column contents in the colspec row with five dashes each
$rows[1] =~ s/[^\t\n]+/$colfill/g;

# process each row

for ( @rows ) {
    ## escape existing pipes if any
    s/\|/\\|/g;
    ## replace tabs with pipes
    s/\t/|/g;
    ## insert a pipe at the start of each row
    s/^/|/;
}

# join the lines and print the pipetable to stdout
print STDOUT join "", @rows;

__END__

I hope this helps. Just ask if something isn't clear!

duzyn commented 7 years ago

That's very helpful. I forgot the powerful !exec() macro. I also found some others' project like these ones:

I'm using csv2md for my writings now with this macro:

\exec(csv2md 1.csv)
CDSoft commented 7 years ago

Good idea. To avoid external dependencies I can add a macro \csv that takes the name of a CSV file.

For example: \csv(file.csv)

Which table format in the output file would be the better? May be grid table to allow multiline cells.

The format could be deduced from the nature of columns (number are right-aligned, text is left aligned). If the csv file has no header, it can be given as a parameter.

\csv(file.csv)(field 1 | field 2 | ...) ==> add a header line
\csv(file.csv)() ==> headless table containing all the lines of the csv file
\csv(file.csv) ==> the first line of the csv file is used as the header line

The format of the input file could be inferred from the most frequent character amount some popular separators (',' ';' tab ...).

bpj commented 7 years ago

I also found some others' project like these ones:

Just make sure that they support quoted values and escaped comma/quote characters! The good thing with TSV is that tabs are simply illegal in values. Thus you are sure that all tabs are field separators and no quoting or escaping is needed.

CDSoft commented 7 years ago

I have added the csv macro. See https://github.com/CDSoft/pp#csv-tables

It's based on a Haskell CSV parser which supports quoted values. The separator is the most frequent separator in the file. I guess it should work most of the time. If not I'll add an optional parameter to define it.

tajmone commented 7 years ago

Thanks @duzyn and @CDSoft ! This new macro is really useful.

duzyn commented 5 years ago

pp convert the csv file to grid_tables which require lining up columns. Now there is a problem when converting CJK csv files, because pp can't line up CJK characters.

This is the original Chinese csv:

中文(Chinese),英文(English),价格(Price)
香蕉,Banana,3.9
苹果,Apple,4.5

And this is the converted table:

+-------------+-------------+-----------+
| 中文(Chinese) | 英文(English) | 价格(Price) |
+:============+:============+==========:+
| 香蕉          | Banana      |       3.9 |
+-------------+-------------+-----------+
| 苹果          | Apple       |       4.5 |
+-------------+-------------+-----------+

And this is the converted HTML by Pandoc:

<table style="width:58%;">
<colgroup>
<col style="width: 19%" />
<col style="width: 19%" />
<col style="width: 19%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">中文(Chinese</th>
<th style="text-align: left;">) | 英文(Eng</th>
<th style="text-align: right;">lish) | 价格(Price)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">香蕉</td>
<td style="text-align: left;"><div class="line-block">Banana</div></td>
<td style="text-align: right;"><div class="line-block">      3.9</div></td>
</tr>
<tr class="even">
<td style="text-align: left;">苹果</td>
<td style="text-align: left;"><div class="line-block">Apple</div></td>
<td style="text-align: right;"><div class="line-block">      4.5</div></td>
</tr>
</tbody>
</table>
bpj commented 5 years ago

That's hardly pp's fault. The problem is that most text editors still believe that fixed-width font means "single-width-font" while most CJK characters, unlike Western and most other characters notably including space characters, take up two width units in fixed width fonts. There are essentially two workarounds:

  1. Use a font where Western characters and space characters are the same width as CJK characters. I don't know if such a fixed-width font exists. The Western characters would probably look strange not to say ugly, or be separated by unnormal amounts of whitespace which also might be called ugly, in such a font.

  2. A tab (de)expansion program which takes the double-width of CJK characters into acount. I don't know if such a program exists, but it should not be forbiddingly hard to write if you use a language or library with enough Unicode savvy to tell CJK and non-CJK characters apart -- in fact I almost want to try to do so, except that I don't have the time.[1] That said it is probably not reasonable to expect that pp should include such functionality unless there already is a Haskell or C library which can be integrated. Something like Tabular.vim might contain hints how to write such a program. What you certainly can do with Tabular.vim is realigning pipe tables; I do it all the time! It is also pretty easy to define shell aliases for invoking pandoc with the table extensions of your choice disabled or enabled for reformatting tables between different formats (assuming that your editor can filter part of a file through an external program; as a Vim user I take that for granted I'm afraid!)

[1] A simple algorithm which may work:

  1. Use expand(1) or its equivalent normally.
  2. Split each line into chunks of characters one "tabstop" long.
  3. Count the number of CJK characters in each chunk.
  4. Remove that number of space characters at the end of the chunk if possible.
  5. If 4. isn't possible keep counting CJK characters until you reach a chunk where the required number of space characters can be removed.

This of course requires that 4 or 5 is possible within the scope of one table cell in the first place, and as far as Pandoc currently is concerned it requires that you use pipe tables which isn't always convenient.

Please note that I sympathize with you a lot. In my linguistic work I need to use different scripts, not including CJK though, and lots of combining marks. Most text editors suck at handling those. Vim at least doesn't break because of them! Unfortunately there is no easy fix in sight since most programming tools are built on the assumption or requirement of single-width fonts without any combining marks, if not ASCII.

Den 2018-11-24 kl. 11:14, skrev David Peng:

pp convert the csv file to grid_tables which require lining up columns. Now there is a problem when converting CJK csv files, because pp can't line up CJK characters.

tajmone commented 5 years ago

@duzyn:

Now there is a problem when converting CJK csv files, because pp can't line up CJK characters.

Probably you should consider using a pandoc filter to process PP's output file and cleanup the table cells alignment:

You can write filters in almost any scripting language you're comfortable with, Haskell and Lua being supported natively by pandoc and not requiring third party tools/languages to be installed.

I'd probably go for a Lua filter, and just iterate through every column twice: once to calculate the widest cell, the second pass to add trailing spaces to every cell that is shorter than that max value. Keeping track of Chinese chars which should count as two-spaces shouldn't be hard if you know the Unicode ranges you're looking for.

bpj commented 5 years ago

I just discovered that both Vim's :retab command and Tabular.vim take due care of double-width chars, though I don't know how. So if you are using Vim just install Tabular.vim, put this mapping into your .vimrc (all just on one line in case the mailer breaks the line!)

nnor <Leader>g2p vip:v/\|/d<cr>{jyyp:s/[^\|]/-/g<cr>gv:Tabu 
/\|<cr>j:s/[^\|]/-/g<cr>

then in normal mode place the cursor anywhere inside the grid table produced by pp an type \g2p and you should have a nicely lined-up pipe table!

Den 2018-11-24 kl. 19:27, skrev Benct Philip Jonsson:

That's hardly pp's fault. The problem is that most text editors still believe that fixed-width font means "single-width-font" while most CJK characters, unlike Western and most other characters notably including space characters, take up two width units in fixed width fonts. There are essentially two workarounds:

  1. Use a font where Western characters and space characters are the same width as CJK characters. I don't know if such a fixed-width font exists. The Western characters would probably look strange not to say ugly, or be separated by unnormal amounts of whitespace which also might be called ugly, in such a font.

  2. A tab (de)expansion program which takes the double-width of CJK characters into acount. I don't know if such a program exists, but it should not be forbiddingly hard to write if you use a language or library with enough Unicode savvy to tell CJK and non-CJK characters apart -- in fact I almost want to try to do so, except that I don't have the time.[1] That said it is probably not reasonable to expect that pp should include such functionality unless there already is a Haskell or C library which can be integrated. Something like [Tabular.vim] might contain hints how to write such a program. What you certainly can do with Tabular.vim is realigning pipe tables; I do it all the time! It is also pretty easy to define shell aliases for invoking pandoc with the table extensions of your choice disabled or enabled for reformatting tables between different formats (assuming that your editor can filter part of a file through an external program; as a Vim user I take that for granted I'm afraid!)

[1] A simple algorithm which may work:

  1. Use expand(1) or its equivalent normally.
  2. Split each line into chunks of characters one "tabstop" long.
  3. Count the number of CJK characters in each chunk.
  4. Remove that number of space characters at the end of the chunk if possible.
  5. If 4. isn't possible keep counting CJK characters until you reach a chunk where the required number of space characters can be removed.

This of course requires that 4 or 5 is possible within the scope of one table cell in the first place, and as far as Pandoc currently is concerned it requires that you use pipe tables which isn't always convenient.

[Tabular.vim]: https://www.google.com/search?q=vim+tabular

Please note that I sympathize with you a lot. In my linguistic work I need to use different scripts, not including CJK though, and lots of combining marks. Most text editors suck at handling those. Vim at least doesn't break because of them! Unfortunately there is no easy fix in sight since most programming tools are built on the assumption or requirement of single-width fonts without any combining marks, if not ASCII.

Den 2018-11-24 kl. 11:14, skrev David Peng:

pp convert the csv file to [grid_tables] which require lining up columns. Now there is a problem when converting CJK csv files, because pp can't line up CJK characters.

[grid\_tables]: http://pandoc.org/MANUAL.html#tables
bpj commented 5 years ago

Den 2018-11-24 kl. 20:06, skrev Tristano Ajmone:

I'd probably go for a Lua filter, and just iterate through every column twice: once to calculate the widest cell, the second pass to add trailing spaces to every cell that i shorter than that max value. Keeping track of Chines chars which should count as two-spaces shouldn't be hard if you know the Unicode ranges you're looking for.

That would actually be hell itself to do in Lua, because Lua knows nothing about Unicode. You would have to check for the right byte sequences --- good luck with that!

But before doing that you would need to have Pandoc parse the original table correctly which it demonstrably does not, which is a bigger problem than what the table looks like on screen. At least I guess that the OP could live with the table looking weird on screen if only Pandoc parsed it correctly. So you are back at square one --- a Pandoc filter is of no help!

However I have figured out how the OP can get Pandoc to parse the example table without very much manual work:

After all the only safe way is to give Pandoc a pipe table, because those are parsed correctly by Pandoc even if they don't line up visually/on screen, and thinking it through some more I think that pp actually could have an option to output pipe tables.

In the meantime the best the OP can do is to convert the grid table output by pp to a pipe table manually, which luckily isn't all that hard: just remove all lines which don't contain any pipes, insert a header separator row and voilà! It won't look nice on screen, but it will parse correctly, giving correct HTML output hopefully be rendered correctly by web browsers.

Step 1: remove all row separators except the top one:

+-------------+-------------+-----------+               <-- keep
| 中文(Chinese) | 英文(English) | 价格(Price) |
+:============+:============+==========:+               <-- remove
| 香蕉          | Banana      |       3.9 |
+-------------+-------------+-----------+               <-- remove
| 苹果          | Apple       |       4.5 |
+-------------+-------------+-----------+               <-- remove
+-------------+-------------+-----------+               <-- kept
| 中文(Chinese) | 英文(English) | 价格(Price) |
| 香蕉          | Banana      |       3.9 |
| 苹果          | Apple       |       4.5 |

Step 2: swap the two top rows:

| 中文(Chinese) | 英文(English) | 价格(Price) |
+-------------+-------------+-----------+               <-- moved
| 香蕉          | Banana      |       3.9 |
| 苹果          | Apple       |       4.5 |

Step 3: replace the pluses in the separator row with pipes and insert an alignment marker for the price column

| 中文(Chinese) | 英文(English) | 价格(Price) |
|-------------|-------------|-----------:|              <-- changed
| 香蕉          | Banana      |       3.9 |
| 苹果          | Apple       |       4.5 |

Step 4: convert with pandoc:

<table>
<thead>
<tr class="header">
<th>中文(Chinese)</th>
<th>英文(English)</th>
<th style="text-align: right;">价格(Price)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>香蕉</td>
<td>Banana</td>
<td style="text-align: right;">3.9</td>
</tr>
<tr class="even">
<td>苹果</td>
<td>Apple</td>
<td style="text-align: right;">4.5</td>
</tr>
</tbody>
</table>

In Vim invoking the following mapping in normal mode with the cursor somewhere in the grid table does the trick without any need for any plugin (although the mapping in my previous post using Tabular.vim will also give a good-looking pipe table!

nnor \g2p vip<esc>yygv:v/|/d<cr>{jp:s/+/|/g<cr>
bpj commented 5 years ago

I wrote:

nnor \g2p vipyygv:v/|/d{jp:s/+/|/g

forgetting to escape the pipes! It should read:

nnor \g2p vip<esc>yygv:v/\|/d<cr>{jp:s/+/\|/g<cr>
CDSoft commented 5 years ago

pp could also generate pipe tables directly (at least in markdown).

CDSoft commented 5 years ago

pp 2.7.1 generates pipe tables (in markdown only).

tajmone commented 5 years ago

Thanks Christophe!