forTEXT / catma

Computer Assisted Text Markup and Analysis
https://www.catma.de
GNU General Public License v3.0

Formatting of document lost #229

Closed: mado89 closed this issue 3 years ago

mado89 commented 3 years ago

Hello,

for a project we want to analyse source code. Is there a way to keep the formatting of the text from getting lost? Source code is usually formatted using tabs and spaces. For this project it is important that I can see whether a line was indented using, for example, a tab or two spaces. Would this be possible? I tried converting the document to HTML and to PDF, both with the same result (I guess this is due to the calibre engine in the background?).

Thanks for the help!

mpetris commented 3 years ago

Hi @mado89,
can you try renaming the file to .txt? I think the plain text import leaves whitespace intact. It is actually the Tika engine running in the background, which sometimes does quite unexpected things. Another option would be to make it an XML file by wrapping the content in a root node. The XML import also leaves whitespace intact, and it does not use the Tika engine.
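A rough sketch of the XML option, in case it helps: the Python snippet below wraps arbitrary source text in a single root element and escapes the characters that would otherwise break the XML. The function name `wrap_in_xml` and the root element name `document` are assumptions made for illustration; nothing here is part of CATMA, and which element names the XML import expects is not confirmed in this thread.

```python
from xml.sax.saxutils import escape


def wrap_in_xml(source_text: str, root_tag: str = "document") -> str:
    """Wrap raw text in one XML root element.

    `root_tag` is a hypothetical element name chosen for this sketch.
    escape() replaces &, < and > so that source code containing them
    still yields well-formed XML. Tabs and spaces are left untouched.
    """
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"<{root_tag}>{escape(source_text)}</{root_tag}>\n"
    )


if __name__ == "__main__":
    # Small sample with a tab-indented line.
    sample = "if (a < b) {\n\treturn a & b;\n}"
    print(wrap_in_xml(sample))
```

The result can be written to a file with an .xml extension and uploaded in place of the original source file.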

mado89 commented 3 years ago

I tried renaming it to .txt and wrapping it in XML, but both times the whitespace at the beginning of the lines was removed.

mpetris commented 3 years ago

Can you send an example file to the support address at https://catma.de/contact/? I tried the plain text import with a Java source code file and it seems to work fine.

mado89 commented 3 years ago

I just wanted to check whether my mail was received?

mpetris commented 3 years ago

@mado89 Yes, thanks. Your example contains single tab characters, each of which is rendered in the Annotate module as a single space. So the indentation is actually there, but it is less obvious. Source code editors usually render a tab as two or four spaces, and the width is often configurable. So if you replace each tab with four spaces, the indentation would be more obvious in the Annotate module.
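The tab replacement can be done before upload with a few lines of Python. This is only a minimal sketch: the function name `tabs_to_spaces` and the default width of four are assumptions taken from the suggestion above, not a CATMA feature.

```python
def tabs_to_spaces(text: str, width: int = 4) -> str:
    """Replace every tab character with `width` spaces.

    This is a plain one-to-one substitution, so tabs in the middle of
    a line are widened as well; it does not align to tab stops.
    """
    return text.replace("\t", " " * width)


if __name__ == "__main__":
    # One line indented with a tab, one with two tabs.
    sample = "\tint x = 1;\n\t\treturn x;"
    print(tabs_to_spaces(sample))
```

Note that after this conversion a line indented with a tab can no longer be told apart from one that was indented with four spaces, so it only fits if that distinction does not need to survive.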

mado89 commented 3 years ago

Thanks for the update and your help!