EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org

[FEATURE REQUEST] Make tabular output files tab-delimited #235

Closed ivagljiva closed 3 years ago

ivagljiva commented 3 years ago

Hello! I was wondering if you would be willing to change the tabular output formats (--tblout and --domtblout) to be tab-delimited rather than space-delimited. Right now, fields in these output files are separated by a variable number of spaces in each line, which aligns the columns nicely and looks very pretty, but is difficult to parse in downstream code. Converting these runs of spaces to a single tab between each field would not look as good, but it would make things much easier for programmers who work with these output files downstream.

For additional context, I am one of the current developers of anvi'o. We are big fans of your work and have been relying on HMMER in a variety of contexts in anvi'o. Sadly, our current parsing capabilities for HMMER tabular output are rather inadequate and we've lately been encountering more circumstances in which it just fails (for instance, here is one example). Of course we are hoping to fix these parsing issues on our end, but we thought it might be both easier and useful to the wider community to request this change from you all, since there must be plenty of other groups who are working directly with HMMER output and could benefit from the convenience of tab-delimited output files.

If this change seems reasonable to you, I'd be happy to implement it and open a PR. I just wanted to open a discussion with you all first to see if it is something you would even consider altering in your codebase. :)

Thanks for your consideration!

Iva

npcarter commented 3 years ago

Sean's the person who can comment on your feature request, though my take is that if we decided to implement it, we'd want to do so as an optional format to avoid breaking code that works with the current format.

Looking at the anvi'o GitHub page, it looks like a tool I should look into more, but I see that most of your code is written in Python. I've had good luck parsing tblout and domtblout files in Python by using the split() function to turn each line of text into a list of fields, via something like fields = line.split(). Split seems to deal fine with variable-length runs of spaces, such that I haven't seen any problems with it, though I haven't done a vast amount of work with these output file formats.

Could you comment on how you're trying to parse these output files?

-nick


ivagljiva commented 3 years ago

Thanks for your very prompt reply, @npcarter!

That is indeed how we currently parse the tblout files. Here's the relevant code snippet in case you are interested:

    for line_bytes in hmm_hits_file:
        line_counter += 1
        line = line_bytes.decode('ascii', 'ignore')

        if len(line) != len(line_bytes):
            lines_with_non_ascii.append(line_counter)
            detected_non_ascii = True

        if line.startswith('#'):
            if not clip_index_found and line.find('description') != -1:
                # This parser removes the description column from the data
                clip_description_index = line.find('description')
                clip_index_found = True

            continue

        with buffer_write_lock:
            merged_file_buffer.write('\t'.join(line[:clip_description_index].split()) + '\n')

Our issues stem from the fact that we currently try to remove the description column prior to splitting because that column can internally include spaces (so resulting files could have different numbers of fields in each line, which generally breaks things downstream). It doesn't work in all cases because the clip index may not be the same in each line. So clearly that is not such a good move on our part, and I am working on fixing it :) But I thought perhaps it would be easier to fix it at the source, if you all thought it was a good idea. I am looking forward to hearing Sean's opinion on it.
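To make the failure mode described above concrete, here is a minimal sketch. The rows are made up and much shorter than real --tblout lines, and the helper names are hypothetical. It assumes that a value wider than its column shifts every later field to the right, so an offset taken from the header no longer marks where the description starts.

```python
# Made-up, shortened rows in the spirit of --tblout (not real HMMER output).
FMT = "{:<14} {:<10} {:<6} {}"
header = FMT.format("# target name", "accession", "score", "description of target")
rows = [
    FMT.format("seq1", "-", "10.5", "ABC transporter"),
    # A name wider than its column pushes every later field to the right.
    FMT.format("a_much_longer_sequence_name", "-", "7.2", "hypothetical protein"),
]


def clip_by_offset(line, offset):
    """Cut at a fixed character offset, then split (the approach in the snippet above)."""
    return line[:offset].split()


def clip_by_field_count(line, n_fields):
    """Keep the first n_fields whitespace-separated fields, wherever they start."""
    return line.split(maxsplit=n_fields)[:n_fields]


offset = header.find("description")
by_offset = [clip_by_offset(row, offset) for row in rows]
by_count = [clip_by_field_count(row, 3) for row in rows]

print([len(fields) for fields in by_offset])  # field counts differ between rows
print([len(fields) for fields in by_count])   # field counts are the same for every row
```

Counting fields instead of characters keeps the number of columns constant regardless of how wide any single value is.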

> if we decided to implement it, we'd want to do so as an optional format to avoid breaking code that works with the current format.

That sounds just fine to me :)

meren commented 3 years ago

Thank you very much for bringing this up, @ivagljiva, and thank you @npcarter for listening! I'm posting this message mainly to make sure GitHub will keep me in the loop, but I thought I could comment on this:

> if we decided to implement it, we'd want to do so as an optional format to avoid breaking code that works with the current format

Thanks for being considerate of backwards compatibility. Perhaps adding a flag that lets the user explicitly ask for a simpler format could be a way to avoid any snafu. But of course Sean and others will know the best course of action.

We are big fans of your work, and thank you again.

cryptogenomicon commented 3 years ago

In Python,

    fields = line.split(maxsplit=3)

for example, will split line on whitespace into 3 columnar data fields, and leave the remainder of the line (i.e. free text description or whatnot) in fields[3].

Other languages and packages generally have equivalently easy ways to parse whitespace-delimited lines into fields + remaining free text.
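Applied to a --tblout data line, the same idea might look like the sketch below. It assumes that 18 whitespace-free columns precede the free-text description, which should be checked against the HMMER User's Guide for the version in use. The function name and the sample line are made up for illustration.

```python
TBLOUT_FIXED_FIELDS = 18  # assumption: 18 columns precede the description in --tblout


def parse_tblout_line(line, n_fixed=TBLOUT_FIXED_FIELDS):
    """Return (fields, description) for a data line, or None for comments and blank lines."""
    if line.startswith("#") or not line.strip():
        return None
    parts = line.split(maxsplit=n_fixed)
    if len(parts) < n_fixed:
        raise ValueError(f"expected at least {n_fixed} fields, got {len(parts)}")
    # Once maxsplit is reached, the remainder keeps its trailing whitespace, so strip it.
    description = parts[n_fixed].strip() if len(parts) > n_fixed else ""
    return parts[:n_fixed], description


# Made-up example line: 18 fixed fields followed by a description containing spaces.
fixed = ["seq1", "-", "query1", "-", "1e-10", "40.0", "0.1", "2e-10", "39.0",
         "0.1", "1.0", "1", "0", "0", "1", "1", "1", "1"]
example = "   ".join(fixed) + "  ABC transporter ATP-binding protein\n"

fields, description = parse_tblout_line(example)
print(len(fields), repr(description))
```

The description stays in one piece however many spaces it contains, and comment lines are skipped rather than parsed.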

I have strong reasons to prefer whitespace-delimited and column-aligned tabular output files. It is extremely important to make outputs both parsable and easily human readable. If outputs aren't human readable (and aren't checked by humans), analyses are artifact prone. Our tabular output formats are designed to be easily downsampled, sorted, filtered, and examined by hand, using basic command-line tools, not just parsed in automated pipelines.

That said, we plan to provide tab-delimited output options in HMMER4, since many people have requested this - even though I think you're all horribly wrong. I think the right solution is to know how to parse whitespace-delimited files, which we do routinely and easily.

meren commented 3 years ago

Thanks for the response, @cryptogenomicon.

> I have strong reasons to prefer whitespace-delimited and column-aligned tabular output files. It is extremely important to make outputs both parsable and easily human readable. If outputs aren't human readable (and aren't checked by humans), analyses are artifact prone.

I will share my 2 cents on this point only because you mentioned that we are "all horribly wrong". I thought it would have been almost rude to not bite.

I do agree that any program that operates on user data should make its outputs accessible to human reading. This is quite parallel to our philosophy: we want our users to have everything they need to scrutinize every result anvi'o generates without trouble. But there are better ways to achieve that than generating whitespace-delimited files so that things look aligned to the human eye. Like many software tools, anvi'o reports TAB-delimited files by default. But we also have global flags for human-readable output. For instance, any program that produces TAB-delimited files as output can also report that output as markdown-formatted content if the user simply includes the flag --as-markdown in the program call in the terminal. So, just as an example, if I include the flag --as-markdown when I run anvi-get-aa-counts on some mock data, I can literally copy-paste the output the program produces directly here (and it would be rendered nicely):

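| source | Ala | Arg | Asn | Asp | Cys | Gln | Glu | Gly | His | Ile | Leu | Lys | Met | Phe | Pro | STP | Ser | Thr | Trp | Tyr | Val |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bin_1 | 933 | 527 | 272 | 586 | 82 | 305 | 520 | 668 | 204 | 477 | 756 | 311 | 236 | 263 | 396 | 23 | 461 | 523 | 92 | 193 | 700 |
| Bin_2 | 328 | 160 | 255 | 218 | 70 | 203 | 259 | 323 | 88 | 524 | 569 | 347 | 157 | 248 | 131 | 14 | 323 | 259 | 28 | 255 | 361 |
| Bin_3 | 343 | 157 | 128 | 181 | 80 | 114 | 258 | 298 | 61 | 267 | 343 | 238 | 124 | 138 | 130 | 9 | 204 | 201 | 23 | 143 | 276 |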

If the purpose is to give the user a means to scrutinize things easily, this output is of course much more useful and readable than any whitespace-delimited output for certain media. For instance, if the target medium supports proper handling of markdown tables, others looking at the output can even sort or filter it by column, and so on. In addition to the on-the-fly markdown conversion, the user can display the same output in their terminal as an ASCII table if they wish. Here is a screenshot from my terminal:

image

Not to mention that basic command-line tools work just as easily with TAB-delimited output files as they do with files whose columns are separated by an arbitrary number of spaces. We are happy to improve these options when/if anyone makes a reasonable request. HMMER could indeed offer an option to produce its outputs as TAB-delimited files, but I understand that technically feasible solutions can't stand against the power of personal preferences.
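For readers curious what an on-the-fly TAB-to-markdown conversion of this kind could involve, here is a minimal sketch. It is not anvi'o's implementation of --as-markdown. The function name is hypothetical, and it assumes the first row is a header and that no cell contains a pipe character.

```python
def tsv_to_markdown(tsv_text):
    """Render tab-delimited text (first row = header) as a markdown pipe table."""
    rows = [line.split("\t") for line in tsv_text.splitlines() if line]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up input that reuses a few values from the table above.
markdown = tsv_to_markdown("source\tAla\tArg\nBin_1\t933\t527\n")
print(markdown)
```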

> I think the right solution is to know how to parse whitespace-delimited files

This is just one solution. But I don't see how it is the right one that makes everyone else horribly wrong. As the mighty upstream you can of course tell us to go fly a kite and we would do it. But the right solution is to write versatile code that can accommodate versatile needs without forcing downstream users or programmers to deal with personal preferences whenever possible. Which many do routinely and easily.

Best wishes,

snayfach commented 3 months ago

This issue was closed, but the feature request was not addressed.

A tab delimited format that could be easily read into a pandas dataframe alongside unique field names would be greatly appreciated by many.

Users shouldn't need to write custom parsers in Python.
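Until such an option exists, a conversion along these lines could serve as a stopgap. This is a sketch, not part of HMMER. The column names are invented here to be unique, and the layout (22 fixed columns followed by a free-text description in --domtblout) is an assumption to verify against the HMMER User's Guide.

```python
import csv
import io

# Hypothetical unique names for the assumed 23 --domtblout columns.
DOMTBLOUT_COLUMNS = [
    "target_name", "target_accession", "tlen", "query_name", "query_accession",
    "qlen", "full_evalue", "full_score", "full_bias", "dom_index", "dom_count",
    "c_evalue", "i_evalue", "dom_score", "dom_bias", "hmm_from", "hmm_to",
    "ali_from", "ali_to", "env_from", "env_to", "acc", "description",
]


def domtblout_to_tsv(lines, out):
    """Write data lines as TSV with a header row; return the number of rows written."""
    n_fixed = len(DOMTBLOUT_COLUMNS) - 1
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(DOMTBLOUT_COLUMNS)
    n_rows = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        parts = line.split(maxsplit=n_fixed)
        if len(parts) < n_fixed:
            raise ValueError(f"expected at least {n_fixed} fields, got {len(parts)}")
        description = parts[n_fixed].strip() if len(parts) > n_fixed else ""
        writer.writerow(parts[:n_fixed] + [description])
        n_rows += 1
    return n_rows


# Made-up input: one comment line and one data line.
fixed = ["seq1", "-", "350", "query1", "-", "120", "1e-20", "70.1", "0.2", "1", "2",
         "1e-15", "2e-12", "40.3", "0.1", "5", "110", "20", "130", "18", "135", "0.95"]
sample = ["# target name  accession ...\n",
          "  ".join(fixed) + "  two-word description\n"]
buffer = io.StringIO()
n_written = domtblout_to_tsv(sample, buffer)
print(buffer.getvalue())
```

If the output is written to a file instead of a StringIO buffer, it should load with pandas.read_csv(path, sep="\t").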

FilipeFerreiraMOCHSL commented 2 weeks ago

Say your output is "test.out". It's not ideal, but these shell commands may help; they worked for me:

# Saves the header
head -n2 test.out | tail -n1 | awk -F'  +' -v OFS='\t' '{sub(/ +$/,""); $1=$1}1' | sed 's/# //g' > header

# Edits some wrong delimiters
## for the "--domtblout" option
sed -i 's/tlen query/tlen\tquery/g' header; sed -i 's/acc description/acc\tdescription/g' header
## for the "--tblout" option
sed -i 's/exp reg clu/exp\treg\tclu/g' header; sed -i 's/ov env dom rep inc description of target/ov\tenv\tdom\trep\tinc\tdescription of target/g' header

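# Saves the results
# Caveat: every space becomes a tab here, so a multi-word description is split across several columns
sed '/^#/ d' test.out | tr -s ' ' | tr ' ' '\t' > body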

# Merges to the final file
cat header body > test.tsv; rm header body