Fixing "Line 1: Pacbio header line format error" problem by renaming error-corrected reads with falcon_name_fasta.pl

SchwarzEM commented 8 years ago

I previously encountered a problem (described in #249) where I could not assemble reads that had been error-corrected with PBcR-MHAP because they had had their names changed, and FALCON could not accept the altered names (and thus gave me the complaint, "Line 1: Pacbio header line format error"). With useful advice from Jason Chin, I devised a Perl script that will take an arbitrary FastQ reads file and rename its reads so that they are FALCON-usable, while leaving the sequence of the reads unchanged. It will also generate a name-conversion table, so that (if needed) identities of reads can be tracked.

I'm posting the script here, so that others don't have to devise the same script in order to overcome this problem.

Usage of the script is as described in its default help message:

    ./falcon_name_fasta.pl -h ;

Format: falcon_name_fasta.pl
    --infile|-i    <input stream/files>
    --outfile|-o   [output file name; default is (infile).outfile]
    --prefix|-p    [optional prefix; default is "falcon_read"]
    --table|-t     [table of old to new read names; default is "old2new_names.txt"]
    --help|-h      [print this message]

The script itself is as follows:

#!/usr/bin/env perl

# falcon_name_fasta.pl -- Erich Schwarz <ems394@cornell.edu>, 11/13/2015.
# Purpose: rename FASTA reads with format acceptable to FALCON; as side product, generate old-to-new name table.

use strict;
use warnings;
use Getopt::Long;

my $i            = 0;
my $prefix       = q{};
my $table        = q{};
my $infile       = q{};
my $outfile      = q{};
my $header       = q{};
my $orig_seqname = q{};
my @orig_seqs    = ();

my $data_ref;
my $help;

GetOptions ( 'infile=s'  => \$infile,
             'outfile:s' => \$outfile,
             'prefix:s'  => \$prefix,
             'table:s'   => \$table,
             'help'      => \$help,  );

if ( $prefix =~ /\W/xms ) {
    die "Cannot accept prefix with space or non-word characters (such as \".\"): $prefix\n";
}

if ( $outfile =~ /[^\w\.]/xms ) {
    die "Cannot accept outfile with space or non-standard characters: $outfile\n";
}

# Have human-readable default value:
$prefix ||= 'falcon_read';
$prefix = safename($prefix);

$outfile ||= "$infile.outfile";
$outfile = safename($outfile);

$table ||= 'old2new_names.txt';
$table = safename($table);

if ( $help or (! $infile ) ) { 
    die "Format: falcon_name_fasta.pl\n",
        "    --infile|-i    <input stream/files>\n",
        "    --outfile|-o   [output file name; default is (infile).outfile]\n",
        "    --prefix|-p    [optional prefix; default is \"falcon_read\"]\n",
        "    --table|-t     [table of old to new read names; default is \"old2new_names.txt\"]\n",
        "    --help|-h      [print this message]\n",
}

# Accept either a stream from '-' or a standard file.
my $INFILE;
if ($infile eq '-') {
    # Special case: get the stdin handle
    $INFILE = *STDIN{IO};
}
else {
    # Standard case: open the file
    open $INFILE, '<', $infile or die "Can't open input file $infile. $!\n";
}

open my $OUTFILE, '>', $outfile;
open my $TABLE, '>', $table;

while (my $input = <$INFILE>) { 
    chomp $input;
    if ( $input =~ /\A > (\S+) /xms ) { 
        $orig_seqname = $1;
        if ( exists $data_ref->{'orig_seqname'}->{$orig_seqname} ) { 
            die "Redundant sequence name: $orig_seqname\n";
        }
        $data_ref->{'orig_seqname'}->{$orig_seqname}->{'seen'} = 1;
        push @orig_seqs, $orig_seqname;
    }
    elsif ( $input =~ / \A \s* [A-Za-z] /xms ) { 
        $input =~ s/\s//g;
        if ( $input =~ / [^ACGTNacgtn] /xms ) { 
            die "Can't parse: $input\n";
        }
        $data_ref->{'orig_seqname'}->{$orig_seqname}->{'sequence'} .= $input;
    }
    else { 
        if ( $input !~ /\A \s* \z/xms ) { 
            die "Can't parse: $input\n";
        }
    }
}
close $INFILE;

foreach my $orig_seq1 ( @orig_seqs ) {
    $data_ref->{'orig_seqname'}->{$orig_seq1}->{'length'} 
        = length( $data_ref->{'orig_seqname'}->{$orig_seq1}->{'sequence'} );
}

my $seq_count = @orig_seqs;
my $DIGITS = length($seq_count);

foreach my $orig_seq2 (@orig_seqs) { 
    $i++;
    my $serial_no = $i;
    my $sf_format = '%0' . $DIGITS . 'u';
    $serial_no = sprintf($sf_format, $serial_no) or die "Can't zero-pad serial number $serial_no\n";

    my $new_name = $prefix . q{/} . $serial_no . q{/0_} . $data_ref->{'orig_seqname'}->{$orig_seq2}->{'length'};

    print $OUTFILE '>' . "$new_name\n";
    print $TABLE "$orig_seq2\t$new_name\n";
    my @output_lines 
        = unpack("a60" 
                 x ($data_ref->{'orig_seqname'}->{$orig_seq2}->{'length'}/60 + 1), 
                 $data_ref->{'orig_seqname'}->{$orig_seq2}->{'sequence'});
    foreach my $output_line (@output_lines) { 
        if ($output_line =~ /\S/) { 
            print $OUTFILE "$output_line\n";
        }
    }
}

close $OUTFILE;
close $TABLE;

sub safename {
    my $_filename = $_[0];
    my $_orig_filename = $_filename;
    if (-e $_orig_filename) {
        my $_suffix1 = 1;
        $_filename = $_filename . ".$_suffix1";
        while (-e $_filename) {
            $_suffix1++;
            $_filename =~ s/\.\d+\z//xms;
            $_filename = $_filename . ".$_suffix1";
        }
    }
    return $_filename;
}

Attached version is here:

falcon_name_fasta.pl.txt

SchwarzEM commented 8 years ago

And, after I posted this, I realized naming the script "falcon_name_fastq.pl" rather than "...fasta.pl" would have made considerably more sense! Please feel free to correct my naming. It doesn't affect the script's operation; I've used it and it worked (at least in my hands).

pb-cdunn commented 8 years ago

Seems pretty useful. When we have more time next week, we'll copy that into a user-contributions directory. Thanks!

SchwarzEM commented 8 years ago

Great!

When you get a chance, please run the following one-liner (or whatever Pythonic etc. equivalent you would prefer) on the file that I submitted:

    cat falcon_name_fasta.pl.txt | perl -ne ' s/falcon_name_fasta/falcon_name_fastq/g; print; ' > falcon_name_fastq.pl ;

That should give you a Perl script with a Perl-ish file suffix, in which my typo of 'fasta' has been corrected both in the filename and in the script itself.

Juke34 commented 4 years ago

Thank you for your script @SchwarzEM it works well on fasta. I have not tried it on fastq

SchwarzEM commented 4 years ago

Wow! Good to see that the script is still useful, 4.5 years later...

PacificBiosciences / FALCON

Fixing "Line 1: Pacbio header line format error" problem by renaming error-corrected reads with falcon_name_fasta.pl #251