fstrozzi / Fastool

Simple and quick FastQ and FastA tool for file reading and conversion

Fastool does not properly parse SRA files #5

Open tsackton opened 9 years ago

tsackton commented 9 years ago

When processing SRA RNA-seq FASTQ files with Fastool as part of the Trinity package, Fastool appends /H to the end of each sequence ID, which then causes errors downstream in Trinity.

Here are the first few lines of an SRA file: https://gist.github.com/tsackton/8c5508a4b60a1e33f6f2

When I run fastool --to-fasta --illumina-trinity sra_test.fq > sra_test.1.fa, the output headers look like this:

SRR488565.1/H
SRR488565.2/H
SRR488565.3/H
SRR488565.4/H
SRR488565.5/H
SRR488565.6/H

If I remove everything after the first space in the sra example (with seqtk seq -C), the output is normal:

SRR488565.1
SRR488565.2
SRR488565.3
SRR488565.4
SRR488565.5
SRR488565.6

The /H files do not work with Trinity, while the normal files after seqtk seq -C processing do.

This was tested with the latest version of fastool, compiled on CentOS 6 with gcc 4.8.2.
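For readers without seqtk at hand, the header trimming described above can be sketched in Python. This is only an illustration, not what seqtk or Fastool do internally: the function name strip_fastq_descriptions is hypothetical, the input is assumed to be strict 4-line FASTQ (no wrapped sequences), and the '+' line is reduced to a bare '+' as a simplifying choice. The sample description text in the usage lines is made up.

```python
from typing import Iterable, Iterator


def strip_fastq_descriptions(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with each header cut at the first whitespace.

    Assumes strict 4-line records: header, sequence, '+' line, qualities.
    """
    for index, raw in enumerate(lines):
        line = raw.rstrip("\n")
        position = index % 4
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {index + 1}: expected '@' header, got {line!r}")
            # Keep only the read ID, e.g. '@SRR488565.1'.
            yield line.split(None, 1)[0]
        elif position == 2:
            # Drop any repeated header text after '+' (a choice made for this sketch).
            yield "+"
        else:
            # Sequence and quality lines pass through unchanged.
            yield line


# Usage with a made-up record (the description text is hypothetical):
record = [
    "@SRR488565.1 made-up description length=4\n",
    "ACGT\n",
    "+SRR488565.1 made-up description length=4\n",
    "IIII\n",
]
trimmed = list(strip_fastq_descriptions(record))
```

With the description removed, the header no longer contains the extra fields that appear to confuse the --illumina-trinity parsing.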

fstrozzi commented 9 years ago

Hi, this is due to the SRA file header. The --illumina-trinity option called by Trinity was meant to be used with Illumina FastQ files with their typical header. In this case, a quick workaround is to first run fastool on its own on the R1 and R2 datasets with these options:

fastool --append /1 --to-fasta SRA_1.fastq > SRA_1_fixed.fastq
fastool --append /2 --to-fasta SRA_2.fastq > SRA_2_fixed.fastq

Then start Trinity with the "fixed" files; this should work.
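To make the intended result of that workaround concrete, here is a rough Python sketch of a FASTQ-to-FASTA conversion that appends a mate suffix to each read ID. It is not Fastool's implementation: the function name fastq_to_fasta is hypothetical, strict 4-line FASTQ input is assumed, and dropping the header description is a choice of this sketch rather than documented Fastool behaviour.

```python
from typing import Iterable, Iterator, List


def fastq_to_fasta(lines: Iterable[str], suffix: str) -> Iterator[str]:
    """Yield FASTA lines ('>ID<suffix>' then sequence) from 4-line FASTQ records."""
    record: List[str] = []
    for raw in lines:
        record.append(raw.rstrip("\n"))
        if len(record) == 4:
            header, sequence = record[0], record[1]
            if not header.startswith("@"):
                raise ValueError(f"expected '@' header, got {header!r}")
            # Keep the read ID only (assumption: description is not needed),
            # then add the mate suffix such as '/1' or '/2'.
            read_id = header[1:].split(None, 1)[0]
            yield f">{read_id}{suffix}"
            yield sequence
            record = []
    if record:
        raise ValueError("input ended in the middle of a FASTQ record")


# Usage with a made-up record (the description text is hypothetical):
reads = [
    "@SRR488565.1 made-up description length=4\n",
    "ACGT\n",
    "+\n",
    "IIII\n",
]
left = list(fastq_to_fasta(reads, "/1"))
right = list(fastq_to_fasta(reads, "/2"))
```

The point is that both mates end up with the same read ID and differ only in the /1 or /2 suffix, which is the pairing information the workaround supplies to Trinity.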

nickxzshi commented 7 years ago

Thank you very much for solving the problem I was having.