alanlorenzetti / frtc

An automatic pipeline for prokaryotic RNA-Seq Analysis
GNU General Public License v3.0
1 stars 0 forks source link

mmr error #1

Closed fcgportal closed 6 years ago

fcgportal commented 6 years ago

Hi, Thanks for providing such a amazing pipeline! I am trying to use your pipeline to process public RNAseq data such as SRR6880860. But after alignment, mmr keep running error, saying "ERROR: Header information incomplete!". Do you know how this happened, and how can I resolve it? Thank you again.

alanlorenzetti commented 6 years ago

Hi. Thank you for the kind words.

I'm sorry for the late response, I was trying to process the library you have mentioned. Well, I couldn't reproduce the issue you have experienced. Every step was performed. My guess it's something related to the software versions. This pipeline checks if the programs are installed, but doesn't have any version check. If you are not using the latest version of Samtools, I would start by installing it.

Before debugging this issue, I would suggest you to read my report, because the approach might not be the best for your data.

To download the lib I used:

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR688/SRR6880860/SRR6880860.sra

The SRA file was converted to FASTQ using:

fastq-dump --split-3 SRR6880860.sra

I had to change the script, because the pipeline was conceived to process prokaryotic RNA-Seq data (i.e. no splicing). Sorry, but I forgot to mention this on the documentation, so I'm going to add a disclaimer. To allow splicing, I removed the "--no-spliced-alignment" from hisat2 call.

The directories were created as explained in the docs, and the following adap.fa file was created:

>firstAdapt
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>secondAdapt
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

I'm pasting the log file below:

From Raw To Coverage (frtc):
A tool to process Illumina RNASeq data.
Version: 0.6.0
Last update: 20180827
Sat Sep 15 17:04:04 -03 2018
Call: frtc.sh 18 1000 50 Hsapiens ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.fna.gz
Checking dependencies...
Done!
Step 1: Starting Trimming
Trimming SRR6880860
Done!
Step 2: Starting Alignment to ref. Genome
Downloading genome
Downloading annotation
Building HISAT2 index
Done!
Aligning SRR6880860 (paired)
31968195 reads; of these:
  31968195 (100.00%) were paired; of these:
    1917945 (6.00%) aligned concordantly 0 times
    26277632 (82.20%) aligned concordantly exactly 1 time
    3772618 (11.80%) aligned concordantly >1 times
94.00% overall alignment rate
Aligning SRR6880860 (unpaired R1)
2085567 reads; of these:
  2085567 (100.00%) were unpaired; of these:
    294791 (14.13%) aligned 0 times
    1408111 (67.52%) aligned exactly 1 time
    382665 (18.35%) aligned >1 times
85.87% overall alignment rate
Aligning SRR6880860 (unpaired R2)
1168564 reads; of these:
  1168564 (100.00%) were unpaired; of these:
    102964 (8.81%) aligned 0 times
    782696 (66.98%) aligned exactly 1 time
    282904 (24.21%) aligned >1 times
91.19% overall alignment rate
Done!
Step 3: Filtering uniquely aligned reads
Done!
Step 4.1: Converting to BAM and sorting by read name
Done!
Step 4.2: Converting uniquely aligned to BAM and sorting
Done!
Step 5: Adjusting position of multi aligned reads
samtools for writing subprocess terminated successfully
[bam_sort_core] merging from 0 files and 18 in-memory blocks...
Done!
Step 6.1: Creating coverage files
[bam_sort_core] merging from 0 files and 18 in-memory blocks...
[bam_sort_core] merging from 0 files and 18 in-memory blocks...
SRR6880860 has 7259717 pair(s) containing reads that do(es) not match each other. Removing.
Done!
Step 6.2: Creating coverage files for uniquely aligned reads
Done!
Step 7: Creating TSSAR input files
Done!
Step 8: Creating five prime profiling files
Done!
Step 9: Creating three prime profiling files
Done!

Everything worked, but the removal of 7259717 pairs from the final alignment file was unexpected. This step is performed to remove inconsistent pairs as reported here before, but they are usually just a few (1/10000).

I checked a few of those pairs and they are really strange:

SRR6880860.426  83  NT_187388.1 171126  1   51M =   125967  -45210  TTCCTTTGGTCGCTCGCTCCTCTCCTACTTGGATAACTGTGGTAATTCTAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDDDDD AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51 YS:i:0  YT:Z:CP XS:A:-  NH:i:10
SRR6880860.426  163 NT_187388.1 171048  1   51M =   171126  129 CAAAGATTAAGCCATGCATGTCTAAGTACGCACGGCCGGTACAGTGAAACT DDDDDIIIHHHIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIHIIIII AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51 YS:i:0  YT:Z:CP XS:A:-  NH:i:10

Field#9 of the first entry reports a fragment size of -45210, which would not be possible using the constraint of maximum insert size of 1000. This may represent a possibility I didn't consider when working with small prokaryotic genomes, and therefore you may need to adapt this pipeline to work with eukaryotic data.

Maybe the coverage_uniq is still suitable for your purpose, however I strongly suggest not using the results generated by the multi-mapper resolution approach until you figure out what is happening to these pairs.

I'm just curious if the pipeline works for eukaryotic RNA-Seq data, so tomorrow I will check the transcript signal to evaluate if the coverage_uniq makes sense.

Kind regards, Alan.

alanlorenzetti commented 6 years ago

I have inspected the coverage_uniq and bam_uniq files (i.e. signal and alignments of reads aligning on a single position on the genome) and although the processing was performed as expected, the signal is extended between exon regions (spanning introns). See the screenshot below:

srr6880860

Maybe this can be adjusted using additional parameters in deepTools, but I would not recommend to use this pipeline as it is to process eukaryotic data. Modifications are indeed necessary to inspect the transcriptional signal. Despite that, I think the bam_uniq files are ready to input in downstream analysis such as differential expression profiling.

Also, note that this library was prepared using NEBNext® Ultra™ II Directional RNA Library Prep Kit for Illumina®, which provides sequenced reads corresponding to the reverse complement of the sample RNA (dUTP-based, or similar, not sure). In this way, if you not provide this information to the software (which is the default behavior of this pipeline), you will have to "invert" the signal tracks as I did in this IGV session (see screenshot above) and also inform further applications relying on these files.

Well, these are my considerations concerning this topic. I hope they helped you to understand a few aspects of the workflow. I'm sorry, but you will have to adapt the pipeline for your own purpose.

These files are available at my gdrive for a limited time if you would like to download them.

Kind regards, Alan.

fcgportal commented 6 years ago

Thank you so much for your reply. It's really helpful for me. I am trying to do some modification and adapt this pipeline to process human RNA-Seq data. Thank you again.