cougarlj / COMPSRA

COMPSRA: a COMprehensive Platform for Small RNA-Seq data Analysis
https://regepi.bwh.harvard.edu/circurna/
GNU General Public License v3.0

STAR removed most of the reads as short #20

Open sagarutturkar opened 3 years ago

sagarutturkar commented 3 years ago

I ran the QC and alignment modules on my own data with hg38. My original reads are 75 bp single-end. After QC, most (>90%) of the reads are still longer than 60 bp.

#The length distribution of the trimmed reads is listed:
--
#Length | Count | Percentage
62 | 317304 | 1.01%
63 | 825262 | 2.62%
64 | 1288995 | 4.09%
65 | 4962251 | 15.74%
66 | 14012034 | 44.45%
67 | 9822387 | 31.16%
However, after STAR alignment, I get:

Parameter | Count
Number of input reads | 31525929
Average input read length | 65
UNIQUE READS: |
Uniquely mapped reads number | 957201
Uniquely mapped reads % | 3.04%
Average mapped length | 61
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 0.28%
Deletion rate per base | 0.01%
Deletion average length | 1
Insertion rate per base | 0.00%
Insertion average length | 1.21
MULTI-MAPPING READS: |
Number of reads mapped to multiple loci | 2632509
% of reads mapped to multiple loci | 8.35%
Number of reads mapped to too many loci | 0
% of reads mapped to too many loci | 0.00%
UNMAPPED READS: |
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 88.61%
% of reads unmapped: other | 0.00%
CHIMERIC READS: |
Number of chimeric reads | 0
% of chimeric reads | 0.00%

About 88% of the reads are reported as "% of reads unmapped: too short". Do you have any suggestions?

cougarlj commented 3 years ago

Dear Sagarutturkar,

If your data is miRNA sequencing data, the read length distribution after the QC module should look something like this:

#The length distribution of the trimmed reads is listed:
--
#Length | Count | Percentage
17 | 317304 | 1.01%
18 | 825262 | 2.62%
19 | 1288995 | 4.09%
20 | 4962251 | 15.74%
21 | 14012034 | 44.45%
22 | 9822387 | 31.16%

So I suspect the adapter was not removed from the original reads. Please check which library preparation kit was used in the experiment and provide the correct adapter sequence on the command line.
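One quick way to test this before re-running QC is to count how many raw reads still contain the start of the suspected adapter. The following is only a minimal sketch: the function name `adapter_fraction`, the 10-base seed length, and the two demo reads are assumptions made for illustration. The adapter shown is the Illumina TruSeq Small RNA 3' adapter, used here purely as an example; the correct sequence depends on the kit.

```python
from itertools import islice

def adapter_fraction(fastq_lines, adapter, seed_len=10):
    """Fraction of reads that contain the first `seed_len` bases of `adapter`.

    `fastq_lines` is any iterable of FASTQ lines (4 lines per record).
    """
    seed = adapter[:seed_len]
    total = hits = 0
    # In FASTQ, the sequence is the 2nd line of every 4-line record.
    for seq in islice(fastq_lines, 1, None, 4):
        total += 1
        if seed in seq:
            hits += 1
    return hits / total if total else 0.0

# Example adapter only (assumption): check your own kit's documentation.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"

# Two made-up records: the first carries the adapter after a 22-nt insert.
demo_fastq = [
    "@r1", "ACGTTGCAACGTTGCAACGTAA" + ADAPTER, "+", "I" * 43,
    "@r2", "ACGTACGTACGTACGTACGTACGT", "+", "I" * 24,
]

print(adapter_fraction(demo_fastq, ADAPTER))
```

If a large fraction of the raw reads contains the seed while the trimmed reads are still close to full length, the adapter given to the QC step was probably not the one used in the library.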

Best Wishes, Jiang Li

sagarutturkar commented 3 years ago

Thank you for your quick response. This data is a mix of multiple types of small RNAs (microRNA, snoRNA, snRNA, tRNA, etc.).

After I specified the correct adapter sequence, the "too short" problem was resolved. However, for several samples, a very high percentage of reads is reported as "mapped to multiple loci". I suspect these are mostly short reads (<30 bp) belonging to miRNAs.

I had previously run mirDeep2 for miRNA detection. It performs a read-collapsing step, which in turn helps keep the number of multi-mapping reads low.

From mirDeep2:

Option '-m' will collapse the reads to remove redundancy and decrease the file size. 
A sequencing read seen 10 times in your raw file will occur only once in the collapsed 
file and have a _x10 in its identifier.
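To make the idea concrete, here is a minimal sketch of read collapsing. It is not mirDeep2's implementation: the function name `collapse_reads`, the `seq` prefix, and the toy reads are assumptions, and only the `_xN` count suffix follows the description quoted above.

```python
from collections import Counter

def collapse_reads(sequences, prefix="seq"):
    """Collapse identical reads into (identifier, sequence) pairs.

    Each identifier ends in _xN, where N is how often the read occurred.
    Reads are emitted from most to least abundant.
    """
    counts = Counter(sequences)
    return [
        (f"{prefix}_{i}_x{n}", seq)
        for i, (seq, n) in enumerate(counts.most_common())
    ]

# Made-up reads: one seen 10 times, one 3 times, one once.
reads = ["TTTGGG"] * 10 + ["ACGTAC"] * 3 + ["GGGCCC"]

collapsed = collapse_reads(reads)
for name, seq in collapsed:
    print(f">{name}\n{seq}")
```

Fourteen input reads become three FASTA records, and the original abundance can still be recovered from the identifier.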

Does COMPSRA perform any such collapsing step during miRNA analysis?

cougarlj commented 3 years ago

Dear sagarutturkar,

COMPSRA doesn't have a collapsing step. To my knowledge, the reasons for reads being "mapped to multiple loci" may be:

  1. Some miRNAs have multiple copies on different chromosomes.
  2. Some piRNAs overlap with miRNAs.
  3. Some miRNA families are difficult to tell apart (members may differ by only one bp).
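The third point can be illustrated with a small sketch. The two reference sequences below are made up (they are not real miRNAs), and `hamming` and `ambiguous_hits` are hypothetical helper names. The point is only that, once one mismatch is tolerated, a read from one family member also matches a sibling that differs by a single base.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def ambiguous_hits(read, references, max_mismatches=1):
    """Names of all same-length references within `max_mismatches` of `read`."""
    return [
        name
        for name, seq in references.items()
        if len(seq) == len(read) and hamming(read, seq) <= max_mismatches
    ]

# Made-up family members that differ at exactly one position.
toy_refs = {
    "member_a": "ACGTTGCAACGTTGCAACGTAA",
    "member_b": "ACGTTGCAACGTTGCAACGTGA",
}

read = "ACGTTGCAACGTTGCAACGTAA"  # identical to member_a
print(ambiguous_hits(read, toy_refs, max_mismatches=0))
print(ambiguous_hits(read, toy_refs, max_mismatches=1))
```

With exact matching the read is assigned to one member only; allowing a single mismatch makes it match both.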

In particular, miRNAs with the prefix "let-" are often troublesome, so you may want to look at those first. I hope this helps.

Best Wishes, Jiang Li